ReCo! finds gRNA read counts (ReCo) in fastq files and runs as a standalone script or as a Python package. It can be used for single and combinatorial CRISPR-Cas libraries that have been sequenced with single-end or paired-end sequencing strategies. ReCo works with conventionally cloned CRISPR-Cas libraries and 3Cs/3Cs-MPX libraries.
Provided with only a sample sheet and gRNA sequences, ReCo can process multiple samples in a single run. It automatically determines the constant regions flanking the gRNAs, and utilizes Cutadapt to trim the fastq files. To determine the abundance of each gRNA, the resulting sequences are aligned to the gRNA library using Bowtie2.
If ReCo! is additionally provided with an appropriate 3Cs or 3Cs-MPX template vector map (e.g., in SnapGene format or as a .fasta file), it automatically quantifies the abundance of the restriction enzyme recognition site of the template placeholder sequence.
ReCo! provides the read counts as a .csv file and generates a graphical summary of library statistics in .pdf and .png format. For troubleshooting unexpected results, ReCo provides detailed log files and reports all sequences that did not align to the library.
ReCo was developed by Martin Wegner in Prof. Manuel Kaulich's group (GEG) at the Institute of Biochemistry II (IBCII) in the Hospital of the Goethe University, Frankfurt am Main, Germany and is available under the terms of the MIT license.
ReCo! was developed on Ubuntu 20.04 LTS using Python 3.8. To set up Python 3 on your machine, please visit the Python 3 documentation and follow the installation instructions. Trimming is performed with Cutadapt 2.8, which is included in Ubuntu distributions. Aligning the fastq sequences to your gRNA library of interest is performed with Bowtie2 2.3.0. Newer versions seem to produce unexpected results. I am currently looking into this.
Both Cutadapt and Bowtie2 must be available from the system PATH.
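Before starting a run, it can help to confirm that both tools are actually reachable. The following is a minimal sketch using only the Python standard library; the executable names `cutadapt` and `bowtie2` are assumptions based on how these tools are usually installed, and `find_missing_tools` is a hypothetical helper, not part of ReCo.

```python
import shutil

# Assumed executable names; adjust if your installation uses different ones.
REQUIRED_TOOLS = ("cutadapt", "bowtie2")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of tools that cannot be found on the system PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("cutadapt and bowtie2 are both available.")
```

This only checks that the executables can be found; it does not verify that the installed versions match the ones named above.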
If you use ReCo!, please cite Cutadapt, Bowtie2, and ReCo!:
Cutadapt: https://doi.org/10.14806/ej.17.1.200
Bowtie2: https://doi.org/10.1038/nmeth.1923
ReCo!: https://academic.oup.com/bioinformatics/article/39/8/btad448/7229558
Install directly from GitHub but make sure to have Cutadapt 2.8 and Bowtie2 2.3.0 available on your system:
pip install git+https://github.com/MaWeffm/reco.git#egg=reco
ReCo works both as a script that can be invoked from the command line and as a Python package.
Use ReCo from the command line:
$ ReCo cli --s sample_sheet.xlsx --o /home/user/ReCo_output/ -j 15 -r
--s: path to the sample sheet (required)
--o: path to the output directory (required)
-j 15: use 15 cores (optional, default is 1)
-r: remove all intermediate files upon successful termination (optional, default is True)
Import ReCo in Python and print its version:
>>> import reco
>>> reco.__version__
'0.0.1'
Create a ReCo object, provide a sample sheet file, an output dir, set logging and multiprocessing options. Run and remove all unnecessary files:
>>> r = reco.ReCo(sample_sheet_file="sample_sheet.xlsx", output_dir="/home/user/reco_output/")
>>> r.run(remove_unused_files=True, cores=15)
2022-08-22 20:49:34 INFO: Starting ReCo 0.0.1 at 2022-08-22 20:49:34
2022-08-22 20:49:35 INFO: Sample 1: OK!
2022-08-22 20:49:35 INFO: Sample 2: OK!
2022-08-22 20:49:35 INFO: Sample 3: OK!
2022-08-22 20:49:35 INFO: Sample 4: OK!
...
2022-08-22 21:22:23 INFO: Finished: 2022-08-22 21:22:23 (in: 0:32:48.165831)
The sample sheet contains all samples and can be in .xlsx, .csv, .tsv, or .txt format. In .csv files, the field separator must be a comma. In .tsv and .txt files, the field separator must be a tab (\t).
The first row of the sample sheet file must be a header containing the column names given in parentheses below. After that, each row represents a sample.
The first column (Sample name) contains the sample name. Try to use meaningful names; your future self will be grateful!
The second column (Sample type) contains the type of sample. A single sample requires one fastq file and one library file. A paired sample requires two fastq files resulting from paired-end sequencing, and two library files.
The third column (Vector) contains the path to a vector file in one of the following formats: .dna, .gb, .gbk, .fa, .fasta, or .txt.
The fourth and fifth columns (FastQ read 1, FastQ read 2) contain paths to fastq files. The fastq files can be read compressed (.fastq.gz) or uncompressed (.fastq). For a sample of type single, use only one of the two columns.
The sixth and seventh columns (Lib 1, Lib 2) contain paths to library files in one of the following formats: .xlsx, .csv, .tsv, or .txt. For a sample of type single, use only one of the two columns.
The eighth column (Expected reads) contains the expected number of reads.
The last column (Emails) can optionally contain a list of email addresses to which the results are sent.
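As an illustration of the layout described here, the sketch below renders a comma-separated sample sheet with the standard `csv` module. The column names are taken from the description; the exact spelling of the sample type values (`single`, `paired`), all sample names, and all paths are hypothetical placeholders, and `sample_sheet_csv` is an illustrative helper rather than part of ReCo.

```python
import csv
import io

# Column names as given in the column description of this README.
COLUMNS = [
    "Sample name", "Sample type", "Vector",
    "FastQ read 1", "FastQ read 2",
    "Lib 1", "Lib 2", "Expected reads", "Emails",
]


def sample_sheet_csv(rows):
    """Render rows (dicts keyed by column name) as comma-separated text.

    Columns missing from a row are written as empty fields.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical samples: one single sample and one paired sample.
rows = [
    {"Sample name": "screen_day0", "Sample type": "single",
     "FastQ read 1": "/data/day0_R1.fastq.gz",
     "Lib 1": "/data/library_1.csv", "Expected reads": 1000000},
    {"Sample name": "combi_day14", "Sample type": "paired",
     "FastQ read 1": "/data/day14_R1.fastq.gz",
     "FastQ read 2": "/data/day14_R2.fastq.gz",
     "Lib 1": "/data/library_1.csv", "Lib 2": "/data/library_2.csv",
     "Expected reads": 2000000},
]

if __name__ == "__main__":
    print(sample_sheet_csv(rows), end="")
```

Writing the returned text to a file ending in .csv gives a sheet with a comma as the field separator, as required for that format.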
The library file contains all gRNA sequences for a sample.
It must not contain a header. Each row represents a gRNA. The first column contains the unique gRNA name. The second column contains the gRNA sequence. All gRNA sequences must be written in the same direction (forward or reverse). In case of duplicated names or sequences, ReCo will automatically keep only the first occurrence and log a warning.
The vector file is optional and contains template vector information in one of the following formats: .dna (SnapGene), .fasta, .fa, .gb, .gbk, or .txt. If a vector file is provided, ReCo assumes that this is a 3Cs template vector and tries to find the template restriction enzyme recognition site to quantify its abundance (see 3Cs and 3Cs-MPX for details). If a DNA sequence containing the letters A, C, G, and T is provided, ReCo will try to find template information in this sequence. If the field is left empty, ReCo assumes that the library was generated conventionally and skips determining the template sequence from the vector file.
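To preview which entries would be affected by the duplicate handling described above, a library can be checked before a run. This is a minimal sketch of a keep-first-occurrence rule, not ReCo's actual implementation; `deduplicate_library` is a hypothetical helper and the gRNA names and sequences are made up for illustration.

```python
def deduplicate_library(rows):
    """Keep the first occurrence of each gRNA name and each gRNA sequence.

    rows: iterable of (name, sequence) pairs in file order.
    Returns (kept, dropped), both as lists of (name, sequence) pairs.
    """
    seen_names, seen_sequences = set(), set()
    kept, dropped = [], []
    for name, sequence in rows:
        # A row is a duplicate if its name OR its sequence was seen before.
        if name in seen_names or sequence in seen_sequences:
            dropped.append((name, sequence))
            continue
        seen_names.add(name)
        seen_sequences.add(sequence)
        kept.append((name, sequence))
    return kept, dropped


# Made-up example library: the third row repeats a name,
# the fourth row repeats a sequence.
library = [
    ("geneA_1", "ACGTACGTACGTACGTACGT"),
    ("geneA_2", "TTGCATGCATGCATGCAAGG"),
    ("geneA_1", "GGGGCCCCAAAATTTTGGCC"),
    ("geneB_1", "ACGTACGTACGTACGTACGT"),
]

if __name__ == "__main__":
    kept, dropped = deduplicate_library(library)
    print(f"kept {len(kept)} gRNAs, dropped {len(dropped)} duplicates")
```

Whether ReCo compares sequences case-sensitively is not stated here, so the sketch compares them exactly as written.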
ReCo generates a log file in the specified output folder summarizing all runs:
/output_dir/reco_date.log
For each sample, ReCo creates a sub folder in the specified output folder and generates multiple result files:
/output_dir/sample_name/report.txt
Provides a summary of all important parameters and trimming/alignment rates.
/output_dir/sample_name/ReCo_[samplename].log
A detailed log file containing all parameters, settings, and outputs (including those from Cutadapt and Bowtie2). Helpful for troubleshooting in case of unexpected results.
/output_dir/sample_name/[samplename]_final_guidecounts.csv
This is the file containing the read counts for all library gRNAs or gRNA combinations of two libraries.
/output_dir/sample_name/[samplename]_failed_gRNAs.csv
This file contains all sequences that ReCo could not align to the library. Helpful for troubleshooting.
/output_dir/sample_name/[samplename]_top100_failed_sequences.csv
This file contains only the top 100 sequences that ReCo could not align to the library. This small file is useful for quick troubleshooting. If trimming or alignment rates are low, try to align these sequences to other libraries or double-check the homology sequence that ReCo determined from your fastq files.
/output_dir/sample_name/[samplename]_qc_panel.pdf and [samplename]_qc_panel.png
These two files contain a plot panel visualizing properties of the sequenced library.
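After a run, the per-sample files listed above can be checked programmatically. The sketch below assumes that the sub folder is named exactly after the sample and that `[samplename]` in the file names is replaced by that same name; both are assumptions, and `expected_outputs` and `missing_outputs` are hypothetical helpers, not part of ReCo. The .png panel is left out of the check.

```python
from pathlib import Path


def expected_outputs(output_dir, sample_name):
    """Return the per-sample result files described in this README."""
    sample_dir = Path(output_dir) / sample_name
    return [
        sample_dir / "report.txt",
        sample_dir / f"ReCo_{sample_name}.log",
        sample_dir / f"{sample_name}_final_guidecounts.csv",
        sample_dir / f"{sample_name}_failed_gRNAs.csv",
        sample_dir / f"{sample_name}_top100_failed_sequences.csv",
        sample_dir / f"{sample_name}_qc_panel.pdf",
    ]


def missing_outputs(output_dir, sample_name):
    """Return the expected result files that do not exist on disk."""
    return [path for path in expected_outputs(output_dir, sample_name)
            if not path.is_file()]


if __name__ == "__main__":
    # Hypothetical output directory and sample name.
    for path in missing_outputs("/home/user/reco_output", "screen_day0"):
        print(f"missing: {path}")
```

If files are reported as missing even though the run finished, the naming assumptions above are the first thing to check against the actual contents of the output folder.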