FastqWiper is a Python application that wipes out badly formatted reads from readable FASTQ files. More complex workflows, such as recovering corrupted fastq.gz files, dropping or fixing pesky lines, removing unpaired reads, and fixing read interleaving, can be executed using Snakemake and the preconfigured pipeline files provided here.
- Compatibility: Python <3.9
- OS: Windows (excluding pipelines), Linux, Mac OS
- Contributions: [email protected]
- Pypi: https://pypi.org/project/fastqwiper
- Conda: https://anaconda.org/bfxcss/fastqwiper
- Docker Hub: available soon
- Bug report: https://github.com/mazzalab/fastqwiper/issues
FastqWiper alone can be installed from either Conda or PyPI and runs on all the operating systems listed above.
Create and activate an empty Conda environment, if not already available.
$ conda create -n FastqWiper python=3.8
$ conda activate FastqWiper
then install FastqWiper either from Conda:
$ conda install -y -c bfxcss -c conda-forge fastqwiper
or from PyPI:
$ pip install fastqwiper
fastqwiper <options>
options:
--fastq_in TEXT The input FASTQ file to be cleaned [required]
--fastq_out TEXT The wiped FASTQ file [required]
--log_frequency INTEGER The number of processed reads after which a status message is printed
It accepts as input, and produces as output, readable *.fastq or *.fastq.gz files.
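To make concrete what a badly formatted read can look like, here is a minimal Python sketch of a check on a single four-line FASTQ record. It is not FastqWiper's actual implementation: the function name `is_well_formed` and the exact rules (header starting with `@`, separator starting with `+`, sequence restricted to the letters A, C, G, T, N, and a quality string as long as the sequence) are assumptions chosen for illustration.

```python
import re

# Assumed alphabet for the sequence line; real data may contain other IUPAC codes.
SEQ_PATTERN = re.compile(r"^[ACGTN]+$")


def is_well_formed(record):
    """Return True if a 4-line FASTQ record passes some basic format checks."""
    if len(record) != 4:
        return False
    header, seq, sep, qual = (line.rstrip("\n") for line in record)
    return (
        header.startswith("@")
        and bool(SEQ_PATTERN.match(seq))
        and sep.startswith("+")
        and len(qual) == len(seq)
    )


good = ["@read1", "ACGTN", "+", "IIIII"]
bad = ["@read2", "ACGT", "+", "II"]  # quality string shorter than the sequence
print(is_well_formed(good), is_well_formed(bad))
```

A cleaner built on such a check would keep the records for which it returns True and drop the others.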
To enable the use of preconfigured pipelines, you need to install Snakemake. The recommended way to install Snakemake is via Conda, because it enables Snakemake to handle software dependencies of your workflow. However, the default conda solver is slow and often hangs. Therefore, we recommend installing Mamba as a drop-in replacement via
$ conda install -c conda-forge mamba
and then creating and activating a clean environment as above:
$ mamba create -c conda-forge -c bioconda -n FastqWiper snakemake
$ conda activate FastqWiper
$ conda install colorama click
$ conda install mamba -c conda-forge
Clone the FastqWiper repository:
git clone https://github.com/mazzalab/fastqwiper.git
It contains, in particular, a folder data containing the fastq files to be processed, a folder pipeline containing the released pipelines, and a folder fastq_wiper with the source files of FastqWiper.
Input files to be processed should be copied into the data folder. Software packages that are used by the pipelines but not fetched from Conda should be copied into the root directory of the cloned repository, although this is not strictly mandatory.
Currently, the following packages are required to run the FastqWiper pipelines but are not available from Conda:
- gzrt
- BBTools
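To build gzrt inside the root directory of the cloned repository:
$ cd fastqwiper
$ git clone https://github.com/arenn/gzrt.git
$ cd gzrt
$ make
$ cd ..
To unpack the BBTools (BBMap) archive inside the root directory of the cloned repository:
$ cd fastqwiper
$ tar -xvzf BBMap_(version).tar.gz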
- Personalize a pipeline. Using fix_wipe_pairs_reads.smk requires you to edit line 3 of the file with the names of the fastq files, stored in the data folder, that you want to process. If the files were:
excerpt_S1_R1_001.fastq.gz
excerpt_S1_R2_001.fastq.gz
sample_S1_R1_001.fastq.gz
sample_S1_R2_001.fastq.gz
the SAMPLES list should be: SAMPLES = ["sample", "excerpt"]
- Get a dry run of a pipeline (e.g., fix_wipe_pairs_reads.smk):
snakemake -s pipeline/fix_wipe_pairs_reads.smk --use-conda --cores 2 -np
- Generate the planned DAG:
snakemake -s pipeline/fix_wipe_pairs_reads.smk --dag | dot -Tpdf > dag.pdf
- Run the pipeline (n.b., during the first execution, Snakemake will download and install some required remote
packages and may take longer). The number of computing cores can be tuned accordingly:
snakemake -s pipeline/fix_wipe_pairs_reads.smk --use-conda --cores 2
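The SAMPLES list needed to personalize the pipeline can be derived from the file names rather than typed by hand. The following is a minimal Python sketch, not part of FastqWiper: the helper `sample_names` is hypothetical, and the naming scheme `<sample>_S<number>_R1_001.fastq.gz` / `<sample>_S<number>_R2_001.fastq.gz` is an assumption generalized from the four example files. The order of the names is assumed not to matter.

```python
import re

# Assumed naming scheme, generalized from the example files:
# <sample>_S<number>_R1_001.fastq.gz and <sample>_S<number>_R2_001.fastq.gz
PAIRED_NAME = re.compile(r"^(?P<sample>.+)_S\d+_R[12]_001\.fastq\.gz$")


def sample_names(file_names):
    """Collect the distinct sample prefixes from paired-end file names."""
    samples = set()
    for name in file_names:
        match = PAIRED_NAME.match(name)
        if match:  # names that do not follow the scheme are ignored
            samples.add(match.group("sample"))
    return sorted(samples)


files = [
    "excerpt_S1_R1_001.fastq.gz",
    "excerpt_S1_R2_001.fastq.gz",
    "sample_S1_R1_001.fastq.gz",
    "sample_S1_R2_001.fastq.gz",
]
SAMPLES = sample_names(files)
print(SAMPLES)
```

In practice the `files` list could come from `os.listdir("data")`, and the printed list can be pasted into line 3 of the pipeline file.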
Fixed files will be written to the data folder, and their names will carry the suffix _fixed_wiped_paired_interleaving.
As a reminder, the fix_wipe_pairs_reads.smk pipeline performs the following actions:
- execute gzrt on corrupted fastq.gz files (i.e., files that cannot be unzipped because of errors) to recover readable reads;
- execute fastqwiper on the recovered reads to make them compliant with the FASTQ format (source: Wikipedia);
- execute Trimmomatic on the wiped reads to remove residual unpaired reads;
- execute BBmap (repair.sh) on the paired reads to restore the correct interleaving and sort the fastq files.
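The first action applies to fastq.gz files that cannot be unzipped because of errors. As an illustration of how such a file can be spotted before running the pipeline, here is a minimal Python sketch that only detects the problem; recovery is still left to gzrt. The function name `is_readable_gzip` is hypothetical, and this is not how the pipeline itself decides what to do.

```python
import gzip
import os
import tempfile
import zlib


def is_readable_gzip(path, chunk_size=1 << 20):
    """Return True if the whole gzip file decompresses without errors."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read to the end in chunks so large files are not held in memory.
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        return False
    return True


# Demonstration on a small synthetic file and a truncated copy of it.
with tempfile.TemporaryDirectory() as tmp:
    intact = os.path.join(tmp, "intact.fastq.gz")
    with gzip.open(intact, "wb") as out:
        out.write(b"@read1\nACGT\n+\nIIII\n" * 1000)

    truncated = os.path.join(tmp, "truncated.fastq.gz")
    with open(intact, "rb") as src, open(truncated, "wb") as dst:
        dst.write(src.read()[:-20])  # drop the tail to simulate corruption

    print(is_readable_gzip(intact), is_readable_gzip(truncated))
```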
Using fix_wipe_single_reads.smk requires you to make the same edits as above. This pipeline will not execute Trimmomatic and BBmap's repair.sh.
- Get a dry run of a pipeline (e.g., fix_wipe_single_reads.smk):
snakemake -s pipeline/fix_wipe_single_reads.smk --use-conda --cores 2 -np
- Generate the planned DAG:
snakemake -s pipeline/fix_wipe_single_reads.smk --dag | dot -Tpdf > dag.pdf
- Run the pipeline (n.b., during the first execution, Snakemake will download and install some required remote
packages and may take longer). The number of computing cores can be tuned accordingly:
snakemake -s pipeline/fix_wipe_single_reads.smk --use-conda --cores 2
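When the pipelines are launched from a script, the dry-run and run commands above differ only in the pipeline file, the number of cores, and the trailing `-np`. The following is a small Python sketch that assembles them; the helper `snakemake_command` is hypothetical and uses only the flags already shown in this document.

```python
def snakemake_command(pipeline, cores=2, dry_run=False):
    """Build the snakemake invocation shown above as an argument list."""
    command = ["snakemake", "-s", pipeline, "--use-conda", "--cores", str(cores)]
    if dry_run:
        command.append("-np")  # -n: dry run, -p: print the shell commands
    return command


cmd = snakemake_command("pipeline/fix_wipe_single_reads.smk", cores=4, dry_run=True)
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the root directory of the cloned repository, with the FastqWiper environment active.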
Laboratory of Bioinformatics
Fondazione IRCCS Casa Sollievo della Sofferenza
Viale Regina Margherita 261 - 00198 Roma IT
Tel: +39 06 44160526 - Fax: +39 06 44160548
E-mail: [email protected]
Web page: http://www.css-mendel.it
Web page: http://bioinformatics.css-mendel.it