Skip to content

A workflow to recover corrupted fastq.gz files, drop wrong lines and remove unpaired reads

License

Notifications You must be signed in to change notification settings

bioinformatics-lab/fastqwiper

 
 

Repository files navigation

FastqWiper

Build status GitHub issues

FastqWiper is a Python application that wipes out badly formatted reads from readable FASTQ files.

More complex workflows, as recover corrupted fastq.gz, dropping or fixing pesky lines, removing unpaired reads, and fixing reads interleaving, can be executed using Snakemake and the preconfigured pipeline files provided here.

Installation

FastqWiper alone can be installed using both Conda and PyPi and runs smoothly on all OS specified above.

Anaconda or Miniconda

Anaconda-Server Badge Anaconda-Server Badge Anaconda-Server Badge Anaconda-Server Badge

Create and activate an empty Conda environment, if not already available.

$ conda create -n FastqWiper python=3.8
$ conda activate FastqWiper

then
$ conda install -y -c bfxcss -c conda-forge fastqwiper

Pypi

PyPI version

pip install fastqwiper

Usage

fastqwiper <options>

options:
  --fastq_in TEXT          The input FASTQ file to be cleaned  [required]
  --fastq_out TEXT         The wiped FASTQ file                [required]
  --log_frequency INTEGER  The number of processed reads that you want to print a status message after

It accepts in input and outputs readable *.fastq or *.fastq.gz files.

Snakemake

To enable the use of preconfigured pipelines, you need to install Snakemake. The recommended way to install Snakemake is via Conda, because it enables Snakemake to handle software dependencies of your workflow. However, the default conda solver is slow and often hangs. Therefore, we recommend installing Mamba as a drop-in replacement via

$ conda install -c conda-forge mamba

and then creating and activating a clean environment as above:

$ mamba create -c conda-forge -c bioconda -n FastqWiper snakemake
$ conda activate FastqWiper
$ conda install colorama click
$ conda install mamba -c conda-forge

Usage

Clone the FastqWiper repository:

git clone https://github.com/mazzalab/fastqwiper.git.

It contains, in particular, a folder data containing the fastq files to be processed, a folder pipeline containing the released pipelines and a folder fastq_wiper with the source files of FastqWiper.
Input files to be processed should be copied into the data folder. All software packages not fetched from Conda and used by the pipelines should be copied, even if it is not strictly mandatory, in the root directory of the cloned repository.

Currently, to run the FastqWiper pipelines, the following packages are not included in Conda but are required:

required packages:

gzrt (install instructions)

BBTools (install instructions)

$ cd fastqwiper
$ git clone https://github.com/arenn/gzrt.git
$ cd gzrt
$ make
$ cd ..
$ cd fastqwiper
$ tar -xvzf BBMap_(version).tar.gz

Commands:

Paired-end files

  • Personalize a pipeline. Using fix_wipe_pairs_reads.smk requires you to edit line 3 of the file with the name of the fastq files stored in data folder that you want to process. If the files were:
excerpt_S1_R1_001.fastq.gz
excerpt_S1_R2_001.fastq.gz
sample_S1_R1_001.fastq.gz
sample_S1_R2_001.fastq.gz

the SAMPLE vector should be: SAMPLES = ["sample", "excerpt"]

  • Get a dry run of a pipeline (e.g., fix_wipe_pairs_reads.smk):
    snakemake -s pipeline/fix_wipe_pairs_reads.smk --use-conda --cores 2 -np

  • Generate the planned DAG:
    snakemake -s pipeline/fix_wipe_pairs_reads.smk --dag | dot -Tpdf > dag.pdf

  • Run the pipeline (n.b., during the first execution, Snakemake will download and install some required remote packages and may take longer). The number of computing cores can be tuned accordingly:
    snakemake -s pipeline/fix_wipe_pairs_reads.smk --use-conda --cores 2

Fixed files will be copied in the data folder and will be suffixed with the string _fixed_wiped_paired_interleaving. We remind that the fix_wipe_pairs_reads.smk pipeline performs the following actions:

  • execute gzrt on corrupted fastq.gz files (i.e., that cannot be unzipped because of errors) and recover readable reads;
  • execute fastqwiper on recovered reads to make them compliant with the FASTQ format (source: Wipipedia)
  • execute Trimmomatic on wiped reads to remove residual unpaired reads
  • execute BBmap (repair.sh) on paired reads to fix the correct interleaving and sort fastq files.

Single-end files

Using fix_wipe_pairs_reads.smk requires you to make the same edits as above. This pipeline will not execute trimmomatic and BBmap's repair.sh.

  • Get a dry run of a pipeline (e.g., fix_wipe_single_reads.smk):
    snakemake -s pipeline/fix_wipe_single_reads.smk --use-conda --cores 2 -np

  • Generate the planned DAG:
    snakemake -s pipeline/fix_wipe_single_reads.smk --dag | dot -Tpdf > dag.pdf

  • Run the pipeline (n.b., during the first execution, Snakemake will download and install some required remote packages and may take longer). The number of computing cores can be tuned accordingly:
    snakemake -s pipeline/fix_wipe_single_reads.smk --use-conda --cores 2

Author

Tommaso Mazza
Tweeting

Laboratory of Bioinformatics
Fondazione IRCCS Casa Sollievo della Sofferenza
Viale Regina Margherita 261 - 00198 Roma IT
Tel: +39 06 44160526 - Fax: +39 06 44160548
E-mail: [email protected]
Web page: http://www.css-mendel.it
Web page: http://bioinformatics.css-mendel.it

About

A workflow to recover corrupted fastq.gz files, drop wrong lines and remove unpaired reads

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 98.9%
  • Other 1.1%