- revisited unitig mode with improved filter for repeat regions
- dazzling proovread - experimental support for DALIGNER as unitig mapper (faster/more sensitive than blasr)
- much faster and more sensitive due to bwa mem as default mapper
- increased speed and contiguity by using unitigs in correction
- optimized for HiSeq & MiSeq reads
- compressed BAM intermediates
- efficient threading up to 20 cores and more
- …
```bash
# fresh install
git clone --recursive https://github.com/BioInf-Wuerzburg/proovread
cd proovread/util/bwa
make

# update existing install to latest version
cd proovread/
git pull
git submodule update --recursive
cd util/bwa
make clean   # important for changes to the bwa source code to take effect
make
```
NOTE: proovread comes with its own, modified version of bwa. Using it with a standard bwa build will fail.
- Perl 5.10.1 or later
- Log::Log4perl
- File::Which (see #17)
- NCBI Blast-2.2.24+ or later
- samtools-1.1 or later
proovread is distributed with precompiled binaries of SHRiMP2 and BLASR. If you want to use your own installation of these mappers, have a look at the Advanced Configuration section.
Test your installation by running proovread on the included sample data set.
```bash
proovread --sample --pre pr-sample
```
Don’t run proovread on entire SMRT cells directly; it will only blast your memory and take forever. Split your data into handy chunks of a few Mbp first:
```bash
# located in /path/to/proovread/bin
SeqChunker -s 20M -o pb-%03d.fq pb-subreads.fq
```
Run proovread on one chunk first.
```bash
proovread -l pb-001.fq -s reads.fq [-u unitigs.fa] --pre pb-001
```
If things go smoothly, submit the rest. See Log and statistics for some notes on how to interpret logging information.
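If the chunks follow the naming scheme above, a simple shell loop is enough to process the remaining ones. This is a minimal sketch for a sequential run on a single machine; on a cluster you would wrap the proovread call in your scheduler's submit command:

```bash
# correct the remaining chunks one after the other
for chunk in pb-*.fq; do
    [ "$chunk" = "pb-001.fq" ] && continue     # skip the chunk corrected in the test run
    pre=${chunk%.fq}                           # output prefix, e.g. pb-002
    proovread -l "$chunk" -s reads.fq --pre "$pre"
done
```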
Just to give you a hands-on idea of what to expect from correction, here are some stats from the latest correction I ran: one PacBio cell of a 50 Mb heterozygous eukaryote genome (I will add some more numbers on other data sets and correction tools soon).
|               | raw         | proovread   | lordec     |
|---------------|-------------|-------------|------------|
| Sequences     | 56,877      | 55,493      | 53,676     |
| Total (bp)    | 315,511,633 | 236,687,413 | 99,379,617 |
| Longest (bp)  | 27,874      | 24,682      | 13,917     |
| Shortest (bp) | 1,000       | 1,000       | 1,000      |
| N50 (bp)      | 7,592       | 6,236       | 1,877      |
| N90 (bp)      | 2,887       | 1,934       | 1,141      |
proovread has primarily been designed to correct PacBio subreads. You get these reads either from PacBio’s SMRT-Portal or by using dextract from Gene Myers’ PacBio assembler DAZZLER, which I would recommend.
In general, reads can be provided in FASTQ or FASTA format. Quality information is used, but only has minor advantages. More valuable is the subread information given in default PacBio IDs, which, if available, is utilized by proovread’s ccseq module to improve correction performance. Reads shorter than 2x the mean short read length will be ignored.
It is also possible to feed other types of erroneous sequences to proovread, e.g. contigs, 454 reads, … However, keep in mind that the alignment model for mappings has been optimized for PacBio reads and may produce artifacts in other scenarios. We are currently working on a version optimized for Oxford Nanopore data.
For correction of long reads, proovread needs high-coverage short read data. Typically these are HiSeq (75-150 bp) and MiSeq (200-300 bp) reads, with overlapping libraries merged (e.g. with FLASh) for best performance. 454 or PacBio CCS reads can also be used.
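Merging of overlapping paired-end libraries is done before correction. A minimal FLASh sketch could look like the following; the read file names are placeholders and the maximum overlap should be adjusted to your read length and insert size:

```bash
# merge overlapping read pairs; merged reads end up in sr.extendedFrags.fastq
flash -M 150 -o sr reads_1.fq reads_2.fq
# use the merged reads (plus any remaining unmerged ones) as short read input for proovread
```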
Reads need to be in FASTQ/A format and may differ in length. Pairing information is not used. Using quality-trimmed or error-corrected reads can improve results.
The recommended coverage for short read data is around 30-50X and should be specified with `--coverage`. If you have less coverage, it is definitely still worth running proovread; however, contiguity will likely suffer. Internally, proovread samples subsets for the different iterations, by default 15X for the initial runs and 30X for the finishing run. To customize these rates, see `sr-coverage` in proovread’s config (Advanced Configuration).
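To pick a sensible value for `--coverage`, you can estimate the short read coverage from the total number of sequenced bases and your approximate genome size. A minimal sketch; the 50 Mb genome size is just an example value:

```bash
# coverage = total short read bases / genome size (here: 50 Mb)
awk 'NR % 4 == 2 {bases += length($0)} END {print bases / 50000000 "X"}' reads.fq
```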
In addition to short reads, unitigs can (and should) be used for correction, in particular for large data sets (eukaryotes). Unitigs are high-confidence assembly fragments produced by, for example, ALLPATHS, Meraculous2 or the Celera Assembler. In contrast to contigs, unitigs do not extend past any conflict in the underlying short read data, which makes them highly reliable.
There are two huge advantages of using pre-computed unitigs:
- Contiguity: unitigs are longer than the corresponding short reads, which makes them easier to align and gives a better chance to also correct difficult regions.
- Speed: during unitig computation, all redundancy is removed from the data, creating a minimal set that can be aligned much faster.
However, unitigs only cover regions without conflicts in short read data space. To correct PacBio reads in full length, these gaps need to be corrected with primary short read data.
Currently, support for DAZZLER/DALIGNER is considered experimental. To use dazzler instead of blasr, either export the paths or set `daligner-path` and `dazz-db-path` in the config, and invoke proovread with the modes `sr+dazz-utg` / `mr+dazz-utg`. In the current implementation, only a single instance of dazzler is invoked; threading is therefore determined by the thread setup with which daligner has been compiled (default 4).
Since proovread is designed to operate on BAM/SAM, daligner output is, for the time being, internally converted to SAM using a simple parser script (`dazz2sam`). This script also works as a stand-alone tool for dazzler-to-SAM conversion (`proovread/bin/dazz2sam --help`), which might come in handy if you want to visualize dazzler mappings in common alignment viewers like IGV or Tablet.
```bash
# extract unitigs from allpaths assembly
allpathslg/bin/Fastb2Fasta IN=reads.unibases.k96 OUT=unitigs.fa
```
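The extracted unitigs are then passed to proovread alongside the short reads via `-u`, as in the chunk example above:

```bash
proovread -l pb-001.fq -s reads.fq -u unitigs.fa --pre pb-001
```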
By default, proovread generates six files in the output folder:
| File           | Content                                                                  |
|----------------|--------------------------------------------------------------------------|
| .trimmed.f[aq] | high-accuracy PacBio reads, trimmed for uncorrected/low-quality regions  |
| .untrimmed.fq  | complete corrected PacBio reads including un-/poorly corrected regions   |
| .ignored.tsv   | IDs of reads and the reason for excluding them from correction           |
| .chim.tsv      | annotations of potential chimeric joints clipped during trimming         |
| .parameter.log | the parameter set used for this run                                      |
If you are interested in mappings (BAM) and other intermediary files from the iterations, have a look at `--keep-temporary`.
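Since samtools is a dependency anyway, kept BAM intermediates can be inspected with standard samtools commands. A small sketch; the file name below is hypothetical and depends on your `--pre` and the respective task:

```bash
# file name and location are examples only
samtools flagstat pb-001/bwa-mr-1.bam          # mapping statistics of one iteration
samtools view pb-001/bwa-mr-1.bam | less -S    # look at individual alignments
```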
The phred scores produced by proovread derive from short read support of each base during correction. The values are scaled to realistically mimic sequencing phred accuracies:
| Phred | Accuracy (%) | p33 |
|-------|--------------|-----|
| 40    | 99.99        | I   |
| 30    | 99.90        | ?   |
| 20    | 99.00        | 5   |
| 10    | 90.00        | +   |
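The relation between the three columns is the standard phred definition: accuracy = 1 - 10^(-Q/10), and the p33 character is simply chr(Q + 33). A quick sanity check from the shell:

```bash
# print accuracy (%) and phred+33 character for a few phred values
perl -e 'printf "Q%-3d %6.2f%%  %s\n", $_, 100*(1-10**(-$_/10)), chr($_+33) for 40, 30, 20, 10'
```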
proovread generates a comprehensive log on STDERR. This includes fully functional system calls for the scripts/tools invoked by proovread. That way, if something goes wrong, it's easy to rerun a certain task individually and take a closer look at the issue.
If you want to analyze how things are going and whether there might be problems with sensitivity etc., the most important information is the `Masked: xx%` value reported after each iteration.
```
grep -P 'Running mode|ked :|ning task' proovread.log
[Mon Jan 26 09:52:05 2015] Running mode: blasr-utg
[Mon Jan 26 09:52:51 2015] Running task blasr-utg
[Mon Jan 26 10:00:32 2015] Masked : 55.3%
[Mon Jan 26 10:00:32 2015] Running task bwa-mr-1
[Mon Jan 26 10:21:45 2015] Masked : 76.2%
[Mon Jan 26 10:28:14 2015] Running task bwa-mr-2
[Mon Jan 26 10:37:55 2015] Masked : 92.2%
[Mon Jan 26 10:39:46 2015] Running task bwa-mr-finish
[Mon Jan 26 10:51:19 2015] Masked : 93.0%
```
Masked regions are regions that have already been corrected at high confidence, minus some edge fraction, which remains unmasked in order to serve as seeds for subsequent iterations. After the first iteration, you should have a masking percentage > 50-75%, strongly depending on quality, type and coverage of your data. With each iteration, this value should increase.
Prior to the final iteration, all data is unmasked, and the final iteration is run with strict settings on the entirely unmasked data. The percentage obtained in this last iteration can be slightly lower than in the previous one and roughly equals the fraction of read bases that will make it into the high-confidence .trimmed.fq output.
proovread comes with a comprehensive configuration, which allows tuning down to the algorithm's core parameters. A custom configuration template can be generated with `--create-cfg`. Instructions on the format etc. can be found inside the template file.
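A typical workflow is to generate the template once, adjust the parameters you care about (e.g. `sr-coverage` or mapper paths), and point proovread at the edited file. A sketch; the `-c/--cfg` option name for passing a custom config is an assumption here, check `proovread --help` for the exact flag:

```bash
proovread --create-cfg                 # generate a config template (see the template header for details)
# edit the template, e.g. sr-coverage, mapper paths, ...
proovread -c my-proovread.cfg -l pb-001.fq -s reads.fq --pre pb-001   # -c/--cfg is assumed
```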
proovread has been designed with low-memory cluster node architectures in mind. Peak memory is mainly controlled by the amount of long reads provided. With chunks of less than 20 Mbp, it easily runs on an 8 GB RAM machine.
In theory, proovread can simply be parallelized by increasing `--threads`. However, there are single-threaded steps and other bottlenecks, which at some point make it more efficient to run e.g. 4 instances with 8 threads each in parallel to make full use of a 32-CPU machine.
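As a rough sketch of that pattern on a single 32-CPU machine, assuming chunked input as above, you could drive 4 concurrent proovread instances with 8 threads each via xargs:

```bash
# 4 concurrent proovread jobs with 8 threads each (4 x 8 = 32 CPUs)
ls pb-*.fq | xargs -P 4 -I{} sh -c 'proovread -l {} -s reads.fq --threads 8 --pre $(basename {} .fq)'
```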
One might expect that proovread results are deterministic - meaning reproducible in identical form if input data is identical. This, however, is not the case in a couple of steps:
bwa employs heuristics that allow for slightly different results in repeated runs. In particular, one feature is prone to generate differences when employed in proovread’s iterative strategy: for performance reasons, bwa encodes nucleotides using 2 bits only, meaning bwa only has a four-letter alphabet `[ATGC]`. Other bases, including the `NNNN` stretches used for masking by proovread, are converted into random `[ATGC]` strings. This in particular affects alignments at the margins of masked regions:
```
orig    | ATGAATTGGTTAATCTGC
masked  | ATGAATTGGTNNNNNNNN
read    |    AATTGGTTAAT

rand-01 | ATGAATTGGTAGCCATGG
        |    |||||||
aln-01  |    AATTGGT

rand-02 | ATGAATTGGTTTATCTGC
        |    |||||||| ||
aln-02  |    AATTGGTTAAT
```
Whenever there are decisions to make for sorted lists in combination with a fixed number of items to keep/remove, things get non-deterministic if identical values occur in the sorting fields. In proovread, this for example affects filtering of "best alignments" in bins (localized scoring context).
50-50 ratios in base calling are resolved by randomly choosing one of the alternatives, which avoids a systematic bias towards either one.
The algorithm and implementation are described in detail in the proovread paper.

[view proovread mechanism poster](media/proovread-poster.pdf)
proovread does local score comparison rather than using a single hard cut-off. bwa-proovread is modified in the same fashion: `proovread.[ch]` extend bwa with an implementation of proovread’s binning algorithm, and reporting of alignments is determined by score comparison within bins. That way, repeat alignments are filtered early on, increasing performance and largely reducing disk space requirements.
If you use proovread, please cite:
proovread: large-scale high accuracy PacBio correction through iterative short read consensus. Hackl, T.; Hedrich, R.; Schultz, J.; Foerster, F. (2014).
Please also acknowledge the authors of the software packages employed by proovread:
Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly. Li H. (2012) (bwa)
Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. Mark J Chaisson; Glenn Tesler. (2012)
SHRiMP: Accurate Mapping of Short Color-space Reads. Stephen M Rumble; Phil Lacroute; Adrian V. Dalca; Marc Fiume; Arend Sidow; Michael Brudno. (2009)
If you have any questions, encounter problems or potential bugs, don’t hesitate to contact us. Either report issues on GitHub or write an email to:
- Thomas Hackl - [email protected]
- Frank Foerster - [email protected]