
Diginorm in context of reference based transcriptome analysis #69

Open
drtamermansour opened this issue Nov 12, 2015 · 1 comment

@drtamermansour
Member

I am summarizing my experience with digital normalization (diginorm) in reference-based RNA-seq analysis, along with possible future directions. I went through three use cases (a command-line sketch of the relevant steps follows the list):

  1. Diginorm of a group of samples in one experiment with k=20 and coverage 20x, followed by using all of the output reads as one sample to build a single transcriptome with Cufflinks. I compared the results to the output of the classical Cufflinks pipeline, where we assemble a transcriptome for each sample and then merge them all into one final transcriptome using Cuffmerge. I made two main observations:
    • Good thing: With diginorm we catch transcripts that we miss with the classical Cufflinks pipeline. Apparently these are low-abundance transcripts that digital normalization enriches for. Theoretically we could obtain the same results by merging all input reads and running them through Cufflinks as a single sample, but this becomes computationally unfeasible once you reach a couple of GBs of sequencing. Most probably this is not important for differential expression analysis, but I would assume it will be extremely important for non-coding RNA analysis (I will explain later).
    • Bad thing: We miss a lot of exon-exon junctions of highly covered genes, because diginorm discards those informative junction-spanning reads while enriching for reads from primary transcripts.
  2. Diginorm of samples belonging to the same experiment (as in case 1) at only 10x coverage, then pooling all reads from all experiments (eight experiments in total). This is a scaling-up of the first use case, and such an integrative analysis is almost impossible without an idea like diginorm. You can view the resultant assembly by following this link. Yes, we miss genes found by other assemblies, but we find new genes/isoforms and extend the UTRs of several gene models. Simply put, it is a new way of visualizing the data. Of course we need to add specific filters to remove predicted primary transcripts (I am already working on something), but I think this can turn into something really useful.
  3. Diginorm of single deeply sequenced samples at k=20 and C=200x. YES, 200x, I mean it. The output is AMAZING. With classical Cufflinks, a 20GB sample failed to finish (several times) after 7 days on 32 processors. After 200x normalization of the mapped reads (~16GB), which retained ~40% of the reads, Cufflinks took 13 minutes to finish the job. The assembly is very comparable to those of the other samples in the same experiment made with the typical pipeline. This needs to be repeated with samples that I know can finish Cufflinks, so that I can compare the effect of diginorm on the same sample at different coverages. You can easily see how this would cut the analysis time of RNA-seq assemblies, especially with the deep-coverage experiments expected to come. Two technical points should be highlighted here:
    • With the current implementation of diginorm, we require huge RAM for these deeply sequenced samples, which adds a needless computational limitation, while reference-based experiments "usually" do not require that much RAM. As I discussed before with @ctb, a more RAM-conservative data structure would broaden the applicability of diginorm in many situations.
    • Diginorm favors retention of unmappable reads, so it is much better to run diginorm on the mapped reads only. This means adding an extra step to extract the mapped reads into FASTQ files and then re-mapping after diginorm; having diginorm read BAM (or SAM) files directly would save these two steps. I remember discussions about making a reference-based diginorm, which could be another option, but one more reason to encourage developing a BAM parser is to allow selective retention of exon-junction reads (which can be recognized by the CIGAR string in the BAM file). This would solve the problem seen in points 1 and 2 and might allow diginorm to perform an otherwise completely unfeasible integrative analysis.
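
For concreteness, here is a minimal command-line sketch of these use cases. It is not the exact pipeline I ran: the file names are placeholders, the khmer flags (-k, -C, -p, -o) are the standard ones for normalize-by-median.py, and the last step only illustrates the proposed junction-aware retention, using the fact that spliced alignments carry an 'N' operation in the CIGAR string (column 6 of SAM):

    # Interleave the paired-end reads so khmer can keep pairs together
    interleave-reads.py reads_1.fastq.gz reads_2.fastq.gz -o reads.pe.fq

    # Digital normalization as in use case 1 (k=20, C=20x);
    # for use case 3, raise the cutoff to -C 200
    normalize-by-median.py -k 20 -C 20 -p reads.pe.fq -o reads.keep.fq

    # Workaround for diginorm's retention of unmappable reads: keep only
    # mapped reads (-F 4 drops unmapped records), convert back to FASTQ,
    # normalize, then re-map afterwards
    samtools view -b -F 4 sample.bam > sample.mapped.bam
    samtools fastq sample.mapped.bam > sample.mapped.fq

    # Proposed junction-aware retention: spliced reads carry an 'N' in the
    # CIGAR string and could be set aside and kept unconditionally
    samtools view sample.mapped.bam | awk '$6 ~ /N/' > junction_reads.sam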
@drtamermansour
Member Author

To follow up on this idea, I decided to find a deeply sequenced RNA sample to test the time and computational resources required for analysis using the new HISAT2/StringTie software that replaces TopHat2/Cufflinks. The aim was to document how diginorm could improve such a process.

Materials:

I searched the ENCODE project for a deeply sequenced sample and found one prepared from the cytoplasmic fraction of independent growths of the SK-N-SH cell line. It is a PE101 Illumina HiSeq RNA-seq library from rRNA-depleted poly-A+ RNA. The sample has 241,123,024 PE reads.
https://www.encodeproject.org/files/ENCFF000INM/@@download/ENCFF000INM.fastq.gz
https://www.encodeproject.org/files/ENCFF000INT/@@download/ENCFF000INT.fastq.gz
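
A minimal sketch for fetching the two mates and sanity-checking the read count (four FASTQ lines per read):

    wget https://www.encodeproject.org/files/ENCFF000INM/@@download/ENCFF000INM.fastq.gz
    wget https://www.encodeproject.org/files/ENCFF000INT/@@download/ENCFF000INT.fastq.gz

    # should print 241123024 for each mate
    echo $(( $(zcat ENCFF000INM.fastq.gz | wc -l) / 4 ))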

Analysis pipeline

  1. Adaptor trimming with Trimmomatic v0.36
    (ILLUMINACLIP:$TRIM/adapters/TruSeq3-PE.fa:2:30:10:1 SLIDINGWINDOW:4:2 MINLEN:50)
  2. Alignment to the human genome using HISAT2
    The genome index was obtained from the HISAT2 website. A version that includes genomic variants and transcriptome sequences from Ensembl GRCh38 was selected.
  3. Transcriptome assembly using StringTie (a command sketch of all three steps follows this list)
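
A hedged sketch of the three steps as shell commands. Only the Trimmomatic parameters are verbatim from above; the file names, thread counts, and index basename (genome_snp_tran, for the GRCh38 snp+tran index distributed on the HISAT2 site) are assumptions on my part:

    # 1) Adaptor trimming (parameters as listed above)
    java -jar trimmomatic-0.36.jar PE -threads 8 \
        ENCFF000INM.fastq.gz ENCFF000INT.fastq.gz \
        R1.paired.fq.gz R1.unpaired.fq.gz R2.paired.fq.gz R2.unpaired.fq.gz \
        ILLUMINACLIP:$TRIM/adapters/TruSeq3-PE.fa:2:30:10:1 SLIDINGWINDOW:4:2 MINLEN:50

    # 2) Alignment with HISAT2 (--dta tailors the output for StringTie),
    #    piping straight into a coordinate sort
    hisat2 -p 8 --dta -x genome_snp_tran \
        -1 R1.paired.fq.gz -2 R2.paired.fq.gz \
        | samtools sort -@ 8 -o aligned.sorted.bam -

    # 3) Transcriptome assembly with StringTie
    stringtie aligned.sorted.bam -p 8 -o assembly.gtf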

Results

  1. Trimmomatic output
    Input Read Pairs: 241123024 Both Surviving: 239209323 (99.21%) Forward Only Surviving: 1785249 (0.74%) Reverse Only Surviving: 97734 (0.04%) Dropped: 30718 (0.01%)

  2. HISAT2 output
    239209323 reads; of these:
      239209323 (100.00%) were paired; of these:
        30764268 (12.86%) aligned concordantly 0 times
        145565179 (60.85%) aligned concordantly exactly 1 time
        62879876 (26.29%) aligned concordantly >1 times
        ----
        30764268 pairs aligned concordantly 0 times; of these:
          674749 (2.19%) aligned discordantly 1 time
        ----
        30089519 pairs aligned 0 times concordantly or discordantly; of these:
          60179038 mates make up the pairs; of these:
            39535687 (65.70%) aligned 0 times
            13289289 (22.08%) aligned exactly 1 time
            7354062 (12.22%) aligned >1 times
    91.74% overall alignment rate

  3. Computational requirements

    Step        Time (h:mm:ss)   Cores   RAM
    HISAT2      2:28:23          8       8 GB
    StringTie   0:35:25          8       2.2 GB

Conclusion

With whatever confidence we can draw from a single sample, we can conclude that the new HISAT2/StringTie pipeline is much faster, with a much smaller memory requirement, than its predecessors. I do not think diginorm can add much here (just interleaving and compressing the two read files to fit the input format of the khmer scripts took more than 18 hours!), and thus I decided to stop here.
