
Diginorm in context of reference based transcriptome analysis #69

Open
drtamermansour opened this issue Nov 12, 2015 · 1 comment

@drtamermansour
Member

I am summarizing my experience with digital normalization (diginorm) in reference-based RNA-seq analysis, along with possible future directions. I went through three use cases (a command-line sketch of the relevant steps follows the list):

  1. Diginorm of a group of samples in one experiment with k=20 and coverage 20x, followed by using all of the output reads as one sample to build a single transcriptome with Cufflinks. I compared the results to the output of the classical Cufflinks pipeline, where we assemble a transcriptome for each sample and then merge them all into one final transcriptome using Cuffmerge. I made two main observations:
    • Good thing: With diginorm we catch transcripts that we miss with the classical Cufflinks pipeline. Apparently these are low-abundance transcripts that digital normalization enriches for. Theoretically we could obtain the same results by merging all input reads and running them through Cufflinks as a single sample, but this becomes computationally unfeasible once you reach a couple of GBs of sequencing. Most probably this is not important for differential expression analysis, but I would assume it will be extremely important for non-coding RNA analysis (I will explain later).
    • Bad thing: We miss a lot of exon-exon junctions of highly covered genes, because diginorm discards those informative junction-spanning reads while enriching for reads from primary transcripts.
  2. Diginorm of samples belonging to the same experiment (as in case 1) at only 10x coverage, then pooling all reads from all experiments (eight experiments in total). This is a scaling-up of the first use case, and such an integrative analysis is almost impossible without an idea like diginorm. You can view the resultant assembly by following this link. Yes, we miss genes found by other assemblies, but we find new genes/isoforms and extend the UTRs of several gene models. Simply put, it is a new way of visualizing the data. Of course we need to add specific filters to remove predicted primary transcripts (I am already working on something), but I think this can turn into something really useful.
  3. Diginorm of single deeply sequenced samples at k=20 and C=200x. YES, 200x, I mean it. The output is AMAZING. With classical Cufflinks, a 20GB sample failed to finish (several times) after 7 days on 32 processors. After 200x normalization of the mapped reads (~16GB), which retained ~40% of the reads, Cufflinks took 13 minutes to finish the job. The assembly is very comparable to those of the other samples in the same experiment made with the typical pipeline. This needs to be repeated with samples that I know can finish Cufflinks, so that I can compare the effect of diginorm on the same sample at different coverages. You can easily see how this would cut the analysis time of RNA-seq assemblies, especially with the deep-coverage experiments expected to come. Two technical points should be highlighted here:
    • With the current implementation of diginorm, we require huge RAM for these deeply sequenced samples, which adds a needless computational limitation, while reference-based experiments "usually" do not require that much RAM. As I discussed before with @ctb, a more RAM-conservative data structure would broaden the applicability of diginorm in many situations.
    • Diginorm favors retention of unmappable reads, so it is much better to run diginorm on the mapped reads only. This means adding an extra step to extract the mapped reads into FASTQ files and then re-mapping after diginorm; having diginorm read BAM (or SAM) files directly would save these two steps. I remember discussions about making a reference-based diginorm, which could be another option, but one more reason to encourage developing a BAM parser is to allow selective retention of exon-junction reads (which can be recognized by the CIGAR string in the BAM file). This would solve the problem seen in points 1 and 2 and might allow diginorm to perform an otherwise completely unfeasible integrative analysis.
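
For concreteness, here is a minimal command-line sketch of these use cases. It is not the exact pipeline I ran: the file names are placeholders, the khmer flags (-k, -C, -p, -o) are the standard ones for normalize-by-median.py, and the last step only illustrates the proposed junction-aware retention, using the fact that spliced alignments carry an 'N' operation in the CIGAR string (column 6 of SAM):

    # Interleave the paired-end reads so khmer can keep pairs together
    interleave-reads.py reads_1.fastq.gz reads_2.fastq.gz -o reads.pe.fq

    # Digital normalization as in use case 1 (k=20, C=20x);
    # for use case 3, raise the cutoff to -C 200
    normalize-by-median.py -k 20 -C 20 -p reads.pe.fq -o reads.keep.fq

    # Workaround for diginorm's retention of unmappable reads: keep only
    # mapped reads (-F 4 drops unmapped records), convert back to FASTQ,
    # normalize, then re-map afterwards
    samtools view -b -F 4 sample.bam > sample.mapped.bam
    samtools fastq sample.mapped.bam > sample.mapped.fq

    # Proposed junction-aware retention: spliced reads carry an 'N' in the
    # CIGAR string and could be set aside and kept unconditionally
    samtools view sample.mapped.bam | awk '$6 ~ /N/' > junction_reads.sam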
@drtamermansour
Member Author

To follow up on this idea, I decided to find a deeply sequenced RNA sample to test the time and computational resources required for analysis using the new HISAT2/StringTie software that replaces TopHat2/Cufflinks. The aim was to document how diginorm could improve such a process.

Materials:

I searched the ENCODE project for a deeply sequenced sample and found one prepared from the cytoplasmic fraction of independent growths of the SK-N-SH cell line. It is a PE101 Illumina HiSeq RNA-seq library from rRNA-depleted poly-A+ RNA. The sample has 241,123,024 PE reads.
https://www.encodeproject.org/files/ENCFF000INM/@@download/ENCFF000INM.fastq.gz
https://www.encodeproject.org/files/ENCFF000INT/@@download/ENCFF000INT.fastq.gz
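
A minimal sketch for fetching the two mates and sanity-checking the read count (four FASTQ lines per read):

    wget https://www.encodeproject.org/files/ENCFF000INM/@@download/ENCFF000INM.fastq.gz
    wget https://www.encodeproject.org/files/ENCFF000INT/@@download/ENCFF000INT.fastq.gz

    # should print 241123024 for each mate
    echo $(( $(zcat ENCFF000INM.fastq.gz | wc -l) / 4 ))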

Analysis pipeline

  1. Adaptor trimming with Trimmomatic v0.36
    (ILLUMINACLIP:$TRIM/adapters/TruSeq3-PE.fa:2:30:10:1 SLIDINGWINDOW:4:2 MINLEN:50)
  2. Alignment to the human genome using HISAT2
    The genome index was obtained from the HISAT2 website. A version that includes genomic variants and transcriptome sequences from Ensembl GRCh38 was selected.
  3. Transcriptome assembly using StringTie (a command sketch of all three steps follows this list)
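
A hedged sketch of the three steps as shell commands. Only the Trimmomatic parameters are verbatim from above; the file names, thread counts, and index basename (genome_snp_tran, for the GRCh38 snp+tran index distributed on the HISAT2 site) are assumptions on my part:

    # 1) Adaptor trimming (parameters as listed above)
    java -jar trimmomatic-0.36.jar PE -threads 8 \
        ENCFF000INM.fastq.gz ENCFF000INT.fastq.gz \
        R1.paired.fq.gz R1.unpaired.fq.gz R2.paired.fq.gz R2.unpaired.fq.gz \
        ILLUMINACLIP:$TRIM/adapters/TruSeq3-PE.fa:2:30:10:1 SLIDINGWINDOW:4:2 MINLEN:50

    # 2) Alignment with HISAT2 (--dta tailors the output for StringTie),
    #    piping straight into a coordinate sort
    hisat2 -p 8 --dta -x genome_snp_tran \
        -1 R1.paired.fq.gz -2 R2.paired.fq.gz \
        | samtools sort -@ 8 -o aligned.sorted.bam -

    # 3) Transcriptome assembly with StringTie
    stringtie aligned.sorted.bam -p 8 -o assembly.gtf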

Results

  1. Trimmomatic output
    Input Read Pairs: 241123024 Both Surviving: 239209323 (99.21%) Forward Only Surviving: 1785249 (0.74%) Reverse Only Surviving: 97734 (0.04%) Dropped: 30718 (0.01%)

  2. HISAT2 output
    239209323 reads; of these:
      239209323 (100.00%) were paired; of these:
        30764268 (12.86%) aligned concordantly 0 times
        145565179 (60.85%) aligned concordantly exactly 1 time
        62879876 (26.29%) aligned concordantly >1 times
        ----
        30764268 pairs aligned concordantly 0 times; of these:
          674749 (2.19%) aligned discordantly 1 time
        ----
        30089519 pairs aligned 0 times concordantly or discordantly; of these:
          60179038 mates make up the pairs; of these:
            39535687 (65.70%) aligned 0 times
            13289289 (22.08%) aligned exactly 1 time
            7354062 (12.22%) aligned >1 times
    91.74% overall alignment rate

  3. Computational requirements

    Step        Time (h:mm:ss)   Cores   RAM
    HISAT2      2:28:23          8       8 GB
    StringTie   0:35:25          8       2.2 GB

Conclusion

With whatever confidence we can draw from a single sample, we can conclude that the new HISAT2/StringTie pipeline is much faster, with a much smaller memory requirement, than its predecessors. I do not think diginorm can add much here (just interleaving and compressing the two read files to fit the input format of the khmer scripts took more than 18 hours!), and thus I decided to stop here.
