So far we have been working with fastq files, which contain the reads that were generated during the sequencing experiment. A priori we do not know from which transcripts those reads were originated, and that is precisely what will be addressed in following steps, starting with the mapping. There are essentially two ways of approaching this: one is to align the reads to a known transcriptome or genome, and the other is to assemble these reads de novo into a transcriptome without the need for any reference.
Exercise: Can you think of any advantages/disadvantages for the above mentioned approaches? Solution
Here we decided to align our reads to a known reference genome. To achieve this task, you could use any aligner capable of exporting its results in the SAM/BAM format or in a format that can be easily converted to this one. In the case of RNA-seq data, we also want to be able to retain information about split reads (i.e. reads with a gap) and spliced reads (i.e. those that span multiple exons). There are many aligners available, some of them optimised for working with RNA-seq data (e.g. TopHat, GSNAP... - see Fonseca et al. 2012 for a review). In this practical we decided to use TopHat, and these are the commands we would use to map the fastq files that we have generated during the filtering step:
# do not run - very time consuming
# output already provided in the data/mapped directory
cd reference
# obtain the reference genome from Ensembl
wget ftp://ftp.ensembl.org/pub/release-62/fasta/drosophila_melanogaster/dna/\
Drosophila_melanogaster.BDGP5.25.62.dna_rm.toplevel.fa.gz
gunzip Drosophila_melanogaster.BDGP5.25.62.dna_rm.toplevel.fa.gz
# index the reference genome
bowtie-build Drosophila_melanogaster.BDGP5.25.62.dna_rm.toplevel.fa \
Drosophila_melanogaster.BDGP5.25.62.dna_rm.toplevel
# align the reads
tophat Drosophila_melanogaster.BDGP5.25.62.dna_rm.toplevel \
../data/raw/SRR031714_1_filt3.fastq \
../data/raw/SRR031714_2_filt3.fastq
The last command runs TopHat with the default options. A detailed description of those, as well as information on other commands (e.g. for single-end reads), can be found in the manual or just by typing the name of the tool in the console (i.e. type tophat
on its own).
Notice the reference
directory, which contains, amongst other files, the genomic sequence for Drosophila melanogaster (Drosophila_melanogaster.BDGP5.25.62.dna_rm.toplevel.fa
). We can inspect which chromosomes are present in this fasta file with the following command:
grep '^>' Drosophila_melanogaster.BDGP5.25.62.dna_rm.toplevel.fa | head
Exercise: Suppose we had decided to align to the transcriptome instead. Similarly to what we did with the genome sequence, the transcriptome sequence for Drosophila melanogaster can be obtained from Ensembl with the following command:
# do not run
wget ftp://ftp.ensembl.org/pub/release-62/fasta/drosophila_melanogaster/cdna/\
Drosophila_melanogaster.BDGP5.25.62.cdna.all.fa.gz
gunzip Drosophila_melanogaster.BDGP5.25.62.cdna.all.fa.gz
This file has been provided on the reference
directory. What do the "chromosome" names correspond to in this case?
Solution
We are now familiarised with the input required to align reads to a reference genome or transcriptome. In both cases, the output produced by the mapping tool is going to be stored in SAM/BAM format, which we will inspect in the next section.