Sunday, 9 December 2012

Align RNA-seq reads to genome with TopHat

RNA-seq data analysis poses some interesting bioinformatic challenges. One of the most important considerations is the type of reference sequence to be used. If there is a whole genome sequence (WGS), then the WGS is the most appropriate reference, even if it is not annotated entirely. If there is no WGS, then you may need to use a curated cDNA library such as RefSeq, Unigene, or organism specific collection. There are a few reasons why aligning to the genome is the most correct method
  1. MapQ scores mean something - cDNA collections like RefSeq often have splice variants entered as different entries. This means that some variants will have common exons. Reads aligning to those exons will be assigned a mapQ of zero!
  2. Discovery - many of the cDNA sequence collections are incomplete, including RefSeq. Compare the number of entries from GENCODE to RefSeq and you will see that GENCODE is far richer in terms of number of known transcripts. Moreover, if you are aligning to a genome, you can perform something of an ab initio analysis and identify previously uncharacterised regions of high coverage.
  3. Accurate splicing analysis - alignment to genome allows for splicing analysis by quantifying expression on an exon-by-exon approach or by quantifying exon-exon junctions. 
The problem with RNA-seq is that we are trying to align short read to a really large reference sequence and that splicing introduces the possibility of having huge gaps (introns) in the alignment. TopHat was developed to perform ab initio splice junction analysis on mammalian sized genomes. It is optimised for reads about 75nt ot longer. Check out the paper which is well worth a read here.

TopHat uses the bowtie2 aligner, so you will need to have that installed too. For my Ubuntu installation,  also needed to install
  • "build-essential" which contains g++ libraries; can be done with "apt-get install build-essential"
  • "boost" which had to be set up manually (link)
After that, it was fairly straight-forward to download TopHat binaries to somewhere like /usr/local/bin, unpack it and export the contents to path (export PATH=$PATH:/usr/local/bin/tophat-2.0.6.Linux_x86_64/).

First step is to build the reference index with bowtie, in this case the human build Hg19.
bowtie2-build -f hg19.fa hg19
If you can't wait for the build, then you could try downloading the indexes supplied by Illumina here. Before we go ahead and start the alignment, check again that your sequence is high quality and free of adapter contamination. See some of my past posts for help (QC, Quality filtering, adapter clipping). TopHat's parameters are many, including setting the max number of mismatches or gaps for successful alignments, setting expected intron size range, setting rules for handling single and paired end data, ability to use strand-specific alignment. Like Bowtie2, there are also some "pre-set" parameters which allow researchers to run fast optimisation from "very fast" to "very sensitive". You can also provide TopHat a gtf file (genome annotation file, try GENCODE). In that  scenario, TopHat will preferentially align to exon regions before searching for exon junctions and then to other genomic regions. TopHat also has extensive fusion transcript identification tools which I'll discuss in depth in a future post. There are a nearly infinite combination of parameters to try, but here is an example using the gtf information for single end Illumina data:
tophat -p8 -o out_sample1 -G known_genes.gtf \
--transcriptome-index=transcriptome_data/known \
hg19 sample1_1.fq
One thing I really like about TopHat is that it automagically does many of the mundane tasks that are usually done by samtools. This means that we can go from fastq to bam in 1 command, as opposed to the 5 step pipeline I outlined in a previous post using BWA. The aligned and unaligned reads are stored in different bam files which is different to some other pipelines such as BWA.

In summary TopHat is probably the most sophisticated RNA-Seq aligner in use, although there are others on the market including GSNAP, MapSplice and SpliceMap. Next, we will try CuffLinks for downstream analysis of RNA-seq.

TopHat further reading:
TopHat paper
TopHat on BioWulf
RNA aligners reviewed