Friday, 14 June 2013

Generate an RNA-seq count matrix: Part 1 - GTF2BED

Previously we took a look at how to align mRNA-Seq reads to a genome with BWA and Tophat. The next step in the pipeline is to count reads which align on exons and then to get them into a count matrix that we can submit for statistical analysis with EdgeR or another program.

At this point in the process, we need to decide which annotation set to use. For human, we commonly use RefSeq and we are beginning to use GENCODE annotation because of its very comprehensive nature. Most of the programs which count reads require a BED file to indicate which genomic regions should be counted.

For this blog post, I will show you how we generate an exon BED file for GENCODE genes. The script below will download the GTF file and generate a temp file list of all exons present in the GTF in the form of a BED file (chromosome, start, end) with the gene accession number and gene name pasted into one field.

chr12 38710380 38710866 ENSG00000175548.4_ALG10B
chr12 38710564 38710866 ENSG00000175548.4_ALG10B
chr12 38710569 38712260 ENSG00000175548.4_ALG10B
chr12 38710641 38710840 ENSG00000175548.4_ALG10B
chr12 38712063 38712260 ENSG00000175548.4_ALG10B
chr12 38713963 38714252 ENSG00000175548.4_ALG10B
chr12 38713963 38715129 ENSG00000175548.4_ALG10B
chr12 38713963 38716388 ENSG00000175548.4_ALG10B
So you might think that this output is fine to use, but there is one complication. The fact that in the above example and many others throughout the genome that there are overlapping exons. If we don't collapse them, then reads assigned to those overlapping regions could be counted multiple times and provide an inaccurate profile of expression.

The solution is to merge overlapping exons right? Well yes, but there is another major caveat in that exons from different genes can sometimes have overlapping exons and we can't collapse exons from different genes, otherwise we wouldn't be able to distinguish them. The solution is to merge exons which only belong to the same gene. If there is a read which aligns to a region with more than one annotation, then it is counted twice because we can't differentiate between the two genes at that position.

As you can see in the script below, the program cycles through all unique genes (over 55k) and then grabs exons matching to each and then does a merge of those BED regions with bedtools. It then adds the merged exons to a final BED output file. As there are over 1 million exons, this process is quite slow, but we only need to perform this task each time a new annotation set is made, so being inefficient is not a big deal. The alternative and faster option to this method is to explode the list of exons based on gene name and then perform a bedtools merge on each of the 55k files and collate the results. Either way at the end of the day day you will get the same result.

#input parameters
#Prepare annotation file
wget $GTFURL
gunzip -f ${GTF}.gz
grep exon $GTF |tr ';' '\t' |tr ' ' '\t' |cut -f1,4,5,10,22 \
| tr -d '"' |sed 's/\t/_/4' |sort -u  > $TMP
for GENE in `cut -f4 $TMP | sort -u`; do
grep $GENE $TMP |tr '\t' '%' |sort -u |tr '%' '\t' |sed 's/\t/_/4' \
| bedtools sort |bedtools merge -nms |cut -d ';' -f1 >> $BED

You will notice that the merged exons are annotated only with the gene name and not with the exon ID. The reason behind this is that our group is mostly interested in overall up- and down-regulation of genes and not necessarily in differential exon splicing, and the annotation we have used allows for easy aggregation of in the later stage of the pipeline, which I will describe shortly.