Tuesday, 20 October 2015

Introducing "Digital Expression Explorer"

RNA-seq has been a blessing for molecular biologists. Not only does it provide unbiased, transcriptome-wide expression analysis, it can also be mined for a variety of other information such as splicing, SNV identification, RNA editing and TSS usage. As the cost of RNA-seq declines, more and more labs are using it, and so more and more data is being deposited in databases such as SRA and GEO.

But there is a growing problem.

The problem is a lack of uniformity in the processed data on GEO. Processed data come with assorted reference genomes, gene annotation sets, accession numbers, software pipelines, statistical analyses and output formats, which in most cases makes comparing two experiments hard if not impossible, let alone three or more. This is a burden on researchers who want to quickly extract expression information from public RNA-seq data. Many resort to downloading the raw data in SRA format and then processing it with QC, alignment, quantification and statistical pipelines in the Linux shell, R and other programming languages. For many researchers this is just too much effort to get an answer, so they give up or hand the work over to their bioinformatics core. To me, this is a real shame. Novice biologists feel dependent on their overworked bioinformatics core for basic data processing; bioinformatics cores process these data instead of doing more creative, original work; the data isn't re-used as often as it should be; and SRA is bombarded with requests for entire raw data sets, clogging NCBI's bandwidth, even though most end users just need a basic expression profile.

The problem is solvable.

The development of GEO2R (paper) is a great precedent and a major achievement. It allowed novice users to search GEO for microarray datasets of interest and use the GUI to perform valid statistical analysis independently of the bioinformatics core. The challenges in developing GEO2R were considerable, primarily the myriad microarray platforms, each with its own probe sets and accession numbers. Having solved this compatibility issue for most array platforms, GEO2R has vastly increased the re-use rate and made "data review" as important as literature review for new projects*. The question now is "how can we make this happen for RNA-seq data?"

That is a question I've been working on for the past year. I noticed that RNA-seq analysis software had just undergone some major improvements: the STAR aligner had made the alignment step about 50 times faster than TopHat, and featureCounts represented a 22-fold speedup over HTSeq. These two developments alone meant that it was now feasible for a relatively small server cluster to download and process all of the RNA-seq data sets in SRA. So from a relatively small set of scripts that downloaded, mapped and counted exonic tags for a specific GEO series number, the idea grew to the point where we processed all data for human, mouse, rat, fly, worm, Arabidopsis, yeast, zebrafish and E. coli. That's about 120,000 RNA-seq data sets, which took three 32-core servers about six months to process. It was worth it :)

Best yet, unlike other resources such as Genevestigator, where users need to register for access and non-academic users have to buy access, ours is totally free for all users. It's called Digital Expression Explorer. Here is the pipeline for base-space data:
  • Considered only the first read of each pair.
  • Ran FastQC to assess quality and saved the log file.
  • Used the FASTX-Toolkit quality trimmer (fastq_quality_trimmer) to remove bad bases from the 3' end.
  • Mapped reads to the reference genome with STAR.
  • Used featureCounts to count the number of reads assigned to genes.
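To make the order of these steps concrete, here is a rough Python sketch that only builds the four command lines and does not run anything. The function name, the file names, the thread count and the trimming thresholds (-t 20, -l 20) are my illustrative assumptions, not the settings DEE actually used; the tool options shown are the commonly documented ones, so check each tool's help before relying on them.

```python
from pathlib import Path


def build_pipeline_commands(read1_fastq, genome_dir, gtf, out_dir, threads=8):
    """Return the command lines for one run as argument lists (nothing is executed).

    All paths and thresholds are hypothetical placeholders for illustration.
    """
    out = Path(out_dir)
    sample = Path(read1_fastq).name.split(".")[0]
    trimmed = out / f"{sample}.trimmed.fastq"
    star_prefix = str(out / f"{sample}.")
    # STAR appends "Aligned.out.bam" to the prefix when asked for unsorted BAM.
    bam = star_prefix + "Aligned.out.bam"
    counts = out / f"{sample}.counts.txt"
    return [
        # 1. Quality report on the first read of the pair only.
        ["fastqc", "--outdir", str(out), str(read1_fastq)],
        # 2. Trim low-quality bases from the 3' end (-Q33 for Sanger-encoded
        #    qualities; thresholds are assumed values).
        ["fastq_quality_trimmer", "-Q33", "-t", "20", "-l", "20",
         "-i", str(read1_fastq), "-o", str(trimmed)],
        # 3. Map the trimmed reads to the reference genome.
        ["STAR", "--runThreadN", str(threads), "--genomeDir", str(genome_dir),
         "--readFilesIn", str(trimmed), "--outFileNamePrefix", star_prefix,
         "--outSAMtype", "BAM", "Unsorted"],
        # 4. Count reads assigned to genes.
        ["featureCounts", "-T", str(threads), "-a", str(gtf),
         "-o", str(counts), bam],
    ]


if __name__ == "__main__":
    for cmd in build_pipeline_commands(
        "SRR1578745_1.fastq", "star_index", "genes.gtf", "out"
    ):
        print(" ".join(cmd))
```

Each list could be handed to subprocess.run(cmd, check=True) in turn; building the commands separately from running them makes it easy to log exactly what was executed for each run.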
Colour-space data was trimmed with solid-trimmer, mapped with SubJunc and counted with featureCounts. The data obtained from Digital Expression Explorer (DEE) is a matrix of gene expression counts. Here is an example for C. elegans:

Geneid GeneName SRR1578745v1 SRR1578747v1 SRR1578746v1
WBGene00000001 aap-1 1342 1200 1257
WBGene00000002 aat-1 653 700 759
WBGene00000003 aat-2 1615 1782 2160
WBGene00000004 aat-3 637 735 695
WBGene00000005 aat-4 829 721 733
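As a quick illustration of how little code is needed to work with a matrix like the one above, here is a minimal Python sketch that parses it and sums the counts per run. It assumes whitespace-delimited columns (spaces or tabs) and gene names without spaces; the function names are mine and are not part of DEE.

```python
import io

EXAMPLE = """\
Geneid GeneName SRR1578745v1 SRR1578747v1 SRR1578746v1
WBGene00000001 aap-1 1342 1200 1257
WBGene00000002 aat-1 653 700 759
WBGene00000003 aat-2 1615 1782 2160
WBGene00000004 aat-3 637 735 695
WBGene00000005 aat-4 829 721 733
"""


def read_counts(handle):
    """Parse a count matrix into (run names, gene symbols, per-gene counts)."""
    header = handle.readline().split()
    runs = header[2:]  # first two columns are gene ID and gene name
    symbols = {}
    counts = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        gene_id, symbol, *values = fields
        symbols[gene_id] = symbol
        counts[gene_id] = dict(zip(runs, (int(v) for v in values)))
    return runs, symbols, counts


def library_sizes(runs, counts):
    """Total counts per run, summed over the genes present in the matrix."""
    return {run: sum(row[run] for row in counts.values()) for run in runs}


if __name__ == "__main__":
    runs, symbols, counts = read_counts(io.StringIO(EXAMPLE))
    print(library_sizes(runs, counts))
    # {'SRR1578745v1': 5076, 'SRR1578747v1': 5138, 'SRR1578746v1': 5604}
```

With a real download you would pass an open file handle instead of the embedded example; the totals here cover only the five genes shown, so they are not real library sizes.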

You see that DEE includes the Ensembl gene accession number as well as the gene symbol. The column headers are simply the SRA run accession followed by the DEE release version. The simplicity of the website interface is deliberate. It accepts SRA/GEO accession queries such as GEO series or SRA project numbers. You can also use keywords to identify interesting studies. When you download the count data, the analysis logs are downloaded too, so end users can look at some quality metrics to see whether the data is of good quality and sufficient depth. For each organism, we selected one experiment and compared the DEE data to the authors' data (see the poster in the pic below), showing that the Spearman correlation was >0.97 and that, after DGE analysis with edgeR, the overlap (Jaccard index) of statistically significant DEGs was >0.76.
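For readers unfamiliar with the two metrics, here is a small, dependency-free Python sketch of how a Spearman correlation and a Jaccard index can be computed. It is not the code used for the poster, and the numbers and gene names in the usage example are made up purely for illustration.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    union = a | b
    # Convention chosen here: two empty sets count as identical.
    return len(a & b) / len(union) if union else 1.0


if __name__ == "__main__":
    # Hypothetical counts for the same genes from two pipelines.
    dee_counts = [1342, 653, 1615, 637, 829]
    author_counts = [1290, 700, 1580, 610, 900]
    print(spearman(dee_counts, author_counts))
    # Hypothetical lists of significant genes from two analyses.
    print(jaccard({"geneA", "geneB", "geneC"}, {"geneB", "geneC", "geneD"}))
```

In practice scipy.stats.spearmanr does the correlation for you; the hand-rolled version is here only to show what is being measured.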

Ziemann M, Kaspi A, Lazarus R, El-Osta A. Digital Expression Explorer: A user-friendly repository of uniformly processed RNA-seq data. ComBio2015 DOI: 10.13140/RG.2.1.1707.5926 See our poster

What I'm really excited about is that DEE count data can be immediately uploaded to online RNA-seq analysis tools such as Degust (highly recommended). The process literally takes about 1-2 minutes from GEO series number to differential expression statistics. Take a look at our YouTube clip for an example.

That workflow can even be done on tablet computers or smartphones. It really has the potential to "democratise" RNA-seq analysis, which has previously been the domain of a select few highly trained bioinformaticians with powerful Linux servers.

But we haven't forgotten about the needs of advanced users: we have provided a neat R function that programmatically downloads the data and loads it into R. We have also made the bulk data sets freely available, and we're looking to have Galaxy integration at some point too.

So give DEE a try. I'm happy to field any questions you may have and would love to hear your feedback and suggestions.

I would personally like to thank everyone who has helped along the way, including Antony Kaspi, Ross Lazarus, Assam El-Osta, Haloom Rafehi, the entire Human Epigenetics Lab at Baker IDI and our IT department, who helped with setting up the web server, especially Richard Lee and Marcus Benson.

*Suggested Reading
Rung J, Brazma A. Reuse of public genome-wide gene expression data. Nat Rev Genet. 2013;14(2):89-99.

Related post
User friendly RNA-seq differential expression analysis with Degust