But there is a growing problem.
The problem is a lack of uniformity in the processed data on GEO. Processed data sets use assorted reference genomes, gene annotation sets, accession numbers, software pipelines, statistical analyses and output formats, which in most cases makes comparing two experiments hard if not impossible, let alone three or more. This is a burden on researchers who want to quickly extract expression information from public RNA-seq data. Many researchers resort to downloading the raw data in SRA format and then processing it with QC/alignment/quantification/statistical pipelines in the Linux shell, R and other programming languages. For many researchers, this is just too much effort to go through to get answers, and they might give up or hand the work over to their bioinformatics core. To me, this is a real shame. Novice biologists feel dependent on their overworked bioinformatics core for basic data processing; bioinformatics cores spend their time processing these data instead of doing more creative original work; the data isn't re-used as frequently as it should be; and SRA is bombarded by requests for entire raw data sets, clogging NCBI's bandwidth, even though most end-users just need a basic expression profile.
The problem is solvable.
The development of GEO2R (paper) is a great precedent and a major achievement. It allowed novice users to search GEO for microarray datasets of interest and use the GUI to perform valid statistical analysis independently of the bioinformatics core. The challenges in developing GEO2R were considerable, primarily the myriad microarray platforms, each with its own probe sets and accession numbers. Having solved this compatibility issue for most array platforms, GEO2R has vastly increased the re-use rate and made "data review" as important as literature review for new projects*. The question now is: "How can we make this happen for RNA-seq data?"
That is a question I've been working on for the past year. I noticed that RNA-seq analysis software had just undergone some major improvements: the STAR aligner had made the alignment step about 50 times faster than TopHat, and featureCounts offered a 22-fold speedup over HTSeq. These two developments alone meant that it was now feasible for a relatively small server cluster to download and process all of the RNA-seq data sets in SRA. So from a relatively small set of scripts that downloaded, mapped and counted exonic tags for a specific GEO series number, the idea grew to the point where we processed all data for human, mouse, rat, fly, worm, Arabidopsis, yeast, zebrafish and E. coli. That's about 120,000 RNA-seq data sets, which took three 32-core servers about six months to process. It was worth it :)
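To put those figures in perspective, here is a rough back-of-envelope calculation. It only uses the numbers quoted above, plus one assumption of mine: "about six months" is taken as 182 days of continuous running, so treat the results as ballpark values rather than measured throughput.

```python
# Back-of-envelope throughput from the figures quoted in the post.
DATASETS = 120_000        # "about 120,000 RNA-seq data sets"
SERVERS = 3               # "three 32-core servers"
CORES_PER_SERVER = 32
DAYS = 182                # assumption: "about six months" of continuous running


def throughput(datasets, servers, cores_per_server, days):
    """Return rough per-day and per-data-set processing figures."""
    total_cores = servers * cores_per_server
    core_hours = total_cores * days * 24
    return {
        "datasets_per_day": datasets / days,
        "datasets_per_server_per_day": datasets / days / servers,
        "core_hours_per_dataset": core_hours / datasets,
    }


if __name__ == "__main__":
    stats = throughput(DATASETS, SERVERS, CORES_PER_SERVER, DAYS)
    for name, value in stats.items():
        print(f"{name}: {value:.1f}")
```

Under that assumption, the cluster averaged roughly 660 data sets per day, or about 3.5 core-hours per data set, including download time.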
Best of all, unlike other resources such as Genevestigator, where users need to register for access and non-academic users have to buy access, ours is totally free for all users. It's called Digital Expression Explorer (http://dee.bakeridi.edu.au/index.html). Here is the pipeline for base-space data:
- Only considered the first read in the pair.
- Performed FastQC to assess quality and save log file.
- Used Fastx-quality-trimmer to remove bad bases from the 3' end.
- Mapped reads to the reference genome with STAR.
- Used featureCounts to count the number of reads assigned to each gene.
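The steps above can be sketched as a small Python helper that assembles the command line for each tool. This is an illustration only, not the actual DEE scripts: the file naming scheme, working directory, thread count and trimming thresholds are all hypothetical placeholders, and it assumes the first read of each pair has already been extracted to a FASTQ file. The helper only builds the commands; it does not run them.

```python
import shlex
from pathlib import Path


def build_pipeline(run_id, genome_dir, gtf, workdir="work",
                   threads=8, min_qual=20, min_len=20):
    """Build the argument lists for one run (read 1 only).

    All paths and thresholds are illustrative assumptions, not DEE's settings.
    """
    work = Path(workdir)
    read1 = work / f"{run_id}_1.fastq"            # hypothetical naming scheme
    trimmed = work / f"{run_id}_1.trimmed.fastq"
    star_prefix = f"{work / run_id}."             # STAR appends its own suffixes
    bam = f"{star_prefix}Aligned.out.bam"         # STAR's unsorted BAM output name
    counts = work / f"{run_id}.counts.txt"

    return [
        # 1. Quality report
        ["fastqc", str(read1), "-o", str(work)],
        # 2. Trim low-quality bases from the 3' end (FASTX-Toolkit).
        #    Phred+33 encoded data may also need the -Q33 option.
        ["fastq_quality_trimmer", "-t", str(min_qual), "-l", str(min_len),
         "-i", str(read1), "-o", str(trimmed)],
        # 3. Align to the reference genome
        ["STAR", "--runThreadN", str(threads), "--genomeDir", genome_dir,
         "--readFilesIn", str(trimmed), "--outSAMtype", "BAM", "Unsorted",
         "--outFileNamePrefix", star_prefix],
        # 4. Count reads assigned to genes
        ["featureCounts", "-T", str(threads), "-a", gtf,
         "-o", str(counts), bam],
    ]


if __name__ == "__main__":
    # Placeholder run ID and reference paths, for illustration only.
    for cmd in build_pipeline("SRR000001", "star_index", "genes.gtf"):
        print(shlex.join(cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)` on a machine with the tools installed. Building the commands separately from running them makes it easy to inspect or log exactly what will be executed for each run.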
Ziemann M, Kaspi A, Lazarus R, El-Osta A. Digital Expression Explorer: A user-friendly repository of uniformly processed RNA-seq data. ComBio2015. DOI: 10.13140/RG.2.1.1707.5926. See our poster.
What I'm really excited about is that DEE count data can be immediately uploaded to online RNA-seq analysis tools such as Degust (highly recommended). The process literally takes about 1-2 minutes from GEO series number to differential expression statistics. Take a look at our YouTube clip for an example.
That workflow can even be done on tablet computers or smartphones. It really has the potential to "democratise" RNA-seq analysis, which has previously been the domain of a select few highly trained bioinformaticians with powerful Linux servers.
But we haven't forgotten about the needs of advanced users: we have provided a neat R function that programmatically downloads the data and loads it into R. We have also made the bulk data sets freely available, and we're looking to add Galaxy integration at some point too.
So give DEE a try. I'm happy to field any questions you may have and would love to hear your feedback and suggestions.
I would personally like to thank everyone who has helped along the way, including Antony Kaspi, Ross Lazarus, Assam El-Osta, Haloom Rafehi, the entire Human Epigenetics Lab at Baker IDI, and our IT department, who helped set up the web server, especially Richard Lee and Marcus Benson.
* Rung J, Brazma A. Reuse of public genome-wide gene expression data. Nat Rev Genet. 2013;14(2):89-99.
User friendly RNA-seq differential expression analysis with Degust