Are we ready to move beyond MSigDB and start a community-based gene set resource?

Gene sets are distilled information about molecular profiling experiments, and can also be generated based on other features shared by groups of genes, such as chromosomal position, sequence, co-regulation or functional information.

These are a valuable resource because they suggest similarities between different molecular profiling experiments or phenomena, and help researchers understand the factors that drive the trends in profiling experiments such as gene expression assays by microarray or RNA-seq.

To truly grasp the importance of quality gene sets, consider that the original paper describing the GSEA algorithm has accumulated 3144 citations since 2003, while the paper describing the software and wider applicability of GSEA has 7166 citations. The latter paper has also attracted positive comments from experts in the field on PubMed; here is one that I couldn't agree with more. In the words of Rafael Irizarry, "The idea of analyzing differential expression for groups of genes, as opposed to individual genes, was an important step forward in the analysis of gene expression data."


How it currently works

MSigDB (The Molecular Signatures Database) is the gene set resource that powers GSEA and many other pathway analyses. MSigDB is a collection of annotated gene sets that currently consists of 10,295 human gene sets from a range of sources. See Table 1.
Table 1. Human gene sets from MSigDB in version 4.0.
As you can see, the folks at MSigDB have done a huge amount of work in collecting gene sets from a variety of sources, including several that are empirically defined from microarray or other profiling experiments, as well as gene ontology sources and bioinformatically generated sets. There is some literature offering recommendations on how to make gene sets, but this is far from being a community-wide consensus.

In addition to MSigDB, you can find gene sets in other places too. Here is a non-exhaustive list.
  • ChEA, an on-line resource of ChIP-derived transcription factor targets
  • GeneSetDB, a collection of sets from a variety of sources
  • CMAP, a collection of gene profiles of human cells treated with bioactive small molecules
  • WikiPathways, a community-powered resource of signalling and biochemical pathways
  • Database of Microarray Marker Genes Microarray analysis portal
  • CleanEX Gene symbol based microarray data analysis portal
  • PAGED Human disease-centric gene 
  • Enrichment Map Gene Sets, human and mouse gene sets from a variety of sources
  • As tables and supplementary information in journal articles

You can also generate them yourself in a number of ways:
  • Through GEO2R for microarray expression studies (example)
  • Mining processed data such as ENCODE project or Epigenome Roadmap Project (example)
  • By performing the entire analysis again for other types of studies including proteomic, genomic, transcriptomic
  • Based on features of protein or gene sequence or structure
  • Meta-analysis of any of the above (example)
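
To make one of the routes listed above concrete, here is a minimal Python sketch of deriving a gene set from a differential expression table and writing it in the tab-delimited GMT layout that MSigDB uses (set name, description, then gene symbols). It is only an illustration, not a recommended standard: the gene names, column names, thresholds (FDR below 0.05, absolute log2 fold change of at least 1) and the set name are all assumptions made up for the example.

```python
import csv
import io

# Hypothetical differential expression table; genes and values are made up.
DE_TABLE = """gene\tlog2fc\tfdr
GENE_A\t2.1\t0.001
GENE_B\t-1.8\t0.01
GENE_C\t0.3\t0.2
GENE_D\t1.5\t0.04
GENE_E\t3.0\t0.30
"""


def select_genes(table_text, fdr_max=0.05, min_abs_log2fc=1.0, direction="up"):
    """Return sorted gene symbols passing the (assumed) thresholds."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    selected = []
    for row in reader:
        log2fc = float(row["log2fc"])
        fdr = float(row["fdr"])
        # Skip genes that are not significant or change too little.
        if fdr >= fdr_max or abs(log2fc) < min_abs_log2fc:
            continue
        # Keep only genes changing in the requested direction.
        if (direction == "up") == (log2fc > 0):
            selected.append(row["gene"])
    return sorted(selected)


def to_gmt_line(name, description, genes):
    """Format one gene set as a GMT line: name, description, then genes."""
    return "\t".join([name, description] + list(genes))


up_genes = select_genes(DE_TABLE, direction="up")
gmt_line = to_gmt_line("EXAMPLE_TREATMENT_UP", "hypothetical example set", up_genes)
print(gmt_line)
```

Calling select_genes with direction="down" gives the matching down-regulated set. Keeping the thresholds as named parameters means they can be reported alongside the set rather than buried in the analysis.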


What we should be doing better

  • Establish solid guidelines for how researchers should go about generating gene sets of their own.
  • Include non-coding RNAs, especially microRNAs, which have been sadly neglected.
  • Make a one-stop-shop for all species, not just human.
  • Demonstrate provenance: How was the data processed? How were the statistics done? Who generated the gene sets? Gene sets are important artefacts that are used to guide researchers, so why don't they have their own methods section attached?
  • Promote reproducibility: Could they be generated again today? What's stopping us from requiring, as a condition of distribution, that the computer code used to generate a gene set is posted so that the set can be reproduced?
  • Boost the update cycle: New data is deposited on GEO and other databases daily, but new gene sets are only being added sporadically. For example, MSigDB v4.0 was last updated on May 31, 2013. In this fast-paced world of genomics, we need to stay informed of recent studies without needing to undergo primary analysis for all of them. This also extends to when genome annotations are updated and modified, so that when new genes are discovered, the gene sets themselves can be updated too.
  • Target community support: The generation of gene sets is dominated by MSigDB curators. While undoubtedly they have done a splendid job with MSigDB, this means that the rest of the genomics community take a back seat and don't actively participate in submitting new gene sets. Furthermore, the processes used to evaluate validity of gene sets aren't transparent and perhaps the process requires further rigour and standardisation.
  • Give due credit: Insufficient credit is given to the people that curate these gene sets. Post-analysis of data and generation of gene sets is, for many journals, considered to be a trivial exercise, so there is little incentive for researchers to generate and share them, despite their community value.
In many ways, the above points are a symptom of how things were done in 2006 when MSigDB was first published.

So what is the solution?

There could be a few different solutions. It could be as simple as a wiki. Wikis work, and they have been used for similar cases such as WikiPathways and WikiGenes, but I feel that the amount of work required to curate and maintain these gene sets is large enough that most researchers wouldn't be sufficiently motivated to curate and submit gene sets. Moreover, wikis have been known to be susceptible to vandalism (both well-meaning and malicious) and spam. As these gene sets are to be used by other researchers, quality control needs to be the primary consideration.

The solution is a specialist peer-reviewed journal! 

A peer-reviewed journal can set the standard for valid methods to generate gene sets. Submission could follow a strict template to minimise the work of editors. The source data and methods used would be completely and clearly described. The code would be housed in a GitHub repository. Gene set authors' efforts would be rewarded by (rapid) open access publication of their gene sets. Gene set authors (and the authors of the original study) would be cited when those gene sets appear in subsequent studies.
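
As a rough idea of what such a strict template could look like in practice, here is a minimal Python sketch of a machine-readable submission record that travels with each gene set, plus a check that editors could run automatically. Everything in it is hypothetical: the field names, the example values and the repository URL are placeholders I made up for illustration, not an existing standard.

```python
# Hypothetical required fields for a gene set submission (not a real standard).
REQUIRED_FIELDS = (
    "name",
    "species",
    "source_data",
    "method",
    "code_repository",
    "authors",
    "genes",
)


def validate_submission(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field in REQUIRED_FIELDS:
        # Missing keys and empty values are both treated as missing.
        if not record.get(field):
            problems.append(f"missing field: {field}")
    genes = record.get("genes") or []
    if len(set(genes)) != len(genes):
        problems.append("duplicate gene identifiers")
    return problems


# Placeholder record; none of these values refer to a real study.
example_record = {
    "name": "EXAMPLE_TREATMENT_UP",
    "species": "Homo sapiens",
    "source_data": "public repository accession (placeholder)",
    "method": "FDR < 0.05 and log2 fold change >= 1 (assumed thresholds)",
    "code_repository": "https://github.com/example/example-gene-set",
    "authors": ["A. Researcher"],
    "genes": ["GENE_A", "GENE_D"],
}

# A deliberately broken copy: no code repository and a repeated gene.
incomplete_record = dict(
    example_record, code_repository="", genes=["GENE_A", "GENE_A"]
)

print(validate_submission(example_record))
print(validate_submission(incomplete_record))
```

A check like this would only confirm that the provenance fields are present; judging whether the methods are sound would still be the job of peer reviewers.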

Rather than being controlled by a small set of human genetics researchers in the USA, the journal would serve a global community studying all domains of life. The scope need not be limited to gene sets and could evolve according to the dominant analytical methods of the time, such as gene ranks. Moreover, the data need not be limited to genes, and could include proteins, carbohydrates, lipids and other biomolecules.

For this concept to get off the ground, we will need an experienced and dedicated team to perform a range of tasks, so contact me if you are interested in volunteering as:

  • Editorial board member
  • Peer reviewer
  • Copy editor
  • Proofreader
  • Website developer/maintainers
  • Database developer/curators
  • Author!

Most of all, we need to spread the word about how important quality molecular signatures are to understanding biology. We need to promote the concept: tell your professor, share on social media, tell anyone!

We also need a journal title - so I'll put it out there for you to suggest!
