Thursday, 14 April 2016

How to generate a rank file from gene expression data in R

Previously I wrote about how to make a gene expression rankfile using the unix shell. While this works for me just fine, there are some differences in the behaviour of shell tools like awk an sed that make it unusable for mac users.

Therefore, I think the best solution is to show you how to do this in R, which you can install on Linux, Mac and Windows systems.

The expression data for this demo looks like this (the first 5 columns).

Name                        logFC               logCPM            LR                PValue
ENSG00000134294.9_SLC38A2   0.365464972841137   8.35504063447063  80.9697378748286  2.29200821312451e-19
ENSG00000164692.13_COL1A2   0.369440969815233   6.9371167845581   59.239238358843   1.39621819298599e-14
ENSG00000117152.9_RGS4      0.417512339007825   6.27805181231213  48.690887963755   2.99655418444362e-12
ENSG00000120738.7_EGR1      -0.448872068345368  5.72373823077344  40.5159113813129  1.95021420597484e-10
ENSG00000236552.1_RPL13AP5  -0.263118776572713  7.87437056751926  40.4331725003112  2.03457218801857e-10
ENSG00000104332.6_SFRP1     0.493423326062289   4.80359815087205  38.5863976773987  5.23827228439336e-10
ENSG00000187957.7_DNER      0.470592983554025   4.33515164897497  37.0721600878008  1.13837471974503e-09
ENSG00000111335.8_OAS2      0.506694725778856   4.64035138417216  36.5587612062587  1.48132665393875e-09
ENSG00000137809.12_ITGA11   0.358068410134363   6.32535717802246  35.7230925508279  2.27451845482865e-09

Here is the code for reading a tsv and calculating the rank metric you will need to replace column names with the ones in your dataset.

y<-x[,c("Name", "metric")]

Read table command loads the data as a data frame called "x", recognising the data is tab delimited and contains a header line.

Attach command allows R to recognise columns by their name rather than by their index ($1,$2,etc), this just makes things a bt easier.

"x$fcSign=sign(logFC)" generates a new column that is the sign of fold change. 

"x$logP=-log10(PValue)" generates a new column that is the negative log10 of the p-value.

"x$metric=logP/fcSign" generates a new column that is the metric score that we'll use for ranking and pathway analysis

"y<-x[,c("Name", "metric")]" makes a new table called "y" that consists only of gene names and metric scores. 

Write.table command outputs the gene name and rank metric data to a new file called "expression.rnk" gene

In the case above, the gene identifier and gene name are combined and will need to be split before pathway analysis. GSEA can use either Ensembl accession number or gene names. If you use gene names, then take note that many gene names are not unique and will need to be collapsed somehow.