Wednesday, 10 September 2014

Data analysis step 6: Draw a heatmap from RNA-seq data using R

In the last post of this series, I left you with a gene expression profile of the effect of azacitidine on AML3 cells. I decided to use the DESeq output for downstream analysis. If we want to draw a heatmap at this stage, we might struggle because the output provided by the DEB applet does not send back the normalised count data for each sample.

It is not really useful to plot all 5704 genes with FDR adjusted p-values <0.05 on the heatmap, so I will simply show the top 100 by p-value. Here are the general steps I will use in my R script below:

  1. Read the count matrix and DESeq table into R and merge into one table
  2. Sort based on p-value with most significant genes on top
  3. Select the columns containing gene name and raw counts
  4. Scale the data per row
  5. Select the top 100 genes by significance
  6. Generate the heatmap with mostly default values
Google searches show that R has some quite elaborate heatmap options, especially with features from ggplot2 and RColorBrewer. In this example, I will use built-in R features.

#read in the count matrix
mx<-read.table("Aza_AML3_countmatrix.xls", row.names = 1 , header = TRUE)

#read in the DESeq DGE spreadsheet
dge<-read.table("DEB_DESeq.xls", row.names = 1 , header = TRUE)

#merge the counts onto the DGE spreadsheet

#sort the merged table by p-value
smg<-mg[order(mg$pval), ]

#select only the columns containing the gene names and count data
x<-subset(smg, select = c("Row.names", "UNTR1", "UNTR2", "UNTR3", "AZA1", "AZA2", "AZA3"))

#make the table a data frame with gene names then remove duplicate gene name column
y<-(, row.names = x$Row.names))
x<-subset(y, ,-c(Row.names))

#scale rows

#only grab top 100 by p-value
h<-head(xtst, n = 100L)

#set layout options - adjust if labels get cut off
pdf("heatmap.pdf",width=7, height=8)

#draw heatmap allowing larger margins and adjusting row label font size
heatmap(h, margins = c(4,10), cexRow=.4)

#output plot to file

Heatmap representing gene expression of AML3 cells treated with azacitidine. Only top 100 most significant genes are shown. Statistical analysis was performed using DESeq. Scale: Yellow indicates high  expression and red is low expression.

As you can see, the heatmap shows stark expression changes for these top 100 most significantly differential genes. Also note that the majority of genes in the top 100 are down-regulated.