Analysis of gene expression timecourse data using maSigPro

ANXA11 expression in human smooth muscle aortic cells post-ILb1 exposure

About a year ago, I did a little work on a very interesting project which was trying to identify blood-based biomarkers for the early detection of stroke. The data included gene expression measurements using microarrays at various time points after the onset of ischemia (reduced blood supply). I had not worked with timecourse data before, so I went looking for methods and found a Bioconductor package, maSigPro, which did exactly what I was looking for. In combination with ggplot2, it generated some very attractive and informative plots of gene expression over time.

I was very impressed by maSigPro and meant to get around to writing a short guide showing how to use it. I finally did, using RMarkdown to create the document, and here it is. The code illustrates how to retrieve datasets from GEO using GEOquery and annotate microarray probesets using biomaRt. Hopefully it’s useful to some of you.
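For a flavour of what the guide covers, here is a minimal sketch of the core maSigPro workflow. The objects edesign (the experimental design matrix) and exprs.data (the normalised expression matrix) are placeholders for data you would prepare yourself, for example via GEOquery; the parameter values are illustrative rather than recommendations.

library(maSigPro)

# edesign: a matrix with columns Time, Replicate and one dummy column per group
design <- make.design.matrix(edesign, degree = 2)

# find probesets whose expression changes significantly over time
fit   <- p.vector(exprs.data, design, Q = 0.05, MT.adjust = "BH")
tstep <- T.fit(fit, step.method = "backward", alfa = 0.05)
sigs  <- get.siggenes(tstep, rsq = 0.6, vars = "groups")

# quick visual check of the significant expression profiles
see.genes(sigs$sig.genes, show.fit = TRUE, dis = design$dis, cluster.method = "hclust", k = 9)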

I’ll probably do more of this in the future, since publishing with RMarkdown is far easier than copying, pasting and formatting posts in WordPress.

A brief note: R 3.0.0 and bioinformatics

Today marks the release of R 3.0.0. There will be plenty of commentary and useful information at sites such as R-bloggers (for example, Tal’s post).

Version 3.0.0 is great news for bioinformaticians, due to the introduction of long vectors. What does that mean? Well, several months ago, I was using the simpleaffy package from Bioconductor to normalize Affymetrix exon microarrays. I began as usual by reading the CEL files:

library(simpleaffy)  # provides ReadAffy() via the affy package

f <- list.files(path = "data/affyexon", pattern = ".CEL.gz", full.names = TRUE, recursive = TRUE)
cel <- ReadAffy(filenames = f)

When this happened:

Error in read.affybatch(filenames = l$filenames, phenoData = l$phenoData,  : 
  allocMatrix: too many elements specified

I had a relatively large number of samples (337), but figured a 64-bit machine with ~100 GB RAM should be able to cope. I was wrong: due to a hard-coded limit on vector length in R, my matrix had become too large regardless of available memory. See this post and this StackOverflow question for the computational details.
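For context, a quick back-of-the-envelope calculation (the probe count is an assumption, roughly what an Affymetrix exon array carries, not a figure from my dataset) shows why the old limit bites regardless of RAM:

# pre-3.0.0, no R vector (and hence no matrix) could hold more than 2^31 - 1 elements
.Machine$integer.max           # 2147483647
n.probes  <- 6.5e6             # approximate number of probes on an exon array
n.samples <- 337
n.probes * n.samples           # ~2.2e9 elements, over the limit whatever your RAM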

My solution at the time was to resort to Affymetrix Power Tools. Hopefully, the introduction of long vectors will make Bioconductor even more capable and useful.

R gotcha for the week

I use the biomaRt package from Bioconductor in almost every R session. So I thought I’d load the library and set up a mart instance in my ~/.Rprofile:

library(biomaRt)
mart.hs <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")

On starting R, I was somewhat perplexed to see this error message:

Error in bmVersion(mart, verbose = verbose) : 
  could not find function "read.table"

Twitter to the rescue. @hadleywickham told me to load utils first and @vsbuffalo explained that normally, .Rprofile is read before the utils package is loaded. Seems rather odd to me; I’d have thought that biomaRt should load utils if required, but there you go.

So this works in ~/.Rprofile:

# load utils explicitly, since .Rprofile is read before the standard packages attach
library(utils)
library(biomaRt)
mart.hs <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")

GEO database: curation lagging behind submission?

GSE and GDS records in GEOmetadb by date

I was reading an old post that describes GEOmetadb, a downloadable database containing metadata from the GEO database. We had a brief discussion in the comments about the growth in GSE records (user-submitted) versus GDS records (curated datasets) over time. Below, some quick and dirty R code to examine the issue, using the Bioconductor GEOmetadb package and ggplot2; the resulting image is shown above.

Is the curation effort keeping up with user submissions? A little difficult to say, since GEOmetadb curation seems to have its own issues: (1) why do GDS records stop in 2008? (2) why do GDS (curated) records begin earlier than GSE (submitted) records?

library(GEOmetadb)
library(ggplot2)

# update database if required using getSQLiteFile()
# connect to database; assumed to be in user $HOME
con <- dbConnect(SQLite(), "~/GEOmetadb.sqlite")

# fetch "last updated" dates for GDS and GSE
gds <- dbGetQuery(con, "select update_date from gds")
gse <- dbGetQuery(con, "select last_update_date from gse")

# cumulative sums by date; no factor variables
gds.count <- as.data.frame(cumsum(table(gds)), stringsAsFactors = F)
gse.count <- as.data.frame(cumsum(table(gse)), stringsAsFactors = F)

# make GDS and GSE data frames comparable
colnames(gds.count) <- "count"
colnames(gse.count) <- "count"

# row names (dates) to real dates
gds.count$date <- as.POSIXct(rownames(gds.count))
gse.count$date <- as.POSIXct(rownames(gse.count))

# add type for plotting
gds.count$type <- "gds"
gse.count$type <- "gse"

# combine GDS and GSE data frames
gds.gse <- rbind(gds.count, gse.count)

# and plot records over time by type
png(filename = "geometadb.png", width = 800, height = 600)
print(ggplot(gds.gse, aes(date,count)) + geom_line(aes(color = type)))
dev.off()
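As a quick check on point (1), you can also ask for the most recent update date of each record type directly, re-using the same connection and column names as above:

# most recent update dates for curated (GDS) and submitted (GSE) records
dbGetQuery(con, "select max(update_date) from gds")
dbGetQuery(con, "select max(last_update_date) from gse")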

biomaRt and GenomeGraphs: a worked example

As promised a few posts ago, another demonstration of the excellent biomaRt package, this time in conjunction with GenomeGraphs.

Here’s what we’re going to do:

  1. Grab some public microarray data
  2. Normalise and get a list of the most differentially-expressed probesets
  3. Use biomaRt to fetch the genes associated with those probesets
  4. Plot the data using GenomeGraphs

If you want to follow along on your own machine, it will need to be quite powerful. We’ll be processing exon arrays, which requires a 64-bit machine with at least 4 GB (and preferably, at least 8-12 GB) of RAM. As usual, I’m assuming some variant of Linux and that you’re comfortable at the command line.
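To give a flavour of steps 3 and 4, here is a rough sketch. It assumes you already have a character vector of exon array probeset IDs, top.probes, from the differential expression analysis; the biomaRt attribute and filter names are my best guess for the exon array, so check listAttributes() and listFilters() against your mart before relying on them.

library(biomaRt)
library(GenomeGraphs)

mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")

# step 3: map probesets to Ensembl genes and their locations
genes <- getBM(attributes = c("affy_huex_1_0_st_v2", "ensembl_gene_id",
                              "chromosome_name", "start_position", "end_position"),
               filters = "affy_huex_1_0_st_v2", values = top.probes, mart = mart)

# step 4: plot the first gene alongside a genome axis
gene <- makeGene(id = genes$ensembl_gene_id[1], type = "ensembl_gene_id", biomart = mart)
gdPlot(list(makeGenomeAxis(), gene))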

Beware of rogue header files (Bioconductor installation)

Just a short note concerning a “gotcha”.

As I have many times before, I opened an R console on my newly upgraded (to lucid 10.04) Ubuntu machine, typed source("http://bioconductor.org/biocLite.R") and began a Bioconductor install with biocLite(). Only this time, I saw this:

Error in dyn.load(file, DLLpath = DLLpath, ...) : unable to load shared library 
 '/home/sau103/R/i486-pc-linux-gnu-library/2.11/affyio/libs/affyio.so':
  /home/sau103/R/i486-pc-linux-gnu-library/2.11/affyio/libs/affyio.so: undefined symbol: egzread
ERROR: loading failed
* removing ‘/home/sau103/R/i486-pc-linux-gnu-library/2.11/affyio’

A quick email to the Bioconductor mailing list put me in touch with the very helpful Martin Morgan, who suggested that I check my zlib libraries. Sure enough, the rogue “egzread” was defined in /usr/local/include/zlibemboss.h, which sat alongside a second zlib.h in /usr/local/include, in addition to the usual /usr/include/zlib.h.

$ grep egz /usr/local/include/zlibemboss.h
#define gzread egzread

I moved the rogue zlib.h out of /usr/local/include and order was restored.

So in summary, watch out when installing EMBOSS on Ubuntu – it seems to mess with things that it should not.