Microarrays, scan dates and Bioconductor: it shouldn’t be this difficult

When dealing with data from high-throughput experimental platforms such as microarrays, it’s important to account for potential batch effects. A simple example: if you process all your normal tissue samples this week and your cancerous tissue samples next week, you’re in big trouble. Differences between cancer and normal are now confounded with processing time and you may as well start over with new microarrays.

Processing date is often a good surrogate for batch and it was once easy to extract dates from Affymetrix CEL files using Bioconductor. It seems that this is no longer the case.

Once upon a time (about 10 months ago), I could write code to process CEL files from the Affymetrix Human Exon 1.0 ST array that looked like this:

# assuming a covdesc file in same directory as CEL files
exp.cel         <- read.affy(path = "data/celfiles")
exp.cel@cdfName <- "exon.pmcdf"
exp.rma         <- rma(exp.cel)

Getting the scan date could not have been easier:

pd <- pData(protocolData(exp.rma))
#                               ScanDate
# HCT_CON_A.CEL     2009-12-11T02:57:33Z
# HCT_CON_B.CEL     2009-12-11T00:56:07Z
# HCT_Treated_A.CEL 2009-12-11T04:27:12Z
# HCT_Treated_B.CEL 2009-12-11T01:26:20Z
# HT29_CON_A.CEL    2009-12-11T03:27:22Z
# HT29_CON_B.CEL    2009-12-11T03:56:58Z

Alas, this code now fails at the first hurdle (reading CEL files):

Error in read.affybatch(filenames = l$filenames, phenoData = l$phenoData,  : 
The affy package is not designed for this array type.
Please use either the oligo or xps package.


Fair enough. I’ve used the oligo package before, so we try that instead:

# assuming CEL files in ./data/celfiles
exp.cel <- read.celfiles(list.celfiles("data/celfiles", full.names = T))
exp.rma <- rma(exp.cel)

Wait – what happened to the dates?

pd <- pData(protocolData(exp.rma))
#                                              exprs dates
# HCT_CON_A.CEL         data/celfiles//HCT_CON_A.CEL      
# HCT_CON_B.CEL         data/celfiles//HCT_CON_B.CEL      
# HCT_Treated_A.CEL data/celfiles//HCT_Treated_A.CEL      
# HCT_Treated_B.CEL data/celfiles//HCT_Treated_B.CEL      
# HT29_CON_A.CEL       data/celfiles//HT29_CON_A.CEL      
# HT29_CON_B.CEL       data/celfiles//HT29_CON_B.CEL

# oligo has a runDate() method
# no joy there either
#  [1]                
# Levels:

OK – let’s look at xps. A prerequisite for installation is something called ROOT; not the super user, but a data analysis framework. If you’re on an Ubuntu-like system, ignore all the confusing advice around the Web about setting environment variables and just:

sudo apt-get install root-system

After xps installation we read the manual:

In contrast to other packages, which rely on the Bioconductor method for creating cdf environments, we need to create ROOT scheme files directly from the Affymetrix source files, which need to be downloaded first from the Affymetrix web site. However, once created, it is in principle possible to distribute the ROOT scheme files, too.

I’m sorry. There are only so many hoops that I’m willing to jump through in order to simply read from a file and this is a step too far.

So my current fix – use apt-cel-convert, part of the Affymetrix Power Tools suite, to convert to text and then grep for dates in the format MM/DD/YY:

apt-cel-convert -f text -o . myfile.CEL
grep ^DatHeader myfile.CEL | grep -ioP "\b\d+\/\d+\/\d+\b"
# 01/15/13

I don’t know yet whether those regular expressions work for all cases.

Have a better suggestion? Let’s hear it.

Update – for completeness I’m using R 3.0.1, Bioconductor 2.12, oligo 1.24.2 on Ubuntu 10.04. I’ve also used a CEL file from a different platform, HG U133A, which gave this:

# [1] [0..46114]  10434-10425 #799 133A 8-222-02:CLS=4733 RWS=4733 XIN=3  YIN=3  VE=17        # 2.0 08/22/02 12:08:07    \024  \024 HG-U133A.1sq \024  \024  \024  \024  \024  \024  \024  # \024  \024 6
# Levels: [0..46114]  10434-10425 #799 133A 8-222-02:CLS=4733 RWS=4733 XIN=3  YIN=3  VE=17        # 2.0 08/22/02 12:08:07    \024  \024 HG-U133A.1sq \024  \024  \024  \024  \024  \024  \024  # \024  \024 6

i.e. the entire DatHeader line has been stored as a factor.

3 thoughts on “Microarrays, scan dates and Bioconductor: it shouldn’t be this difficult

  1. No new release of Bioconductor kept legacy code/functions. There’s always something, even if small, that changes, no matter what. In your case, you’re lucky that it doesn’t change the overall result (maybe).

    • True; with the benefit of sleep I recall that this (loss of a favourite function) is common for Bioconductor. Just a little disconcerting to return to code which is not very old or complex and discover that it no longer works.

Comments are closed.