Simple plots reveal interesting artifacts

I’ve recently been working with methylation data; specifically, from the Illumina Infinium HumanMethylation450 bead chip. It’s a rather complex array which uses two types of probes to determine the methylation state of DNA at ~ 485 000 sites in the genome.

The Bioconductor project has risen to the challenge with a (somewhat bewildering) variety of packages to analyse data from this chip. I’ve used lumi, methylumi and minfi, but will focus here on just the latter.

Update: a related post from Oliver Hofmann (@fiamh).

The data came to me in a standard directory and file structure (created, I think, using BeadStudio), as raw IDAT files with a sample sheet. So, reading the data into R was straightforward:

library(minfi)
library(IlluminaHumanMethylation450kmanifest)
baseDir <- "path/to/datadir"
targets <- read.450k.sheet(baseDir)
RGset   <- read.450k.exp(base = baseDir, targets = targets)

This creates an RGChannelSet:

RGChannelSet (storageMode: lockedEnvironment)
assayData: 622399 features, 108 samples 
  element names: Green, Red 
protocolData: none
phenoData
  sampleNames: 6057841162_R01C01 6057841162_R02C01 ...
    6264488081_R06C02 (108 total)
  varLabels: Sample_Name Sample_Well ... filenames (9 total)
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation: IlluminaHumanMethylation450k

The minfi package provides a function to convert the red/green channel values to raw intensity measurements representing methylated or unmethylated:

raw <- preprocessRaw(RGset)

That results in a MethylSet:

MethylSet (storageMode: lockedEnvironment)
assayData: 485577 features, 108 samples 
  element names: Meth, Unmeth 
protocolData: none
phenoData
  sampleNames: 6057841162_R01C01 6057841162_R02C01 ...
    6264488081_R06C02 (108 total)
  varLabels: Sample_Name Sample_Well ... filenames (9 total)
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation: IlluminaHumanMethylation450k 
Preprocessing Raw (no normalization or bg correction) minfi version 1.0.0 
Manifest version 0.2.0

The functions getMeth() and getUnmeth() extract the methylated/unmethylated intensities, respectively, to a matrix:

meth <- getMeth(raw)
dim(meth)
## [1] 485577    108

The MethylSet object, raw, has the same structure as the more familiar eSet, so we can extract phenotype data using pData():

names(pData(raw))
## [1] "Sample_Name"  "Sample_Well"  "Sample_Plate" "Sample_Group" "Pool_ID"     
## [6] "Array"        "Slide"        "Basename"     "filenames"

It turns out that for my data, the 108 samples were run in 9 groups of 12 and are ordered as such in the sample sheet: that is samples 1-12 = Group1, 13-24 = Group2 and so on, up to 97-108 = Group9. It also turns out that the variable pData(raw)$Array is ordered in groups of 6. So for the first 18 samples:

head(pData(raw)$Array, 18)
##  [1] "R01C01" "R02C01" "R03C01" "R04C01" "R05C01" "R06C01" "R01C02" "R02C02"
##  [9] "R03C02" "R04C02" "R05C02" "R06C02" "R01C01" "R02C01" "R03C01" "R04C01"
## [17] "R05C01" "R06C01"

Keep those groups and orderings in mind, as we generate a boxplot of raw methylated intensities for the 108 samples:

library(RColorBrewer)
cols <- brewer.pal(9, "Set3")
groups.uniq <- unique(pData(raw)$Sample_Group)
m <- match(pData(raw)$Sample_Group, groups.uniq)
boxplot(meth, col = cols[m])

Result on the right.

meth

Boxplot of raw intensities using minfi getMeth() method, Illumina 450K, 108 samples


Interesting and evidently, something “non-random” is going on. It looks as though we have 2 artifacts to consider. First, for these samples, groups 6-9 differ from groups 1-5. More interestingly, within groups, we see a cyclic signal where intensity rises gradually across the first 6 arrays, then falls and rises again for the second set of 6 arrays.

I love that when I tweet about this technical, esoteric stuff, I get instant and useful responses. Thanks for following: @iameltonjohn, @tweetingtechno and @fiamh.

Take-home message: very simple plots can be informative. Now to figure out what it all means.

3 thoughts on “Simple plots reveal interesting artifacts

  1. chris

    Like many others, batch effects seem to be the leitmotif of my career. I estimate that about 50% of my data “analysis” time – there’s a misnomer! – has been spent formatting data; another 35% or so adjusting batch effects; only 15% is actual analysis. This isn’t really reflected in my code, however, where the relevant proportions are roughly 10%, 50% and 40%! From which I must conclude that formatting is easy to deal with – once you’ve caught all the exceptions.

Comments are closed.