Genes x Samples: please explain

One of my bioinformatics pet peeves involves statements like this one, from the CNAmet user guide:

Inputs to CNAmet are three m x n matrices, where m is the number of genes and n the number samples

What we’re looking at here is the hot but poorly defined topic of data integration, in which biological measurements from two or more different platforms are somehow combined in a way that provides more information than each platform separately. Read any paper on this topic, download the software and you’ll find example datasets containing two or more matched matrices, in which each row holds measurements that have been summarized to a “gene”. What you won’t find, typically, is an explanation of the summarization procedure detailed enough that you could implement it yourself.


To their credit, the authors of CNAmet are quite clear that the procedure used to generate these matrices is not their problem:

Since the three microarray platforms contain non-overlapping probes, the m dimension of the input matrices must match. This is because the problem of mapping measurements (probe to probe mapping) between different array types is not dealt with by CNAmet.

Two problems.

First, let’s face it, the very concept of an object called a “gene” is flawed; what we have in reality are fuzzy locations of transcriptional activity.

Second, some measurements summarize more readily than others. Exon expression arrays, for example, are frequently summarized to “gene level” by taking the median measurement of probesets in a transcript cluster. For copy number arrays, we might typically segment the measurements over each chromosome, then assign a number to a “gene” by determining overlap between gene and segment. However, something like a methylation array is more difficult: probes map to different CpG island-associated features (islands, shores, shelves), so which do we use?
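To make the two “easy” cases concrete, here is a minimal sketch in plain Python. It is not the procedure of any particular tool or paper. The function names, the input layouts (dictionaries keyed by probeset ID, segments as tuples with half-open coordinates) and, in particular, the length-weighted mean used when a gene spans more than one segment are all my own assumptions for illustration.

```python
from statistics import median


def summarize_expression(probesets, cluster_of):
    """Median probeset value per transcript cluster, for one sample.

    probesets:  dict of probeset ID -> measurement (hypothetical layout)
    cluster_of: dict of probeset ID -> transcript cluster ID (hypothetical)
    """
    by_cluster = {}
    for probeset_id, value in probesets.items():
        cluster = cluster_of.get(probeset_id)
        if cluster is None:
            continue  # assumption: unannotated probesets are simply dropped
        by_cluster.setdefault(cluster, []).append(value)
    return {cluster: median(values) for cluster, values in by_cluster.items()}


def gene_copy_number(gene, segments):
    """Assign one copy number value to a gene from overlapping segments.

    gene:     (chrom, start, end), half-open coordinates
    segments: iterable of (chrom, start, end, value), half-open coordinates
    Returns the overlap-length-weighted mean of segment values, or None if
    no segment overlaps the gene. The weighting is an assumption; other
    rules (largest overlap, most extreme value) are equally defensible.
    """
    chrom, g_start, g_end = gene
    weighted_sum = 0.0
    total_overlap = 0
    for s_chrom, s_start, s_end, value in segments:
        if s_chrom != chrom:
            continue
        overlap = min(g_end, s_end) - max(g_start, s_start)
        if overlap > 0:
            weighted_sum += value * overlap
            total_overlap += overlap
    return weighted_sum / total_overlap if total_overlap else None
```

Even here there are choices hiding in every line: what to do with unannotated probesets, how to handle a gene that straddles a breakpoint, which gene coordinates to use in the first place. Which is rather the point.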

Our group recently looked at several publications that tried to integrate measurements of methylation and gene expression. We found at least half a dozen ways of generating the “genes x samples” matrices, from selecting one probe per gene using particular criteria (e.g. highest variance) to complex clustering procedures based on chromosome coordinates. In one horror show of a study, the authors decided that it was fine to combine methylation data from their study with completely unrelated, publicly available expression data. Why the reviewers and editors agreed is anyone’s guess.
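For what it’s worth, the simplest of those approaches, one probe per gene chosen by highest variance, takes only a few lines to write down. The sketch below is my own illustration, not code from any of the publications we looked at. The function name and input layout are hypothetical, and I have assumed population variance across samples, with the first probe encountered winning any tie.

```python
from statistics import pvariance


def most_variable_probe_per_gene(values, gene_of):
    """Pick, for each gene, the probe with the highest variance across samples.

    values:  dict of probe ID -> list of per-sample measurements (hypothetical)
    gene_of: dict of probe ID -> gene symbol (hypothetical annotation)
    Returns dict of gene -> (chosen probe ID, its per-sample measurements).
    """
    best = {}
    for probe_id, row in values.items():
        gene = gene_of.get(probe_id)
        if gene is None:
            continue  # assumption: probes without a gene annotation are dropped
        variance = pvariance(row)
        # Strict ">" means the first probe seen is kept when variances tie.
        if gene not in best or variance > best[gene][1]:
            best[gene] = (probe_id, variance)
    return {gene: (probe_id, values[probe_id])
            for gene, (probe_id, _) in best.items()}
```

Note what this quietly decides on your behalf: a probe in a shelf 4 kb away can end up representing the “gene” just because it happens to vary the most.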

My second law of bioinformatics, then:

On no account must the data pre-processing steps required to summarize multi-platform measurements to gene level be revealed

Seriously, if you have a great idea about the best way to combine, for example, measurements from the Affymetrix Human Exon 1.0 ST array and the Illumina Infinium HumanMethylation450 BeadChip – go for it in the comments.