Just when I was beginning to despair at the state of publicly available microarray data, someone sent me an article which…increased my despair.
The article is:
Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology (2009)
Keith A. Baggerly and Kevin R. Coombes
Ann. Appl. Stat. 3(4): 1309-1334
It escaped my attention last year, in part because “Annals of Applied Statistics” is not high on my journal radar. However, other bloggers did pick it up: see posts at Reproducible Research Ideas and The Endeavour.
In this article, the authors examine several papers which are, in their words, “purporting to use microarray-based signatures of drug sensitivity derived from cell lines to predict patient response.” They find that the results are not only difficult to reproduce; in several cases they cannot be reproduced at all, because of simple, avoidable errors. In the introduction, they note that:
…a recent survey [Ioannidis et al. (2009)] of 18 quantitative papers published in Nature Genetics in the past two years found reproducibility was not achievable even in principle for 10.
You can get an idea of how bad things are by skimming through the sub-headings in the article. Here’s a selection of them:
- Training data sensitive/resistant labels are reversed
- Heatmaps show sample duplication in the test data
- Only 84/122 test samples are distinct; some samples are labeled both sensitive and resistant
- At least 3/8 of the test data is incorrectly labeled resistant
- Two of the “outlier” genes are not on the arrays used
- Genes are offset, and sensitive/resistant labels are reversed for pemetrexed
- Treatment is confounded with run date
- The gene list doesn’t match the heatmap
- Sensitive/resistant label reversal is common
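
Several of these problems, such as duplicated samples and samples labeled both sensitive and resistant, are the kind of thing a few lines of code can catch before publication. What follows is a minimal sketch in Python, not the authors' method: the function names, the toy data and the dictionary layout are all my own assumptions, and it only finds exact duplicates (identical values in every position), which is what you would expect when a column has been copied by mistake.

```python
from collections import defaultdict


def find_duplicate_samples(expression):
    """Group sample names whose expression profiles are exactly identical.

    `expression` is a hypothetical layout: it maps a sample name to a
    sequence of expression values, one per probe.
    """
    groups = defaultdict(list)
    for name, values in expression.items():
        groups[tuple(values)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


def find_conflicting_labels(duplicate_groups, labels):
    """Return the duplicate groups whose members carry more than one label."""
    return [
        group
        for group in duplicate_groups
        if len({labels[name] for name in group}) > 1
    ]


# Toy data, invented purely for illustration
expression = {
    "s1": [1.2, 3.4, 5.6],
    "s2": [2.0, 2.1, 2.2],
    "s3": [1.2, 3.4, 5.6],  # same profile as s1
    "s4": [9.9, 0.1, 4.2],
}
labels = {
    "s1": "sensitive",
    "s2": "resistant",
    "s3": "resistant",  # conflicts with its duplicate, s1
    "s4": "sensitive",
}

duplicates = find_duplicate_samples(expression)
conflicts = find_conflicting_labels(duplicates, labels)
n_distinct = len({tuple(values) for values in expression.values()})

print(f"{n_distinct}/{len(expression)} samples are distinct")
print("duplicate groups:", duplicates)
print("groups with conflicting labels:", conflicts)
```

A check like this says nothing about which label is correct; it only flags where to start looking.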
Following a detailed analysis of several case studies, they conclude that:
…the most common errors are simple…conversely, it is our experience that the most simple errors are common.
Finally, the authors offer suggestions as to how research reproducibility might be improved. Not surprisingly, since they are statisticians who use R, they offer Sweave as part of the solution.
This is a great article, which deserves to be more widely read. It strikes me that most bioinformatics is “forensic bioinformatics”: picking through messy data in the hope of figuring out what’s going on – or perhaps, who committed the crime.