Tag Archives: reproducibility

New ways to butcher biological data using Excel

I must have a minor reputation as a critic of Excel in bioinformatics, since strangers are now sending contributions to my work email address. Thanks, you know who you are!

[Screenshot: the PLOS ONE paper “Online Survival Analysis Software to Assess the Prognostic Value of Biomarkers Using Transcriptomic Data in Non-Small-Cell Lung Cancer”]

When asked why I didn’t mask this email address, I replied “the authors didn’t”.

This week: Online Survival Analysis Software to Assess the Prognostic Value of Biomarkers Using Transcriptomic Data in Non-Small-Cell Lung Cancer. Scroll on down to supporting Table S1 and right there on the page, staring you in the face, is a rather unusual-looking microarray probeset ID.

I wonder if we should start collecting notable examples in one place?

To be fair, this is more human error than an issue with Excel per se, but I’m going to argue that Excel promotes sloppy data management by making minds lazy :)
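In the meantime, a crude screen catches the more obvious casualties. Here’s a minimal Python sketch (entirely my own, nothing to do with the paper) that flags identifiers resembling Excel dates or scientific-notation floats:

import re

# Flag identifiers that look like casualties of an Excel round-trip:
# gene symbols or clone IDs that have become dates ("2-Sep", "Mar-06")
# or floats ("2.31E+13"). The example IDs below are made up.
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
DATE_LIKE = re.compile(rf"^(\d{{1,2}}-({MONTHS})|({MONTHS})-\d{{2}})$")
FLOAT_LIKE = re.compile(r"^\d+(\.\d+)?E\+\d+$", re.IGNORECASE)

def looks_mangled(identifier: str) -> bool:
    """True if an identifier resembles an Excel date or scientific-notation float."""
    return bool(DATE_LIKE.match(identifier) or FLOAT_LIKE.match(identifier))

ids = ["1007_s_at", "2-Sep", "2.31E+13", "121_at"]
print([i for i in ids if looks_mangled(i)])  # ['2-Sep', '2.31E+13']

Run it over an identifier column before and after a spreadsheet touches the data; if the second pass flags rows the first did not, Excel has been at work.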

It’s #overlyhonestmethods come to life!

Retraction Watch reports on a study of microarray data sharing. The article, published in Clinical Chemistry, is itself behind a paywall despite trumpeting the virtues of open data. So straight to the Open Access Irony Award group at CiteULike it goes.

I was not surprised to learn that the rate of public deposition of data is low, nor that most deposited data ignores standards and much of it is low quality. What did catch my eye, though, was a retraction notice for one of the articles from the study, in which the authors explain the reason for retraction.
Read the rest…

Gene name errors and Excel: lessons not learned

June 23, 2004. BMC Bioinformatics publishes “Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics”. We roll our eyes. Do people really do that? Is it really worthy of publication? However, we admit that if it happens, it’s good that people know about it.
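For those who haven’t read it, the failure modes are simple: gene symbols such as SEPT2 and MARCH1 silently become dates, and RIKEN clone identifiers such as 2310009E13 become floating-point numbers. If a spreadsheet is truly unavoidable, one defensive workaround is to force identifiers to text on export. A small Python sketch of the idea (the file name is made up):

import csv

# Symbols like SEPT2 become dates, and IDs like 2310009E13 become floats,
# when Excel guesses cell types on import. Wrapping each value in ="..."
# makes Excel treat it as a literal text formula and leave it alone.
genes = ["SEPT2", "MARCH1", "2310009E13", "DEC1"]

with open("genes_protected.csv", "w", newline="") as fh:  # hypothetical file
    writer = csv.writer(fh)
    writer.writerow(["gene"])
    for g in genes:
        writer.writerow([f'="{g}"'])

Better still, of course: don’t put gene identifiers through Excel at all.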

October 17, 2012. A colleague on our internal Yammer network writes:
Read the rest…

Reproducibility: releasing code is just part of the solution

This week in Retraction Watch: Hypertension retracts paper over data glitch.

The retraction notice describes the “data glitch” in question (bold emphasis added by me):

…the authors discovered an error in the code for analyzing the data. The National Health and Nutrition Examination Survey (NHANES) medication data file had multiple observations per participant and was merged incorrectly with the demographic and other data files. Consequently, the sample size was twice as large as it should have been (24989 instead of 10198). Therefore, the corrected estimates of the total number of US adults with hypertension, uncontrolled hypertension, and so on, are significantly different and the percentages are slightly different.
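This class of bug is easy to reconstruct. Here’s a minimal pandas sketch with made-up data (the original analysis is not mine and, as far as I know, was not done in Python): a table with multiple rows per participant, merged naively onto a one-row-per-participant table, quietly inflates the sample.

import pandas as pd

# One row per participant:
demo = pd.DataFrame({"id": [1, 2, 3], "sbp": [118, 142, 151]})
# Multiple rows per participant, as in the NHANES medication file:
meds = pd.DataFrame({"id": [1, 1, 2, 2, 3], "drug": ["a", "b", "a", "c", "b"]})

naive = demo.merge(meds, on="id")
print(len(naive))  # 5 rows from 3 participants: the sample just grew

# Safer: collapse to one row per participant first, then assert the
# expected cardinality so a bad merge fails loudly instead of silently.
meds_flat = meds.groupby("id", as_index=False).agg(n_drugs=("drug", "size"))
checked = demo.merge(meds_flat, on="id", validate="one_to_one")
print(len(checked))  # 3

The validate argument is the point: declare the cardinality you expect and let the merge raise an error when the data disagree.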

Let’s leave aside the observation that 24989 is not 2 x 10198. I tweeted:

Not that simple, though, is it? Read on for the Twitter discussion.
Read the rest…

Nature on reproducible research

I imagine that most people, when asked “do you think that independent confirmation of research findings is important?” would answer “yes”. I also imagine, when told that in most cases this is not possible, that those people might be concerned or perhaps incredulous. However, this really is the case, which is why I spend much of my working life in a state of concern and incredulity.

Over the years, many articles have been written on how to improve this state of affairs by adopting best practices, collectively termed reproducible research. One of the latest is an editorial in Nature. I’ve pulled out a few quotes for discussion.
Read the rest…

Trust no-one: errors and irreproducibility in public data

Just when I was beginning to despair at the state of publicly-available microarray data, someone sent me an article which…increased my despair.

The article is:

Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology (2009)
Keith A. Baggerly and Kevin R. Coombes
Ann. Appl. Stat. 3(4): 1309–1334

It escaped my attention last year, in part because “Annals of Applied Statistics” is not high on my journal radar. However, other bloggers did pick it up: see posts at Reproducible Research Ideas and The Endeavour.

In this article, the authors examine several papers, in their words “purporting to use microarray-based signatures of drug sensitivity derived from cell lines to predict patient response.” They find that not only are the results difficult to reproduce, but in several cases they simply cannot be reproduced due to simple, avoidable errors. In the introduction, they note that:

…a recent survey [Ioannidis et al. (2009)] of 18 quantitative papers published in Nature Genetics in the past two years found reproducibility was not achievable even in principle for 10.

You can get an idea of how bad things are by skimming through the sub-headings in the article. Here’s a selection of them:
Read the rest…

Poor reproducibility: understandable, if not desirable

Greg Wilson once told me a statistic concerning the mean lifetime of reproducibility for research software. That is, the time after which, on average, you can no longer reproduce your own results using your own code, never mind anyone else’s. I forget the exact number, but it was not high – a few months at best.

Why does this happen, aside from obvious bad practices? Well, here’s a typical exchange in an academic research setting:
Read the rest…