Nature on reproducible research

I imagine that most people, when asked “do you think that independent confirmation of research findings is important?” would answer “yes”. I also imagine, when told that in most cases this is not possible, that those people might be concerned or perhaps incredulous. However, this really is the case, which is why I spend much of my working life in a state of concern and incredulity.

Over the years, many articles have been written on how to improve this state of affairs by adopting best practices, collectively termed reproducible research. One of the latest is an editorial in Nature. I’ve pulled out a few quotes for discussion.

To ensure their results are reproducible, analysts should show their workings
These words form the article summary. Absolutely. No arguments about that one.

…how should researchers document and report their use of software?
Properly, is the short and simple answer to that one.
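As a minimal sketch of what “properly” could mean in practice (an illustration only, not something the editorial proposes), an analysis script can write out the interpreter, operating system and package versions it ran with. The function name describe_environment and the package names in the example are hypothetical placeholders; the calls themselves come from the Python standard library.

```python
import json
import platform
from importlib import metadata


def describe_environment(packages):
    """Return a dict describing the interpreter, OS and package versions.

    `packages` is a list of distribution names chosen by the analyst;
    anything that is not installed is recorded as None rather than guessed.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Placeholder package names; substitute whatever the analysis imports.
    report = describe_environment(["numpy", "pandas"])
    print(json.dumps(report, indent=2))
```

Saving that output alongside the results does not make an analysis reproducible by itself, but it does answer the “which software, which version?” question that methods sections so often leave open.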

…many in the field merely shrug their shoulders and insist that is how things are done
Here, the field is genomics and “how things are done” is “in a non-reproducible way.” I don’t think this is entirely accurate. In my experience, most researchers express profound regret that this is “how things are done” but make excuses for it: lack of time, lack of incentives and a focus on results at all costs are frequently cited factors.

Nature does not require authors to make code available…
Two thoughts: (1) it should and (2) why not? Let’s be honest: work in which the major focus is bioinformatics does not generally appear in Nature. However, Nature publishes other types of research with a strong computational component (physics, climate science). Is their concern that imposing standards might dissuade authors from submission? I have heard that public databases are reluctant to enforce strict standards for this reason, but perhaps that is cynical speculation.

…but we do expect a description detailed enough to allow others to write their own code to do a similar analysis
Without analysing the methods section of many Nature articles, I have no evidence that this is not the case – but I seriously doubt that it is. Anyone who has ever tried to reconstruct an analysis from a description in any journal article, never mind Nature, knows that it is rarely easy and often impossible. There’s no substitute for the code.

Some in the field say that it should be enough to publish only the original data and final results…
The article contains a lot of anonymous “some”, “others” and “many”. Presumably, they’re too ashamed to go on record. I can’t imagine anyone in bioinformatics who would say this.

…given the complexity of the analyses, is it [transparency] realistic?
It sure is. That’s what computers do: repeat calculations, over and over. The article implies, in fact, that the 1000 Genomes project does little else but mechanised number crunching.

…it is important that the community consider such solutions [workflows]
This closing sentence seems rather wishy-washy and at odds with the strong article summary. And to digress, “the community” is one of those old-fashioned, clichéd science terms that, personally, I can’t stand – along with “the field”.
However, putting that aside: I hope that Nature and other journals count themselves as part of a community – the community of modern, forward-looking twenty-first century science. This community acknowledges that when it comes to reproducible research, the traditional journal article and publication process is a large part of the problem. Journals can take a lead role by (1) enforcing standards; (2) insisting on good reproducible research practices; (3) providing or recommending repositories for code/data and (4) more generally, going beyond the “designed for printing on dead trees” mentality that still persists, over 20 years after the birth of the World Wide Web.

Full disclosure: I’m on the editorial board of BMC Open Research Computation, a journal founded with the explicit aim of promoting and publishing reproducible research.

3 thoughts on “Nature on reproducible research”

  1. I’m not sure making “the code” available for published results would be much good unless some sort of official platform is established first (maybe a certain distribution of Linux with specified packages installed, with a requirement for publishing being that the code runs under this system). When publishing a tool, a huge amount of work goes into making sure that the code is flexible enough to run under a large number of conditions. Typically the code behind published analyses is not going to be up to this caliber and won’t run outside the authors’ own environment, making true reproduction of results impossible. Sure, you could hack the code to get it to work, but then if you get differences, was this due to your hacking?

    • That’s true, merely making code available doesn’t mean that others can use it (although ideally, that would be the case). There are hardware/platform issues, for example if an HPC solution (e.g. a cluster) was used. On the other hand, making code available at least enables others to inspect it for obvious problems. There has been at least one high-profile case where bugs in in-house code led ultimately to retractions.

  2. Simply supplying a virtual machine image and some simple instructions, e.g. “open a terminal and type make”, would be a simple solution. It wouldn’t constrain everyone to a homogeneous environment, and it frees the user from worrying about other people’s runtime environments.

    As some people may already start their analyses this way (e.g. starting a 16S analysis by downloading a QIIME virtual machine) this may catch on …
