Data corruption using Excel: 12+ years and counting

Why, it seems like only 12 years since we read Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics.

And can it really be 4 years since we reviewed the topic of gene name corruption in Gene name errors and Excel: lessons not learned?

Well, here we are again in 2016 with Gene name errors are widespread in the scientific literature. This study examined 35 175 supplementary Excel data files from 3 597 published articles. Simple yet clever, isn’t it. I bet you wish you’d thought of doing that. I do. The conclusion: about 20% of the articles have associated data files in which gene names have been corrupted by Excel.

What if there is no tomorrow? There wasn’t one today.

We tell you not to use Excel. You counter with a host of reasons why you have to use Excel. None of them are good reasons. I don’t know what else to say. Except to reiterate that probably 80% or more of the data analyst’s time is spent on data cleaning and a good proportion of the dirt arises from avoidable errors.

2 thoughts on “Data corruption using Excel: 12+ years and counting

  1. Very helpful, but one more thing you could say would be to suggest programs other than Excel that we could use.

    • Yes indeed.

      I think there are 2 issues here. One is how do researchers view, sort, edit their own data. The other is how do they share with other people, e.g. supplemental data.

      For the first: if you must use a spreadsheet, I believe that the date conversion issue does not occur in Google Sheets. Even better solutions would be (1) learn how to use a relational database such as MySQL or (2) learn how to work with plain text files (such as CSV). If you read the article in the first link, it contains advice on how to fix or troubleshoot the issue. Clearly most biologists have never read it.

      For the second: there is simply no reason to share data of this kind in Excel files. Most likely, you use software which exports the data as delimited text (tab-delimited, CSV etc.) So just share that directly instead. Skip the Excel import step which mangles the data.

Comments are closed.