Gene names, data corruption and Excel: a 2021 update

It’s an old favourite of this blog, isn’t it? We had Gene name errors and Excel: lessons not learned (2012), followed by Data corruption using Excel: 12+ years and counting (2016). And, perhaps most depressingly of all, the conclusion of the trilogy: When your tools are broken, just change the data (2019-20).

Well, I’m happy (?) to see the publication of the latest instalment, inspired in part by the title of my first post: Gene name errors: Lessons not learned, from Mark Ziemann’s group. Here’s the accompanying Twitter thread. Summary: it’s even worse than we thought.

Tagging this one with the R tag, because the group are publishing monthly RMarkdown reports here. Congratulations Nature Communications!

As a footnote: you don’t escape this kind of thing when you leave bioinformatics. I listened to a colleague in a data science meeting yesterday declare that “we won’t be putting anything into production that relies on data supplied to us as spreadsheets”.

One thought on “Gene names, data corruption and Excel: a 2021 update”

  1. Spreadsheets are indeed the work of the Devil. And yet they have defenders, believe it or not. There are articles that pop up on Hacker News and the like from time to time claiming that dislike of spreadsheets is just elitist snobbery from people who can program, and that spreadsheets put data analysis in the hands of non-coders. But they are just so dangerous from the standpoint of data corruption, and they don’t encourage reproducibility.
