Gene names, data corruption and Excel: a 2021 update

It’s an old favourite of this blog, isn’t it. We had Gene name errors and Excel: lessons not learned (2012). Followed by Data corruption using Excel: 12+ years and counting (2016). Perhaps most depressingly of all, the conclusion of the trilogy, When your tools are broken, just change the data (2019-20).

Well, I’m happy (?) to see the publication of the latest instalment, inspired in part by the title of my first post: Gene name errors: Lessons not learned, from Mark Ziemann’s group. Here’s the accompanying Twitter thread. Summary: it’s even worse than we thought.

Tagging this one with the R tag, because the group are publishing monthly RMarkdown reports here. Congratulations Nature Communications!

As a footnote: you don’t escape this kind of thing when you leave bioinformatics. I listened to a colleague in a data science meeting yesterday declare that “we won’t be putting anything into production that relies on data supplied to us as spreadsheets”.

When your tools are broken, just change the data

Update August 7 2020
The gene symbol renaming is now official. Here’s the publication (not open access, should be), coverage at The Verge and more coverage at The Register. The latter with quotes from me.

It’s been 3 years since we last visited that old favourite recurring topic, data corruption by Excel. Specifically, the unwanted auto-conversion of identifiers that look like dates, e.g. SEPT1, to – well, dates.

Here’s a new twist – well, a two year-old twist in fact, as I don’t keep up to date with this field any longer:

Yes, in 2017 the HGNC decided that the solution to this long-standing issue is to rename the offending genes to prevent the auto-conversion. I’m yet to determine whether anything more came of the proposal.

It is I suppose a practical suggestion that will work. The newsletter states that:

Our initial consultation with the research community publishing on these genes had very mixed results

I bet it did. However, given that ongoing consultation with the research community about the inappropriate use of software has had essentially no results in 15+ years, perhaps it is the most effective solution to the problem.

Data corruption using Excel: 12+ years and counting

Why, it seems like only 12 years since we read Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics.

And can it really be 4 years since we reviewed the topic of gene name corruption in Gene name errors and Excel: lessons not learned?

Well, here we are again in 2016 with Gene name errors are widespread in the scientific literature. This study examined 35 175 supplementary Excel data files from 3 597 published articles. Simple yet clever, isn’t it. I bet you wish you’d thought of doing that. I do. The conclusion: about 20% of the articles have associated data files in which gene names have been corrupted by Excel.

What if there is no tomorrow? There wasn’t one today.

We tell you not to use Excel. You counter with a host of reasons why you have to use Excel. None of them are good reasons. I don’t know what else to say. Except to reiterate that probably 80% or more of the data analyst’s time is spent on data cleaning and a good proportion of the dirt arises from avoidable errors.

When life gives you coloured cells, make categories

Let’s start by making one thing clear. Using coloured cells in Excel to encode different categories of data is wrong. Next time colleagues explain excitedly how “green equals normal and red = tumour”, you must explain that (1) they have sinned and (2) what they meant to do was add a column containing the words “normal” and “tumour”.

I almost hesitate to write this post but…we have to deal with the world as it is, not as we would like it to be. So in the interests of just getting the job done: here’s one way to deal with coloured cells in Excel, should someone send them your way.
Continue reading

Gene name errors and Excel: lessons not learned

June 23, 2004. BMC Bioinformatics publishes “Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics”. We roll our eyes. Do people really do that? Is it really worthy of publication? However, we admit that if it happens then it’s good that people know about it.

October 17, 2012. A colleague on our internal Yammer network writes:
Read the rest…