Gene names, data corruption and Excel: the final chapter?

I suppose that after:

it would be remiss of me not to mention: Microsoft fixes the Excel feature that was wrecking scientific data.

Is it really fixed though? Users have to know that the feature exists, find it and toggle a checkbox. Given that the users most “at risk” probably open CSV files in Excel by default simply by clicking on them…I’m not optimistic.

Still, as Mark said:

Price’s Protein Puzzle: 2023 update

One of the joys (?) of having been online for…quite some time now…is watching topics reappear every few years or so.

Yes, it’s Price’s Protein Puzzle which I last wrote about back in 2019. The good news is that my code still runs, so I’ve updated the results of an English word search versus the UniProt Reviewed (Swiss-Prot) protein database. Just for fun I threw in a few other languages too.

So what’s new?


Gene names, data corruption and Excel: a 2021 update

It’s an old favourite of this blog, isn’t it? We had Gene name errors and Excel: lessons not learned (2012). Followed by Data corruption using Excel: 12+ years and counting (2016). Perhaps most depressingly of all, the conclusion of the trilogy, When your tools are broken, just change the data (2019-20).

Well, I’m happy (?) to see the publication of the latest instalment, inspired in part by the title of my first post: Gene name errors: Lessons not learned, from Mark Ziemann’s group. Here’s the accompanying Twitter thread. Summary: it’s even worse than we thought.

Tagging this one with the R tag, because the group are publishing monthly RMarkdown reports here. Congratulations Nature Communications!

As a footnote: you don’t escape this kind of thing when you leave bioinformatics. I listened to a colleague in a data science meeting yesterday declare that “we won’t be putting anything into production that relies on data supplied to us as spreadsheets”.

When your tools are broken, just change the data

Update August 7 2020
The gene symbol renaming is now official. Here’s the publication (not open access, should be), coverage at The Verge and more coverage at The Register. The latter with quotes from me.

It’s been 3 years since we last visited that old favourite recurring topic, data corruption by Excel. Specifically, the unwanted auto-conversion of identifiers that look like dates, e.g. SEPT1, to – well, dates.

Here’s a new twist – well, a two-year-old twist in fact, as I don’t keep up to date with this field any longer:

Yes, in 2017 the HGNC decided that the solution to this long-standing issue is to rename the offending genes to prevent the auto-conversion. I’m yet to determine whether anything more came of the proposal.

It is, I suppose, a practical suggestion that will work. The newsletter states that:

Our initial consultation with the research community publishing on these genes had very mixed results

I bet it did. However, given that ongoing consultation with the research community about the inappropriate use of software has had essentially no results in 15+ years, perhaps it is the most effective solution to the problem.
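For anyone wondering which symbols are at risk, here is a minimal Python sketch of the kind of check you might run over a gene list before it goes anywhere near a spreadsheet. It is a rough heuristic only: the month list, the pattern and the function name looks_like_date are my own assumptions for illustration, not Excel's actual parsing rules.

```python
import re

# Month names and abbreviations that a spreadsheet might read as a date.
# This list is an assumption for illustration, not Excel's real rule set.
MONTHS = ("JAN", "FEB", "MAR", "MARCH", "APR", "MAY", "JUN", "JUL",
          "AUG", "SEP", "SEPT", "OCT", "NOV", "DEC")

# A month token, an optional hyphen, then a one- or two-digit number.
DATE_LIKE = re.compile(r"^(?:%s)-?(\d{1,2})$" % "|".join(MONTHS), re.IGNORECASE)


def looks_like_date(symbol):
    """Return True if a gene symbol resembles a month plus a day number."""
    match = DATE_LIKE.match(symbol.strip())
    if match is None:
        return False
    day = int(match.group(1))
    return 1 <= day <= 31


symbols = ["SEPT1", "MARCH1", "DEC1", "TP53", "BRCA1"]
at_risk = [s for s in symbols if looks_like_date(s)]
print(at_risk)
```

Anything the check flags is worth importing as text rather than letting the spreadsheet guess the column type.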

Price’s Protein Puzzle: 2019 update

Chains of amino acids strung together make up proteins and since each amino acid has a 1-letter abbreviation, we can find words (English and otherwise) in protein sequences. I imagine this pursuit began as soon as proteins were first sequenced, but the first reference to protein word-finding as a sport is, to my knowledge, “Price’s Protein Puzzle”, a letter to Trends in Biochemical Sciences in September 1987 [1].

Price wrote:

It occurred to me that TIBS could organise a competition to find the longest word […] contained within any known protein sequence.

The journal took up the challenge and published the winning entries in February 1988 [2]. The 7-letter winner was RERATED, with two 6-letter runners-up: LEADER and LIVELY. The sub-genre “biological words in protein sequences” was introduced almost one year later [3] with the discovery of ALLELE, then no more was heard until 1993 with Gonnet and Benner’s Nature correspondence “A Word in Your Protein” [4].

Noting that “none of the extensive literature devoted to this problem has taken a truly systematic approach” (it’s in Nature so one must declare superiority), this work is notable for two reasons. First, it discovered two 9-letter words: HIDALGISM and ENSILISTS. Second, it mentions the technique: a Patricia tree data structure, and that the search took 23 minutes.

Comments on this letter noted one protein sequence that ends with END [5] and the discovery of the 10-letter, but non-English, words ANNIDAVATE, WALLAWALLA and TARIEFKLAS [6].

I last visited this topic at my blog in 2008 and at someone else’s blog in 2015. So why am I here again? Because the Aho-Corasick algorithm in R, that’s why!
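To give a flavour of the approach, here is a small, self-contained Python sketch of Aho-Corasick matching applied to word-finding. It is not the R code behind this post: the function names, the toy word list and the made-up sequence are all assumptions for illustration. The idea is to build one automaton from the whole dictionary, then scan each sequence once, rather than searching for every word separately.

```python
from collections import deque

# One-letter codes of the 20 standard amino acids; words using any other
# letter (B, J, O, U, X, Z) are skipped in this sketch.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def build_automaton(words):
    """Build the trie, failure links and output lists for Aho-Corasick."""
    goto, fail, out = [{}], [0], [[]]
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first pass: shallower states get their failure links first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_words(sequence, words):
    """Return (start_index, word) for every dictionary word in the sequence."""
    usable = sorted({w.upper() for w in words
                     if w and set(w.upper()) <= AMINO_ACIDS})
    goto, fail, out = build_automaton(usable)
    hits, state = [], 0
    for i, ch in enumerate(sequence.upper()):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((i - len(word) + 1, word))
    return hits


# A made-up sequence, not a real protein.
hits = find_words("MKLEADERTEND", ["LEADER", "END", "ALLELE", "BOX", "TEN"])
print(hits)
```

Hits are reported in the order their last letter is reached, so overlapping words such as TEN and END both show up in a single pass.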


Twitter coverage of the Australian Bioinformatics & Computational Biology Society Conference 2017

You know the drill by now. Grab the tweets. Generate the report using RMarkdown. Push to Github. Publish the report.

This time it’s the Australian Bioinformatics & Computational Biology Society Conference 2017, including the COMBINE symposium. Looks like a good time was had by all in Adelaide.

A couple of quirks this time around. First, the rtweet package went through a brief phase of returning lists instead of nice data frames. I hope that’s been discarded as a bad idea :) There also seem to be additional columns, new column names and list-columns in the output from the latest search_tweets(), so there goes my previous code…

Second, given that most Twitter users have had 280 characters since about November 7, is this reflected in the conference tweets?

With thanks to Andrew Lonsdale for clearing up my confusion and pointing me to Twitter extended mode, the answer is “yes, somewhat”. Plenty of tweets are still hitting the 140 limit though: time to update those clients?

Twitter Coverage of the ISMB/ECCB Conference 2017

Search all the hashtags

ISMB (Intelligent Systems for Molecular Biology – which sounds rather old-fashioned now, doesn’t it?) is the largest conference for bioinformatics and computational biology. It is held annually and, when in Europe, jointly with the European Conference on Computational Biology (ECCB).

I’ve had the good fortune to attend twice: in Brisbane 2003 (very enjoyable early in my bioinformatics career, but unfortunately the seed for the “no more southern hemisphere meetings” decision), and in Toronto 2008. The latter was notable for its online coverage and for me, the pleasure of finally meeting in person many members of the online bioinformatics community.

The 2017 meeting (and its satellite meetings) was covered quite extensively on Twitter. My search using a variety of hashtags based on “ismb”, “eccb”, “17” and “2017” retrieved 9052 tweets, which form the basis of this summary. Code and raw data can be found at Github.

Usually I just let these reports speak for themselves but in this case, I thought it was worth noting a few points:

Twitter Coverage of the Bioinformatics Open Source Conference 2017

July 21-22 saw the 18th incarnation of the Bioinformatics Open Source Conference, which generally precedes the ISMB meeting. I had the great pleasure of attending BOSC way back in 2003 and delivering a short presentation on Bioperl. I knew almost nothing in those days, but everyone was very kind and appreciative.

My trusty R code for Twitter conference hashtags pulled out 3268 tweets and without further ado here is the Github repository, where you can view the markdown report in the code/R directory.

The ISMB/ECCB meeting wraps today and analysis of Twitter coverage for that meeting will appear here in due course.

Visualising Twitter coverage of recent bioinformatics conferences

Back in February, I wrote some R code to analyse tweets covering the 2017 Lorne Genome conference. It worked pretty well. So I reused the code for two recent bioinformatics meetings held in Sydney: the Sydney Bioinformatics Research Symposium and the VIZBI 2017 meeting.

So without further ado, here are the reports in markdown format, which display quite nicely when pushed to Github:

and you can dig around in the repository for the Rmarkdown, HTML and image files, if you like.