It’s been 3 years since we last visited that old favourite recurring topic, data corruption by Excel. Specifically, the unwanted auto-conversion of identifiers that look like dates, e.g. SEPT1, to – well, dates.
Here’s a new twist – well, a two year-old twist in fact, as I don’t keep up to date with this field any longer:
Yes, in 2017 the HGNC decided that the solution to this long-standing issue is to rename the offending genes to prevent the auto-conversion. I’m yet to determine whether anything more came of the proposal.
It is I suppose a practical suggestion that will work. The newsletter states that:
Our initial consultation with the research community publishing on these genes had very mixed results
I bet it did. However, given that ongoing consultation with the research community about the inappropriate use of software has had essentially no results in 15+ years, perhaps it is the most effective solution to the problem.
I’m not saying this is a good idea, but bear with me.
Chains of amino acids strung together make up proteins and since each amino acid has a 1-letter abbreviation, we can find words (English and otherwise) in protein sequences. I imagine this pursuit began as soon as proteins were first sequenced, but the first reference to protein word-finding as a sport is, to my knowledge, “Price’s Protein Puzzle”, a letter to Trends in Biochemical Sciences in September 1987 .
It occurred to me that TIBS could organise a competition to find the longest word […] contained within any known protein sequence.
The journal took up the challenge and published the winning entries in February 1988 . The 7-letter winner was RERATED, with two 6-letter runners-up: LEADER and LIVELY. The sub-genre “biological words in protein sequences” was introduced almost one year later  with the discovery of ALLELE, then no more was heard until 1993 with Gonnet and Benner’s Nature correspondence “A Word in Your Protein” .
Noting that “none of the extensive literature devoted to this problem has taken a truly systematic approach” (it’s in Nature so one must declare superiority), this work is notable for two reasons. First, it discovered two 9-letter words: HIDALGISM and ENSILISTS. Second, it mentions the technique: a Patricia tree data structure, and that the search took 23 minutes.
Comments on this letter noted one protein sequence that ends with END  and the discovery of 10-letter, but non-English words ANNIDAVATE, WALLAWALLA and TARIEFKLAS .
I last visited this topic at my blog in 2008 and at someone else’s blog in 2015. So why am I here again? Because the Aho-Corasick algorithm in R, that’s why!
You know the drill by now. Grab the tweets. Generate the report using RMarkdown. Push to Github. Publish the report.
This time it’s the Australian Bioinformatics & Computational Biology Society Conference 2017, including the COMBINE symposium. Looks like a good time was had by all in Adelaide.
A couple of quirks this time around. First, the rtweet package went through a brief phase of returning lists instead of nice data frames. I hope that’s been discarded as a bad idea :) There also seem to be additional columns, new column names and list-columns in the output from the latest
search_tweets(), so there goes my previous code…
Second, given that most Twitter users have had 280 characters since about November 7, is this reflected in the conference tweets?
With thanks to Andrew Lonsdale for clearing up my confusion and pointing me to Twitter extended mode, the answer is “yes, somewhat”. Plenty of tweets are still hitting the 140 limit though: time to update those clients?
Search all the hashtags
ISMB (Intelligent Systems for Molecular Biology – which sounds rather old-fashioned now, doesn’t it?) is the largest conference for bioinformatics and computational biology. It is held annually and, when in Europe, jointly with the European Conference on Computational Biology (ECCB).
I’ve had the good fortune to attend twice: in Brisbane 2003 (very enjoyable early in my bioinformatics career, but unfortunately the seed for the “no more southern hemisphere meetings” decision), and in Toronto 2008. The latter was notable for its online coverage and for me, the pleasure of finally meeting in person many members of the online bioinformatics community.
The 2017 meeting (and its satellite meetings) were covered quite extensively on Twitter. My search using a variety of hashtags based on “ismb”, “eccb”, “17” and “2017” retrieved 9052 tweets, which form the basis of this summary. Code and raw data can be found at Github.
Usually I just let these reports speak for themselves but in this case, I thought it was worth noting a few points:
July 21-22 saw the 18th incarnation of the Bioinformatics Open Source Conference, which generally precedes the ISMB meeting. I had the great pleasure of attending BOSC way back in 2003 and delivering a short presentation on Bioperl. I knew almost nothing in those days, but everyone was very kind and appreciative.
My trusty R code for Twitter conference hashtags pulled out 3268 tweets and without further ado here is the Github repository, where you can view the markdown report in the
The ISMB/ECCB meeting wraps today and analysis of Twitter coverage for that meeting will appear here in due course.
Back in February, I wrote some R code to analyse tweets covering the 2017 Lorne Genome conference. It worked pretty well. So I reused the code for two recent bioinformatics meetings held in Sydney: the Sydney Bioinformatics Research Symposium and the VIZBI 2017 meeting.
So without further ado, here are the reports in markdown format, which display quite nicely when pushed to Github:
and you can dig around in the repository for the Rmarkdown, HTML and image files, if you like.
Why, it seems like only 12 years since we read Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics.
And can it really be 4 years since we reviewed the topic of gene name corruption in Gene name errors and Excel: lessons not learned?
Well, here we are again in 2016 with Gene name errors are widespread in the scientific literature. This study examined 35 175 supplementary Excel data files from 3 597 published articles. Simple yet clever, isn’t it. I bet you wish you’d thought of doing that. I do. The conclusion: about 20% of the articles have associated data files in which gene names have been corrupted by Excel.
What if there is no tomorrow? There wasn’t one today.
We tell you not to use Excel. You counter with a host of reasons why you have to use Excel. None of them are good reasons. I don’t know what else to say. Except to reiterate that probably 80% or more of the data analyst’s time is spent on data cleaning and a good proportion of the dirt arises from avoidable errors.
Just a short note to alert you to a publication with my name on it. Great work by lead author and former colleague Aidan; I just did “the Gephi stuff”. If you’re interested in bioinformatics applications of Apache Spark, take a look at:
VariantSpark: population scale clustering of genotype information
Happy to report it is open access.
After my previous post on extracting virus hosts from NCBI Taxonomy web pages, Pierre wrote:
An excellent idea and here’s my first attempt.
Here’s a count of hosts. By the way NCBI, it’s environment.
cut -f4 virus_host.tsv | sort | uniq -c
1 fungi| plants| invertebrates
181 invertebrates| plants
7 invertebrates| vertebrates
115052 vertebrates| human
43 vertebrates| human stool
225 vertebrates| invertebrates
656 vertebrates| invertebrates| human