When your tools are broken, just change the data

October 10, 2019 (updated August 7, 2020) / nsaunders

Update August 7 2020
The gene symbol renaming is now official. Here’s the publication (not open access, should be), coverage at The Verge and more coverage at The Register. The latter with quotes from me.

It’s been 3 years since we last visited that old favourite recurring topic, data corruption by Excel. Specifically, the unwanted auto-conversion of identifiers that look like dates, e.g. SEPT1, to – well, dates.

Here’s a new twist – well, a two-year-old twist in fact, as I don’t keep up to date with this field any longer:

TIL that SEPT genes were renamed in 2017 to SEPTIN genes by the HGNC https://t.co/2UadZUMLCS pic.twitter.com/jCo0Hcf6sf

— mdziemann (@mdziemann) October 8, 2019

Yes, in 2017 the HGNC decided that the solution to this long-standing issue is to rename the offending genes to prevent the auto-conversion. I’m yet to determine whether anything more came of the proposal.

It is, I suppose, a practical suggestion that will work. The newsletter states that:

Our initial consultation with the research community publishing on these genes had very mixed results

I bet it did. However, given that ongoing consultation with the research community about the inappropriate use of software has had essentially no results in 15+ years, perhaps it is the most effective solution to the problem.
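
For anyone still stuck with spreadsheets, here is a minimal shell sketch of a check that flags identifiers likely to be mangled before they go anywhere near Excel. The function name flag_date_like and the list of month-like prefixes are assumptions for illustration; this is not an official HGNC list and it will not catch every identifier that Excel converts.

```shell
# Read one symbol per line on stdin and print those that look like
# "month name + day number", e.g. SEPT1 or MARCH1.
# The prefix list is an illustrative assumption, not a complete catalogue.
flag_date_like() {
  grep -E -i '^(JAN|FEB|MAR|MARCH|APR|MAY|JUN|JUL|AUG|SEP|SEPT|OCT|NOV|DEC)[0-9]{1,2}$'
}

# Example: prints SEPT1, MARCH1 and DEC1; TP53 and SEPTIN1 pass untouched.
printf '%s\n' SEPT1 MARCH1 TP53 SEPTIN1 DEC1 | flag_date_like
```

Note that the renamed symbol SEPTIN1 is not flagged, which is the whole point of the renaming.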

Price’s Protein Puzzle: 2019 update

January 30, 2019 (updated October 11, 2019) / nsaunders

Chains of amino acids strung together make up proteins and since each amino acid has a 1-letter abbreviation, we can find words (English and otherwise) in protein sequences. I imagine this pursuit began as soon as proteins were first sequenced, but the first reference to protein word-finding as a sport is, to my knowledge, “Price’s Protein Puzzle”, a letter to Trends in Biochemical Sciences in September 1987 [1].

Price wrote:

It occurred to me that TIBS could organise a competition to find the longest word […] contained within any known protein sequence.

The journal took up the challenge and published the winning entries in February 1988 [2]. The 7-letter winner was RERATED, with two 6-letter runners-up: LEADER and LIVELY. The sub-genre “biological words in protein sequences” was introduced almost one year later [3] with the discovery of ALLELE, then no more was heard until 1993 with Gonnet and Benner’s Nature correspondence “A Word in Your Protein” [4].

Noting that “none of the extensive literature devoted to this problem has taken a truly systematic approach” (it’s in Nature so one must declare superiority), this work is notable for two reasons. First, it discovered two 9-letter words: HIDALGISM and ENSILISTS. Second, it mentions the technique: a Patricia tree data structure, and that the search took 23 minutes.

Comments on this letter noted one protein sequence that ends with END [5] and the discovery of the 10-letter, but non-English, words ANNIDAVATE, WALLAWALLA and TARIEFKLAS [6].

I last visited this topic at my blog in 2008 and at someone else’s blog in 2015. So why am I here again? Because the Aho-Corasick algorithm in R, that’s why!
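
To make the game concrete, here is a naive shell sketch of word-finding. It tests one candidate word at a time, so it is not the Aho-Corasick approach, which matches a whole dictionary in a single pass over the sequence. The function name find_words and the sequence are made up for illustration; the sequence is not a real protein.

```shell
# find_words SEQUENCE WORD...
# Print each candidate word that occurs as a substring of SEQUENCE.
find_words() {
  sequence=$1
  shift
  for word in "$@"; do
    case $sequence in
      *"$word"*) printf '%s\n' "$word" ;;
    esac
  done
}

# Hypothetical sequence; prints LEADER and LIVELY but not ALLELE.
find_words MKLEADERGHLIVELYQ LEADER LIVELY ALLELE
```

This is fine for a handful of words. With a full dictionary and a whole protein database, the cost of checking every word against every sequence is what makes a multi-pattern algorithm attractive.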

Data corruption using Excel: 12+ years and counting

August 25, 2016 (updated October 6, 2016) / nsaunders / 2 Comments

Why, it seems like only 12 years since we read Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics.

And can it really be 4 years since we reviewed the topic of gene name corruption in Gene name errors and Excel: lessons not learned?

Well, here we are again in 2016 with Gene name errors are widespread in the scientific literature. This study examined 35 175 supplementary Excel data files from 3 597 published articles. Simple yet clever, isn’t it. I bet you wish you’d thought of doing that. I do. The conclusion: about 20% of the articles have associated data files in which gene names have been corrupted by Excel.

What if there is no tomorrow? There wasn’t one today.

We tell you not to use Excel. You counter with a host of reasons why you have to use Excel. None of them are good reasons. I don’t know what else to say. Except to reiterate that probably 80% or more of the data analyst’s time is spent on data cleaning and a good proportion of the dirt arises from avoidable errors.

R 3.1 -> 3.2 upgrade notes

April 20, 2015 / nsaunders / 6 Comments

My machines upgraded from R version 3.1.3 to version 3.2.0 last week, which means that existing code suddenly cannot find packages and so fails. Some notes to myself, possibly useful to others, for what to do when this happens. Relevant to Ubuntu-based systems (I use Linux Mint).

1. Update packages

cp -r ~/R/x86_64-pc-linux-gnu-library/3.1 ~/R/x86_64-pc-linux-gnu-library/3.2

and then in R:

update.packages(checkBuilt=TRUE, ask=FALSE)

1.1. rJava issues
My rJava installation failed because code was trying to compile against jni.h which was not present on my system. Solution:

sudo apt-get install openjdk-7-jdk
sudo R CMD javareconf

and then in R:

install.packages("rJava")

2. Update Bioconductor
Bioconductor is also upgraded so requires more than a package update. Probably need a new R session for this one.

remove.packages("BiocInstaller")
source("http://bioconductor.org/biocLite.R")
biocLite()

2.1. ChemmineR
My Bioconductor ChemmineR update failed because package gridExtra was absent:

install.packages("gridExtra")
biocLite("ChemmineR")

3. General issues
When R is installed on Linux Mint, some packages are installed by default in /usr/lib/R/library. When performing updates as a non-root user, you’ll see messages telling you that this location is not writable and asking if you want to use your own library location. If you reply “yes”, you’ll have packages in both system and user locations. It’s probably better to say “no” and let the Ubuntu package management system handle the package upgrades…although when I tried that, the entire upgrade process halted…

And now we are all done so (careful!):

rm -rf ~/R/x86_64-pc-linux-gnu-library/3.1

Hell is other people’s data

July 30, 2014 / nsaunders / 12 Comments

“Take a look at the TP53 mutation database“, my colleague suggested. “OK then, I will”, I replied.

I present what follows as “a typical day in the life of a bioinformatician”.

Tool tip: dropbox-restore

July 23, 2014 / nsaunders

I’m currently rather sleep-deprived and prone to doing stupid things. Like this, for example:

rsync -av ~/Dropbox /path/to/backup/directory/

where the directory /path/to/backup/directory already contains a much older Dropbox directory. So when I set up a new machine, install Dropbox and copy the Dropbox directory back to its default location – hey! What happened to all my space? What are all these old files? Oh wait…I forgot the --delete option:

rsync -av --delete ~/Dropbox /path/to/backup/directory/

Now, files can be restored of course, but not when there are thousands of them and I don’t even know what’s old and new. What I want to do is restore the directories under ~/Dropbox to the state that they were in yesterday, before I stuffed up.

Luckily Chris Clark wrote dropbox-restore. It does exactly what it says on the tin. For example:

python restore.py /Camera\ Uploads 2014-07-22

Thanks Chris!
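
A note for the next sleep-deprived session: rsync has a dry-run option (-n, or --dry-run) that lists what would be copied and, with --delete, what would be removed, without touching anything. A minimal sketch, where the wrapper name preview_sync is an assumption for illustration:

```shell
# preview_sync SRC DEST
# Dry run (-n): list what rsync would copy and what --delete would remove
# from DEST, without changing anything on disk.
preview_sync() {
  rsync -avn --delete "$1" "$2"
}

# Hypothetical usage, with the same placeholder path as above:
# preview_sync ~/Dropbox /path/to/backup/directory/
```

Stale files in the backup show up in the output as lines beginning with "deleting". Once the list looks right, run the same command without -n.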

utils4bioinformatics: all those “little snippets” in one place

June 23, 2014 / nsaunders / 6 Comments

Over the years, I’ve written a lot of small “utility scripts”. You know the kind of thing. Little code snippets that facilitate research, rather than generate research results. For example: just what are the fields that you can use to qualify Entrez database searches?

Typically, they end up languishing in long-forgotten Dropbox directories. Sometimes, the output gets shared as a public link. No longer! As of today, “little code snippets that do (hopefully) useful things” have a new home at Github.

Also as of today: there’s not much there right now, just the aforementioned Entrez database code and output. I’m not out to change the world here, just to do a little better.

On the road: CSS and eResearch Conference 2014

March 20, 2014 / nsaunders

Next week I’ll be in Melbourne for one of my favourite meetings, the annual Computational and Simulation Sciences and eResearch Conference.

The main reason for my visit is the Bioinformatics FOAM workshop. Day 1 (March 27) is not advertised since it is an internal CSIRO day, but I’ll be presenting a talk titled “SQL, noSQL or no database at all? Are databases still a core skill?“. Day 2 (March 28) is open to all and I’ll be talking about “Learning from complete strangers: social networking for bioinformaticians“.

I imagine these and other talks will appear on Slideshare soon, at both my account and that of the Australian Bioinformatics Network.

I’m also excited to see that Victoria Stodden is presenting a keynote at the main CSS meeting (PDF) on “Reproducibility in Computational Science: Opportunities and Challenges”.

Hope to see some of you there.

Quilt plots. Like heat maps, only…heat maps

January 16, 2014 / nsaunders / 23 Comments

Stephen tweets:

Quilt Plots: A Simple Tool for the #Visualisation of Large Epidemiological Data http://t.co/mjYCo0nRTv

— Stephen Rudd (@SAGRudd) January 15, 2014

A “quilt plot”

Quilt plots. Sounds interesting. The link points to a short article in PLoS ONE, containing a table and a figure. Here is Figure 1.

If you looked at that and thought “Hey, that’s a heat map!”, you are correct. That is a heat map. Let’s be quite clear about that. It’s a heat map.

So, how do the authors justify publishing a method for drawing heat maps and then calling them “quilt plots”?
Credit for code: enough with the half-measures already

January 6, 2014 / nsaunders / 8 Comments

May as well begin 2014 where we left off: complaining about the attitude of scientific publishers regarding reproducible computational research.

What You're Doing Is Rather Desperate

Notes from the life of a [data] scientist

computing