When your tools are broken, just change the data

October 10, 2019 (updated August 7, 2020) / nsaunders

Update August 7 2020
The gene symbol renaming is now official. Here’s the publication (not open access, though it should be), coverage at The Verge and more coverage at The Register, the latter with quotes from me.

It’s been 3 years since we last visited that old favourite recurring topic, data corruption by Excel. Specifically, the unwanted auto-conversion of identifiers that look like dates, e.g. SEPT1, to – well, dates.

Here’s a new twist – well, a two-year-old twist in fact, as I don’t keep up to date with this field any longer:

TIL that SEPT genes were renamed in 2017 to SEPTIN genes by the HGNC https://t.co/2UadZUMLCS pic.twitter.com/jCo0Hcf6sf

— mdziemann (@mdziemann) October 8, 2019

Yes, in 2017 the HGNC decided that the solution to this long-standing issue is to rename the offending genes to prevent the auto-conversion. I’m yet to determine whether anything more came of the proposal.

It is, I suppose, a practical suggestion that will work. The newsletter states that:

Our initial consultation with the research community publishing on these genes had very mixed results

I bet it did. However, given that ongoing consultation with the research community about the inappropriate use of software has had essentially no results in 15+ years, perhaps it is the most effective solution to the problem.
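For anyone who would rather check a gene list before it goes anywhere near a spreadsheet, here is a minimal sketch. The helper name find_date_like_symbols and the pattern itself are my own assumptions for illustration: month names or abbreviations followed by one or two digits. It is not an official HGNC or Excel rule, and real auto-conversion depends on locale and settings.

```shell
# Sketch: flag gene symbols that a spreadsheet may read as dates.
# Assumption: the risky pattern is a month name or abbreviation
# followed by 1-2 digits (e.g. SEPT1, MARCH1, DEC1). Illustrative only.

# Hypothetical helper: reads one symbol per line on stdin and
# prints the symbols that match the date-like pattern.
find_date_like_symbols() {
  grep -E -i '^(JAN|FEB|MAR|MARCH|APR|APRIL|MAY|JUN|JUNE|JUL|JULY|AUG|SEP|SEPT|OCT|NOV|DEC)[0-9]{1,2}$' || true
}

# Example: prints SEPT1, MARCH1 and DEC1, one per line.
printf '%s\n' SEPT1 TP53 MARCH1 BRCA2 DEC1 | find_date_like_symbols
```

The renamed form SEPTIN1 passes through unflagged, which is rather the point of the renaming.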

Data corruption using Excel: 12+ years and counting

August 25, 2016 (updated October 6, 2016) / nsaunders / 2 Comments

Why, it seems like only 12 years since we read Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics.

And can it really be 4 years since we reviewed the topic of gene name corruption in Gene name errors and Excel: lessons not learned?

Well, here we are again in 2016 with Gene name errors are widespread in the scientific literature. This study examined 35 175 supplementary Excel data files from 3 597 published articles. Simple yet clever, isn’t it? I bet you wish you’d thought of doing that. I do. The conclusion: about 20% of the articles have associated data files in which gene names have been corrupted by Excel.

What if there is no tomorrow? There wasn’t one today.

We tell you not to use Excel. You counter with a host of reasons why you have to use Excel. None of them are good reasons. I don’t know what else to say. Except to reiterate that probably 80% or more of the data analyst’s time is spent on data cleaning and a good proportion of the dirt arises from avoidable errors.
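If you do inherit a file that has already been through Excel, a rough check along these lines may help with the cleaning. This is a sketch under stated assumptions, not the method used in the study: tab-separated input, gene names in the first column, and corruption showing up as day-month strings such as 1-Sep. The helper name find_corrupted_names is hypothetical.

```shell
# Sketch: list entries in a gene-name column that look like
# spreadsheet-converted dates, e.g. "1-Sep" or "2-Mar".
# Assumptions: tab-separated input, gene names in column 1, and
# corruption appearing as <day>-<Mon>; other date formats and
# locales are not covered.

# Hypothetical helper: reads TSV on stdin, prints suspicious names.
find_corrupted_names() {
  cut -f1 | grep -E '^[0-9]{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$' || true
}

# Example: prints 1-Sep and 2-Mar, one per line.
printf 'TP53\t1.2\n1-Sep\t0.4\nBRCA2\t3.1\n2-Mar\t0.9\n' | find_corrupted_names
```

Note that this only finds the damage. It cannot tell you whether 1-Sep was originally SEPT1 or SEP1, which is why the corruption is so annoying.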

R 3.1 -> 3.2 upgrade notes

April 20, 2015 / nsaunders / 6 Comments

My machines upgraded from R version 3.1.3 to version 3.2.0 last week, which means that existing code suddenly cannot find packages and so fails. Some notes to myself, possibly useful to others, for what to do when this happens. Relevant to Ubuntu-based systems (I use Linux Mint).

1. Update packages

cp -r ~/R/x86_64-pc-linux-gnu-library/3.1 ~/R/x86_64-pc-linux-gnu-library/3.2

and then in R:

update.packages(checkBuilt=TRUE, ask=FALSE)

1.1. rJava issues
My rJava installation failed because the code was trying to compile against jni.h, which was not present on my system. Solution:

sudo apt-get install openjdk-7-jdk
sudo R CMD javareconf

and then in R:

install.packages("rJava")

2. Update Bioconductor
Bioconductor is also upgraded, so it requires more than a package update. You probably need a new R session for this one.

remove.packages("BiocInstaller")
source("http://bioconductor.org/biocLite.R")
biocLite()

2.1. ChemmineR
My Bioconductor ChemmineR update failed because package gridExtra was absent:

install.packages("gridExtra")
biocLite("ChemmineR")

3. General issues
When R is installed on Linux Mint, some packages are installed by default in /usr/lib/R/library. When performing updates as a non-root user, you’ll see messages telling you that this location is not writable and asking if you want to use your own library location. If you reply “yes”, you’ll have packages in both system and user locations. It’s probably better to say “no” and let the Ubuntu package management system handle the package upgrades…although when I tried that, the entire upgrade process halted…

And now we are all done so (careful!):

rm -rf ~/R/x86_64-pc-linux-gnu-library/3.1
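Given the "careful!", one option is to wrap that last step in a guard. This is a minimal sketch, not something from my original notes: the helper name remove_old_library is made up, and it only checks that the new library directory exists and is not empty before removing the old one.

```shell
# Sketch: only remove the old R library once the new one is in place.
# Assumption: "in place" just means the new directory exists and
# contains at least one entry; it does not verify the packages load.

# Hypothetical helper: remove_old_library OLD_DIR NEW_DIR
remove_old_library() {
  old="$1"
  new="$2"
  if [ -d "$new" ] && [ -n "$(ls -A "$new")" ]; then
    rm -rf "$old"
  else
    echo "refusing to remove $old: $new is missing or empty" >&2
    return 1
  fi
}

# Usage, with the paths from this post:
# remove_old_library ~/R/x86_64-pc-linux-gnu-library/3.1 ~/R/x86_64-pc-linux-gnu-library/3.2
```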

Tool tip: dropbox-restore

July 23, 2014 / nsaunders

I’m currently rather sleep-deprived and prone to doing stupid things. Like this, for example:

rsync -av ~/Dropbox /path/to/backup/directory/

where the directory /path/to/backup/directory already contains a much-older Dropbox directory. So when I set up a new machine, install Dropbox and copy the Dropbox directory back to its default location – hey! What happened to all my space? What are all these old files? Oh wait…I forgot to delete:

rsync -av --delete ~/Dropbox /path/to/backup/directory/

Now, files can be restored of course, but not when there are thousands of them and I don’t even know what’s old and new. What I want to do is restore the directories under ~/Dropbox to the state that they were in yesterday, before I stuffed up.

Luckily Chris Clark wrote dropbox-restore. It does exactly what it says on the tin. For example:

python restore.py /Camera\ Uploads 2014-07-22

Thanks Chris!
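A footnote on the rsync commands above: a dry run shows what --delete would remove before anything is touched. The sketch below assumes rsync is installed and uses throwaway directories under a temporary location. The helper name preview_deletions is hypothetical.

```shell
# Sketch: preview what rsync --delete would remove before running it.
# Assumptions: rsync is installed; the directories are throwaway
# examples created under a temporary location, not real paths.

# Hypothetical helper: dry-run a mirror of SRC into DEST and list
# only the files that would be deleted from DEST.
preview_deletions() {
  rsync -av --delete --dry-run "$1" "$2" | grep '^deleting ' || true
}

tmp="$(mktemp -d)"
mkdir -p "$tmp/Dropbox" "$tmp/backup/Dropbox"
echo new > "$tmp/Dropbox/current.txt"
echo old > "$tmp/backup/Dropbox/stale.txt"

# Nothing is changed on disk; this only reports the planned deletions.
preview_deletions "$tmp/Dropbox" "$tmp/backup/"
```

Once the list of deletions looks right, the same command without --dry-run does the real thing.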

utils4bioinformatics: all those “little snippets” in one place

June 23, 2014 / nsaunders / 6 Comments

Over the years, I’ve written a lot of small “utility scripts”. You know the kind of thing. Little code snippets that facilitate research, rather than generate research results. For example: just what are the fields that you can use to qualify Entrez database searches?

Typically, they end up languishing in long-forgotten Dropbox directories. Sometimes, the output gets shared as a public link. No longer! As of today, “little code snippets that do (hopefully) useful things” have a new home at GitHub.

Also as of today: there’s not much there right now, just the aforementioned Entrez database code and output. I’m not out to change the world here, just to do a little better.

Quilt plots. Like heat maps, only…heat maps

January 16, 2014 / nsaunders / 23 Comments

Stephen tweets:

Quilt Plots: A Simple Tool for the #Visualisation of Large Epidemiological Data http://t.co/mjYCo0nRTv

— Stephen Rudd (@SAGRudd) January 15, 2014

A “quilt plot”

Quilt plots. Sounds interesting. The link points to a short article in PLoS ONE, containing a table and a figure. Here is Figure 1.

If you looked at that and thought “Hey, that’s a heat map!”, you are correct. That is a heat map. Let’s be quite clear about that. It’s a heat map.

So, how do the authors justify publishing a method for drawing heat maps and then calling them “quilt plots”?

What You're Doing Is Rather Desperate

Notes from the life of a [data] scientist
