I was asked recently to look at some R code which performs “embarrassingly parallel” computations (the same function, multiple times, different parameters) and see whether I could modify it to run on one of our high-performance computing clusters. The machine has 63 virtual compute nodes and uses the TORQUE batch queue system to allocate nodes to compute jobs.
First stop: the CRAN Task View High-Performance and Parallel Computing with R. Two promising packages there: BatchJobs and BatchExperiments. Their documentation is quite extensive with useful examples, but I found it a little disjointed and confusing. What I wanted was a simple, step-by-step guide to setting up for a first-time user. So here is my attempt. As always, it’s for “Linux-like” systems.
“Take a look at the TP53 mutation database“, my colleague suggested. “OK then, I will”, I replied.
I present what follows as “a typical day in the life of a bioinformatician”.
I’m currently rather sleep-deprived and prone to doing stupid things. Like this, for example:
rsync -av ~/Dropbox /path/to/backup/directory/
where the directory
/path/to/backup/directory already contains a much-older Dropbox directory. So when I set up a new machine, install Dropbox and copy the Dropbox directory back to its default location – hey! What happened to all my space? What are all these old files? Oh wait…I forgot to delete:
rsync -av --delete ~/Dropbox /path/to/backup/directory/
Now, files can be restored of course, but not when there are thousands of them and I don’t even know what’s old and new. What I want to do is restore the directories under ~/Dropbox to the state that they were in yesterday, before I stuffed up.
Luckily Chris Clark wrote dropbox-restore. It does exactly what it says on the tin. For example:
python restore.py /Camera\ Uploads 2014-07-22
This post is an apology and an attempt to make amends for contributing to the decay of online bioinformatics resources. It’s also, I think, a nice example of why reproducible research can be difficult.
Come back in time with me 10 years, to 2004.
Over the years, I’ve written a lot of small “utility scripts”. You know the kind of thing. Little code snippets that facilitate research, rather than generate research results. For example: just what are the fields that you can use to qualify Entrez database searches?
Typically, they end up languishing in long-forgotten Dropbox directories. Sometimes, the output gets shared as a public link. No longer! As of today, “little code snippets that do (hopefully) useful things” have a new home at Github.
Also as of today: there’s not much there right now, just the aforementioned Entrez database code and output. I’m not out to change the world here, just to do a little better.
There’s a lot of discussion around why code written by self-taught “scientist programmers” rarely follows what a trained computer scientist would consider “best practice”. Here’s a recent post on the topic.
One answer: we begin with exploratory data analysis and never get around to cleaning it up.
An example. For some reason, a researcher (let’s call him “Bob”) becomes interested in a particular dataset in the GEO database. So Bob opens the R console and use the GEOquery package to grab the data:
Update: those of you commenting “should have used Python instead” have completely missed the point. Your comments are off-topic and will not be published. Doubly-so when you get snarky about it.
Read the rest…
Just a brief technical note.
I figured that for a given compound in PubChem, it would be interesting to know whether that compound had been used in a high-throughput experiment, which you might find in GEO. Very easy using the E-utilities, as implemented in the R package rentrez:
links <- entrez_link(dbfrom = "pccompound", db = "gds", id = "62857")
#  741
Browsing the rentrez documentation, I note that db can take the value “all”. Sounds useful!
links <- entrez_link(dbfrom = "pccompound", db = "all", id = "62857")
#  0
That’s odd. In fact, this query does not even link pccompound to gds:
#  39
which(names(links) == "pccompound_gds")
It’s not a rentrez issue, since the same result occurs using the E-utilities URL.
The good people at ropensci have opened an issue, contacting NCBI for clarification. We’ll keep you posted.
I’m pleased to announce an open-access publication with my name on it:
Mitchell, S.M., Ross, J.P., Drew, H.R., Ho, T., Brown, G.S., Saunders, N.F.W., Duesing, K.R., Buckley, M.J., Dunne, R., Beetson, I., Rand, K.N., McEvoy, A., Thomas, M.L., Baker, R.T., Wattchow, D.A., Young, G.P., Lockett, T.J., Pedersen, S.K., LaPointe L.C. and Molloy, P.L. (2014). A panel of genes methylated with high frequency in colorectal cancer. BMC Cancer 14:54.
Admitting to stupidity is part of the learning process. So in the interests of public education, here’s something stupid that I did today.
You’re working in the R console. Happy with your exploratory code, you decide to save it to a file.
savehistory(file = "myCode.R")
Then, you type something else, for example:
# more lines here
And then, decide that you should save again:
savehistory(file = "myCode.R")
You quit the console. Returning to it later, you recall that you saved your code and so can simply run source() to get back to the same point:
Unfortunately, you forget that the sourced file now contains the savehistory() command. Result: since your new history contains only the single line source() command, then that is what gets saved back to the file, replacing all of your lovely code.
Possible solutions include:
- Remember to edit the saved file, removing or commenting out any savehistory() lines
- Generate a file name for savehistory() based on a timestamp so as not to overwrite each time
- Suggested by Scott: include a prompt in the code before savehistory()
When dealing with data from high-throughput experimental platforms such as microarrays, it’s important to account for potential batch effects. A simple example: if you process all your normal tissue samples this week and your cancerous tissue samples next week, you’re in big trouble. Differences between cancer and normal are now confounded with processing time and you may as well start over with new microarrays.
Processing date is often a good surrogate for batch and it was once easy to extract dates from Affymetrix CEL files using Bioconductor. It seems that this is no longer the case.
Read the rest…