I’ve been meaning to write about Entrez Direct, henceforth called edirect, for some time. This tweet provided me with an excuse:
I was asked recently to look at some R code which performs “embarrassingly parallel” computations (the same function, multiple times, different parameters) and see whether I could modify it to run on one of our high-performance computing clusters. The machine has 63 virtual compute nodes and uses the TORQUE batch queue system to allocate nodes to compute jobs.
First stop: the CRAN Task View High-Performance and Parallel Computing with R. Two promising packages there: BatchJobs and BatchExperiments. Their documentation is quite extensive with useful examples, but I found it a little disjointed and confusing. What I wanted was a simple, step-by-step guide to setting up for a first-time user. So here is my attempt. As always, it’s for “Linux-like” systems.
Back in 2010, I wrote a web application called PMRetract to monitor retraction notices in the PubMed database. It was written primarily as a way for me to explore some technologies: the Ruby web framework Sinatra, MongoDB (hosted at MongoHQ, now Compose) and Heroku, where the app was hosted.
I automated the update process using Rake and the whole thing ran pretty smoothly, in a “set and forget” kind of way for four years or so. However, the first era of PMRetract is over. Heroku have shut down git pushes to their “Bamboo Stack” – which runs applications using Ruby version 1.8.7 – and will shut down the stack on June 16 2015. Currently, I don’t have the time either to update my code for a newer Ruby version or to figure out the (frankly, near-unintelligible) instructions for migration to the newer Cedar stack.
So I figured now was a good time to learn some new skills, deal with a few issues and relaunch PMRetract as something easier to maintain and more portable. Here it is. As all the code is “out there” for viewing, I’ll just add few notes here regarding this latest incarnation.
I am forever returning to PubMed data, downloaded as XML, trying to extract information from it and becoming deeply confused in the process.
Take the seemingly-simple question “how many retracted articles are there in PubMed?”
No revelations here, just a little R tip for generating more readable documents.There are times when I want to show code in a document, but I don’t want it to be the first thing that people see. What I want to see first is the output from that code. In this silly example, I want the reader to focus their attention on the result of myFunction(), which is 49.
Before we start: yes, we’ve been here before. There was the Biostars question “Calculating Time From Submission To Publication / Degree Of Burden In Submitting A Paper.” That gave rise to Pierre’s excellent blog post and code + data on Figshare.
So why are we here again? 1. It’s been a couple of years. 2. This is the R (+ Ruby) version. 3. It’s always worth highlighting how the poor state of publicly-available data prevents us from doing what we’d like to do. In this case the interesting question “which bioinformatics journal should I submit to for rapid publication?” becomes “here’s an incomplete analysis using questionable data regarding publication dates.”
Let’s get it out of the way then.
Let’s start by making one thing clear. Using coloured cells in Excel to encode different categories of data is wrong. Next time colleagues explain excitedly how “green equals normal and red = tumour”, you must explain that (1) they have sinned and (2) what they meant to do was add a column containing the words “normal” and “tumour”.
I almost hesitate to write this post but…we have to deal with the world as it is, not as we would like it to be. So in the interests of just getting the job done: here’s one way to deal with coloured cells in Excel, should someone send them your way.
There’s a lot of discussion around why code written by self-taught “scientist programmers” rarely follows what a trained computer scientist would consider “best practice”. Here’s a recent post on the topic.
One answer: we begin with exploratory data analysis and never get around to cleaning it up.
An example. For some reason, a researcher (let’s call him “Bob”) becomes interested in a particular dataset in the GEO database. So Bob opens the R console and use the GEOquery package to grab the data:
Update: those of you commenting “should have used Python instead” have completely missed the point. Your comments are off-topic and will not be published. Doubly-so when you get snarky about it.