Category Archives: R

Some basics of biomaRt

One of the commonest bioinformatics questions, at Biostars and elsewhere, takes the form: “I have a list of identifiers (X); I want to relate them to a second set of identifiers (Y)”. HGNC gene symbols to Ensembl Gene IDs, for example.

When this occurs I have been known to tweet “the answer is BioMart” (there are often other solutions too) and I’ve written a couple of blog posts about the R package biomaRt in the past. However, I’ve realised that we need to take a step back and ask some basic questions that new users might have. How do I find what marts and datasets are available? How do I know what attributes and filters to use? How do I specify different genome build versions?

R 3.1 -> 3.2 upgrade notes

My machines upgraded from R version 3.1.3 to version 3.2.0 last week, which means that existing code suddenly cannot find packages and so fails. Some notes to myself, possibly useful to others, for what to do when this happens. Relevant to Ubuntu-based systems (I use Linux Mint).

1. Update packages

cp -r ~/R/x86_64-pc-linux-gnu-library/3.1 ~/R/x86_64-pc-linux-gnu-library/3.2

and then in R:

update.packages(checkBuilt=TRUE, ask=FALSE)

1.1. rJava issues
My rJava installation failed because code was trying to compile against jni.h which was not present on my system. Solution:

sudo apt-get install openjdk-7-jdk
sudo R CMD javareconf

and then in R:

install.packages("rJava")

2. Update Bioconductor
Bioconductor is also upgraded so requires more than a package update. Probably need a new R session for this one.

remove.packages("BiocInstaller")
source("http://bioconductor.org/biocLite.R")
biocLite()

2.1. ChemmineR
My Bioconductor ChemmineR update failed because the package gridExtra was absent:

install.packages("gridExtra")
biocLite("ChemmineR")

3. General issues
When R is installed on Linux Mint, some packages are installed by default in /usr/lib/R/library. When performing updates as a non-root user, you’ll see messages telling you that this location is not writable and asking if you want to use your own library location. If you reply “yes”, you’ll have packages in both system and user locations. It’s probably better to say “no” and let the Ubuntu package management system handle the package upgrades…although when I tried that, the entire upgrade process halted…
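If you do end up with packages in both system and user locations, a quick check like the one below can show which ones are duplicated. This is only a sketch using base R; the variable names are my own and the output will depend entirely on your machine.

```r
# Sketch: list packages that are installed in more than one library location.
# Uses base R only; the variable names are illustrative.
ip <- installed.packages()[, c("Package", "LibPath", "Version", "Built"), drop = FALSE]
rownames(ip) <- NULL  # drop package-name row names, which may repeat across libraries
pkgs <- as.data.frame(ip, stringsAsFactors = FALSE)

# Package names that appear more than once across all library paths
dup_names <- unique(pkgs$Package[duplicated(pkgs$Package)])
dups <- pkgs[pkgs$Package %in% dup_names, ]
dups <- dups[order(dups$Package, dups$LibPath), ]

print(.libPaths())              # library locations that R searches, in order
print(dups, row.names = FALSE)  # zero rows means no duplicates
```

The Built column shows the R version each copy was built under, which helps spot copies left over from 3.1.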

And now we are all done so (careful!):

rm -rf ~/R/x86_64-pc-linux-gnu-library/3.1

Project Tycho, ggplot2 and the shameless stealing of blog ideas

Last week, Mick Watson posted a terrific article on using R to recreate the visualizations in this WSJ article on the impact of vaccination. Someone beat me to the obvious joke.

Someone also beat me to the standard response whenever base R graphics are used.

And despite devoting much of Friday morning to it, I was beaten to publication of a version using ggplot2.

Why then would I even bother to write this post? Well, because I did things a little differently; diversity of opinion and illustration of alternative approaches are good. And because on the internet, it’s quite acceptable to appropriate great ideas from other people when you lack any inspiration yourself. And because I devoted much of Friday morning to it.

Here then is my “exploration of what Mick did already, only using ggplot2 like Ben did already.”

Configuring the R BatchJobs package for Torque batch queues

I was asked recently to look at some R code which performs “embarrassingly parallel” computations (the same function, multiple times, different parameters) and see whether I could modify it to run on one of our high-performance computing clusters. The machine has 63 virtual compute nodes and uses the TORQUE batch queue system to allocate nodes to compute jobs.

First stop: the CRAN Task View High-Performance and Parallel Computing with R. Two promising packages there: BatchJobs and BatchExperiments. Their documentation is quite extensive with useful examples, but I found it a little disjointed and confusing. What I wanted was a simple, step-by-step guide to setting up for a first-time user. So here is my attempt. As always, it’s for “Linux-like” systems.

PubMed retraction reporting update

Just a quick update to the previous post. At the helpful suggestion of Steve Royle, I’ve added a new section to the report which attempts to normalise retractions by journal. So for example, J. Biol. Chem. has (as of now) 94 retracted articles and in total 170 842 publications indexed in PubMed. That becomes (100 000 / 170 842) * 94 = 55.022 retractions per 100 000 articles.
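For the record, here is that normalisation as a small R function. It is a sketch of the arithmetic only, not the code from the report, and the function name is made up for illustration.

```r
# Retractions per 100 000 indexed publications.
# The function name is illustrative; the numbers are the J. Biol. Chem.
# figures quoted above.
retractions_per_100k <- function(retracted, total) {
  (1e5 / total) * retracted
}

jbc <- retractions_per_100k(retracted = 94, total = 170842)
round(jbc, 3)
# [1] 55.022
```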

Top 20 journals, retracted articles per 100 000 publications

This leads to some startling changes to the journals “top 20” list. If you’re wondering what’s going on in the world of anaesthesiology, look no further (thanks again to Steve for the reminder).

PMRetract: PubMed retraction reporting rewritten as an interactive RMarkdown document

Back in 2010, I wrote a web application called PMRetract to monitor retraction notices in the PubMed database. It was written primarily as a way for me to explore some technologies: the Ruby web framework Sinatra, MongoDB (hosted at MongoHQ, now Compose) and Heroku, where the app was hosted.

I automated the update process using Rake and the whole thing ran pretty smoothly, in a “set and forget” kind of way for four years or so. However, the first era of PMRetract is over. Heroku have shut down git pushes to their “Bamboo Stack” – which runs applications using Ruby version 1.8.7 – and will shut down the stack on June 16 2015. Currently, I don’t have the time either to update my code for a newer Ruby version or to figure out the (frankly, near-unintelligible) instructions for migration to the newer Cedar stack.

So I figured now was a good time to learn some new skills, deal with a few issues and relaunch PMRetract as something easier to maintain and more portable. Here it is. As all the code is “out there” for viewing, I’ll just add a few notes here regarding this latest incarnation.

Make prettier documents by reusing chunks in RMarkdown

No revelations here, just a little R tip for generating more readable documents.

Original with lots of code at the top

There are times when I want to show code in a document, but I don’t want it to be the first thing that people see. What I want to see first is the output from that code. In this silly example, I want the reader to focus their attention on the result of myFunction(), which is 49.

Bioinformatics journals: time from submission to acceptance, revisited

Before we start: yes, we’ve been here before. There was the Biostars question “Calculating Time From Submission To Publication / Degree Of Burden In Submitting A Paper.” That gave rise to Pierre’s excellent blog post and code + data on Figshare.

So why are we here again? 1. It’s been a couple of years. 2. This is the R (+ Ruby) version. 3. It’s always worth highlighting how the poor state of publicly-available data prevents us from doing what we’d like to do. In this case the interesting question “which bioinformatics journal should I submit to for rapid publication?” becomes “here’s an incomplete analysis using questionable data regarding publication dates.”

Let’s get it out of the way then.