Monthly Archives: September 2010

Connecting to a MongoDB database from R using Java

It would be nice if there were an R package, along the lines of RMySQL, for MongoDB. For now there is not – so, how best to get data from a MongoDB database into R?

One option is to retrieve JSON via the MongoDB REST interface and parse it using the rjson package. Assuming, for example, that you have retrieved your CiteULike collection in JSON format from this URL:

http://www.citeulike.org/json/user/neils

- and saved it to a database named citeulike in a collection named articles, you can fetch the first 5 articles into R like so:

library(RCurl)
library(rjson)

db <- "http://localhost:28017/citeulike/articles/?limit=5"
articles <- fromJSON(getURL(db))
articles$rows[[1]]$title
# [1] "A computational genomics pipeline for prokaryotic sequencing projects"

That works, but you may not want to use the MongoDB REST interface: for example, it may be slow for large queries or there might be security concerns.

MongoDB has both C and Java drivers. R can interface with both languages: via the built-in .C/.Call functions for C, and via the rJava package for Java. My only problem is that I can write what I know about C and Java on the back of a postage stamp.

Not to be deterred, I took the approach that has served me well my whole professional life: wing it, using what I could glean from Google searches and the Web. In the end, using Java in R to connect with MongoDB was surprisingly easy. Here’s a basic how-to.
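To give a flavour of the approach, here is a minimal sketch of connecting to the same citeulike database via rJava and the MongoDB Java driver. This assumes you have downloaded the driver jar (named mongo-java-driver.jar here, but check your version) into the working directory and that mongod is running on localhost; the class and method names are those of the Java driver's Mongo, DB and DBCollection classes.

library(rJava)

# Start the JVM and put the MongoDB Java driver on the classpath
.jinit()
.jaddClassPath("mongo-java-driver.jar")

# Connect to mongod on localhost, then select database and collection
m    <- .jnew("com/mongodb/Mongo", "localhost")
db   <- m$getDB("citeulike")
coll <- db$getCollection("articles")

# Fetch the first document and pull out a field
doc <- coll$findOne()
doc$get("title")

rJava's $ operator dispatches straight to the underlying Java methods, so the R code reads much like the equivalent Java.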
Read the rest…

Nodalpoint: now in glorious stereophonic audio

Nodalpoint Conversations is, in Greg’s words, “Nodalpoint rebooted as a podcast”. Long-time readers will remember Nodalpoint, a bioinformatics community where many of us first came together.

In the “pilot” episode, Greg and I chat (via Skype between Leiden and Sydney) about all things bioinformatics. It was a new experience for me and one which I greatly enjoyed, despite being somewhat unsettled by the sound of my own recorded voice.

Check out Episode 1 of Nodalpoint Conversations. It’s as close to proof of my existence as you’ll get. Then follow nodalconv on Twitter.

Trust no-one: errors and irreproducibility in public data

Just when I was beginning to despair at the state of publicly-available microarray data, someone sent me an article which…increased my despair.

The article is:

Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology (2009)
Keith A. Baggerly and Kevin R. Coombes
Ann. Appl. Stat. 3(4): 1309-1334

It escaped my attention last year, in part because “Annals of Applied Statistics” is not high on my journal radar. However, other bloggers did pick it up: see posts at Reproducible Research Ideas and The Endeavour.

In this article, the authors examine several papers, in their words "purporting to use microarray-based signatures of drug sensitivity derived from cell lines to predict patient response." They find that the results are not only difficult to reproduce but, in several cases, cannot be reproduced at all, due to simple, avoidable errors. In the introduction, they note that:

…a recent survey [Ioannidis et al. (2009)] of 18 quantitative papers published in Nature Genetics in the past two years found reproducibility was not achievable even in principle for 10.

You can get an idea of how bad things are by skimming through the sub-headings in the article. Here’s a selection of them:
Read the rest…