Getting “stuff” into MongoDB

One of the aspects I like most about MongoDB is the “store first, ask questions later” approach. No need to worry about table design, column types or constant migrations as design changes. Provided that your data are in some kind of hash-like structure, you just drop them in.

Ruby is particularly useful for this task, since it has many gems which can parse common formats into a hash. Here are 3 quick examples with relevance to bioinformatics.
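
As a minimal sketch of the "parse into a hash, then store" idea, the snippet below turns a tab-delimited table and a JSON string into plain Ruby hashes. The sample data and the helper name tsv_to_hashes are made up for illustration; the final step of handing each hash to the MongoDB driver is left out, since it depends on the driver version.

```ruby
require 'json'

# Hypothetical sample data, for illustration only.
tsv_text  = "id\tscore\tlabel\nA1\t0.5\tcontrol\nA2\t0.9\ttreated\n"
json_text = '{"id": "A3", "score": 0.7, "tags": ["x", "y"]}'

# Turn a tab-delimited table into an array of hashes keyed by the header row.
def tsv_to_hashes(text)
  header, *lines = text.lines.map(&:chomp)
  keys = header.split("\t")
  lines.map { |line| keys.zip(line.split("\t")).to_h }
end

# JSON already parses straight into a hash.
documents = tsv_to_hashes(tsv_text) << JSON.parse(json_text)

# Each element is now a plain hash, ready to be saved as a document.
documents.each { |doc| puts doc.inspect }
```

Note that the two sources need not share the same keys: the documents can simply sit side by side in one collection.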

APIs have let me down part 1/2: ArrayExpress

The API – Application Programming Interface – is, in principle, a wonderful thing. You make a request to a server using a URL and back come lovely, structured data, ready to parse and analyse. We’ve begun to demand that all online data sources offer an API and lament the fact that so few online biological databases do so.

Better though, to have no API at all than one which is poorly implemented and leads to frustration? I’m beginning to think so, after recent experiences on both a work project and one of my “fun side projects”. Let’s start with the work project, an attempt to mine a subset of the ArrayExpress microarray database.

What the world needs is: lists of Entrez database fields

You know the problem. You want to qualify your NCBI/Entrez database search term using a field. For example: “autism[TIAB]”, to search PubMed for the word autism in either Title or Abstract. Problem – you can’t find a list of fields specific to that database.
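
For anyone who has not used field qualifiers, here is a rough sketch of how a fielded term ends up in an E-utilities ESearch request. The helper names fielded_term and esearch_url are my own inventions for this example; nothing is sent over the network, the code only builds the URL.

```ruby
require 'uri'

# Base URL of the NCBI ESearch utility.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Append a field qualifier in square brackets to a search word.
def fielded_term(word, field)
  "#{word}[#{field}]"
end

# Build (but do not fetch) an ESearch URL for a database and term.
def esearch_url(db, term)
  "#{ESEARCH}?#{URI.encode_www_form(db: db, term: term)}"
end

term = fielded_term("autism", "TIAB")
puts term
puts esearch_url("pubmed", term)
```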

Now you can. Follow the links in this public Dropbox file, to see a CSV file containing name, full name and description of the fields for each Entrez database.

Code to generate the files is listed below. This may or may not be the first in an occasional, irregular “what the world needs” series.

require 'rubygems'
require 'bio'
require 'hpricot'
require 'open-uri'

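Bio::NCBI.default_email = ""
ncbi = Bio::NCBI::REST.new

# einfo returns the names of the available Entrez databases
ncbi.einfo.each do |db|
  puts "Processing #{db}..."
  File.open("#{db}.txt", "w") do |f|
    # fetch the EInfo XML for this database
    doc = Hpricot(open("{db}"))
    (doc/'//fieldlist/field').each do |field|
      name = (field/'/name').inner_html
      fullname = (field/'/fullname').inner_html
      description = (field/'description').inner_html
      # one comma-separated line per field
      f.puts [name, fullname, description].join(",")
    end
  end
end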

Findings increasingly novel, scientists say…

…was the tongue-in-cheek title of an image that I posted to Twitpic this week. It shows the usage of the word “novel” in PubMed article titles over time. As someone correctly pointed out at FriendFeed, it needs to be corrected for total publications per year.
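
The correction itself is simple arithmetic: divide the yearly count of "novel" titles by the total number of titles for that year. A minimal sketch follows, using invented counts (they are not real PubMed numbers) purely to show the calculation.

```ruby
# Hypothetical counts, for illustration only: NOT real PubMed figures.
novel_titles = { 1990 => 200,     2000 => 900,     2009 => 2400 }
all_titles   = { 1990 => 400_000, 2000 => 500_000, 2009 => 800_000 }

# Express usage as "novel" titles per 1000 publications in the same year.
per_thousand = novel_titles.map do |year, count|
  [year, (1000.0 * count / all_titles[year]).round(2)]
end.to_h

per_thousand.each { |year, rate| puts "#{year}\t#{rate}" }
```

If the normalised rate still climbs, the trend is not just a side effect of more papers being published each year.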

It was inspired by a couple of items that caught my attention. First, a question at BioStar with the self-explanatory title Locations of plots of quantities of publicly available biological data. Second, an item at FriendFeed musing on the (over?) use of the word “insight” in scientific publications.

I’m sure that quite recently, I’ve read a letter to a journal which analysed the use of phrases such as “novel insights” in articles over time, but it’s currently eluding my search skills. So here’s my simple roll-your-own approach, using a little Ruby and R.

MongoDB: post-discussion thoughts

It’s good to talk. In my previous post, I aired a few issues concerning MongoDB database design. There’s nothing like including a buzzword (“NoSQL”) in your blog posts to bring out comments from the readers. My thoughts are much clearer, thanks in part to the discussion – here are the key points:

It’s OK to be relational
When you read about storing JSON in MongoDB (or similar databases), it’s easy to form the impression that the “correct” approach is always to embed documents and that if you use relational associations, you have somehow “failed”. This is not true. If you feel that your design and subsequent operations benefit from relational association between collections, go for it. That’s why MongoDB provides the option – for flexibility.
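
To make the two options concrete, here is a small sketch using plain Ruby hashes as stand-ins for documents. The collection and field names (experiments, samples, experiment_id) are hypothetical, chosen only for this example.

```ruby
# Option 1: embed. The samples live inside the experiment document.
experiment_embedded = {
  "_id"     => "exp1",
  "title"   => "Example experiment",
  "samples" => [
    { "name" => "s1", "value" => 1.5 },
    { "name" => "s2", "value" => 2.5 }
  ]
}

# Option 2: relate. Two collections, linked by an experiment_id field.
experiments = [{ "_id" => "exp1", "title" => "Example experiment" }]
samples = [
  { "_id" => "s1", "experiment_id" => "exp1", "value" => 1.5 },
  { "_id" => "s2", "experiment_id" => "exp1", "value" => 2.5 }
]

# With embedding the samples come back with the parent document...
embedded_values = experiment_embedded["samples"].map { |s| s["value"] }

# ...with association they are looked up by the parent's _id.
parent = experiments.first
related_values = samples
  .select { |s| s["experiment_id"] == parent["_id"] }
  .map { |s| s["value"] }

puts embedded_values.inspect
puts related_values.inspect
```

Both designs hold the same information; the choice comes down to which queries and updates you expect to run most often.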

Focus on the advantages
Remember why you chose MongoDB in the first place. In my case, a big reason was easy document saves. Even though I’ve chosen two or more collections in my design, the approach still affords advantages. The documents in those collections still contain arrays and hashes, which can be saved “as is”, without parsing. Not to mention that in a Rails environment, doing away with migrations is, in my experience, a huge benefit.

Get comfortable with map-reduce – now!
It’s tempting to pull records out of MongoDB and then “loop through” them, or use Ruby’s Enumerable methods (e.g. collect, inject) to process the results. If this is your approach, stop right now and read Map-reduce basics by Kyle Banker. Then start converting all of your old Ruby code to use map-reduce instead.

Before map-reduce, pages in my Rails application were taking seconds – sometimes tens of seconds – to render, when fetching only hundreds or thousands of database records. After: a second or less. That was my map-reduce epiphany and I’ll describe the details in the next blog post.
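
For readers new to the idea, the pattern can be sketched in plain Ruby. This is only an illustration of the map and reduce steps on made-up records; it is not MongoDB's API, where the two functions are written in JavaScript and run on the server, which is where the speed-up comes from.

```ruby
# Hypothetical records, standing in for documents in a collection.
records = [
  { "user" => "alice", "comments" => 2 },
  { "user" => "bob",   "comments" => 1 },
  { "user" => "alice", "comments" => 3 }
]

# Map step: emit one [key, value] pair per record.
map = ->(record) { [record["user"], record["comments"]] }

# Reduce step: combine all the values emitted for a single key.
reduce = ->(_key, values) { values.sum }

emitted = records.map(&map)
grouped = emitted.group_by(&:first)
totals  = grouped.map do |key, pairs|
  [key, reduce.call(key, pairs.map(&:last))]
end.to_h

puts totals.inspect
```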

The “NoSQL” approach: struggling to see the benefits

Document-oriented data modeling is still young. The fact is, many more applications will need to be built on the document model before we can say anything definitive about best practices.
MongoDB Data Modeling and Rails


[Image: ISMB 2009 feed, entries by date]

This quote from the MongoDB website sums up, for me, the key problem in moving to a document-oriented, schema-free database: design. It’s easy to end up with a solution which resembles a relational database so closely that you begin to wonder whether you should just use a relational database. I’d like to illustrate by example.

MongoDB and Ubuntu 10.04

Fans of MongoDB and Ubuntu, rejoice. Installation just got easier, with the appearance of mongodb in the Ubuntu repositories.

However – the latest version in lucid is 1.2.2, whereas you want the very latest, 1.4.2. All the instructions are here. As usual, it’s just a case of adding a line to your sources list, importing a GPG key and then:

sudo apt-get update
sudo apt-get install mongodb-stable

Configuration lives in /etc/mongodb.conf, databases live in /var/lib/mongodb, logging goes to /var/log/mongodb/*.log.

A new twist on the identifier mapping problem

Yesterday, Deepak wrote about BridgeDB, a software package to deal with the “identifier mapping problem”. Put simply, biologists can name a biological entity in any way that they like, leading to multiple names for the same object. Easily solved, you might think, by choosing one identifier and sticking to it, but that’s apparently way too much of a challenge.

However, there are times when this situation is forced upon us. Consider this code snippet, which uses the Bioconductor package GEOquery via the RSRuby library to retrieve a sample from the GEO database:

require "rubygems"
require "rsruby"

# RSRuby needs R_HOME; fall back to a default location if it is unset
if ENV['R_HOME'].nil?
  ENV['R_HOME'] = "/usr/lib/R"
end

r = RSRuby.instance
sample = r.getGEO("GSM434143")
table  = r.Table(sample)
keys   = table.keys
puts keys

All good so far. What if I try to save the data table, which contains entries such as { "DETECTION.P.VALUE" => "0.000146581" }, to my new favourite database, MongoDB?

key must not contain '.'

So what am I to do, other than modify the key using something like:

newkey = key.gsub(/\./, "_")

Voilà, my own personal contribution to the identifier mapping problem.

What’s the solution? Here are some options – rank them in order of silliness if you like:

  • Biological databases should avoid potentially “troublesome” keys
  • Database designers should allow any symbols in keys
  • Database driver writers should include methods to check keys and alter them if necessary
  • End users should create their own maps by storing the original key with the modified version
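
The last option might look something like the sketch below: rewrite the keys, but keep a map from each new key back to the original. The method name sanitize_keys and the second sample entry are hypothetical; this is one possible approach, not a feature of any MongoDB driver.

```ruby
# Replace "." in keys and remember which original key each new key came from.
def sanitize_keys(row, replacement = "_")
  clean   = {}
  key_map = {}
  row.each do |key, value|
    new_key = key.to_s.gsub(".", replacement)
    # Refuse to silently overwrite if two keys collapse to the same name.
    raise ArgumentError, "key collision on #{new_key}" if clean.key?(new_key)
    clean[new_key]   = value
    key_map[new_key] = key
  end
  [clean, key_map]
end

# The first entry is from the GEO table above; the second is made up.
row = { "DETECTION.P.VALUE" => "0.000146581", "VALUE" => "12.5" }

clean, key_map = sanitize_keys(row)
puts clean.inspect
puts key_map.inspect
```

Storing key_map alongside the data means the original GEO column names can be restored later.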

Has our quest for completeness made things too complicated?

In my opinion, yes. Let me elaborate.

My current job is very much focused on “data integration”. What this means is that we have a large amount of diverse data from different “-omics” experiments: microarrays, protein mass spectrometry, DNA sequencing – really, whatever you like, but it’s all aimed at answering the same question. Namely: which of these biological entities (transcripts, proteins, metabolites) are markers for various human disease states?