Monthly Archives: January 2010

ISMB/ECCB 2009 reports

Great to see more reports describing the use of online tools to cover scientific meetings. Here are the publications, from PLoS Computational Biology:

Live Coverage of Scientific Conferences Using Web Technologies.

Live Coverage of Intelligent Systems for Molecular Biology/European Conference on Computational Biology (ISMB/ECCB) 2009.

And here’s Ally a.k.a the robo-blogger on Social Networking and Guidelines for Life Science Conferences.

Looks like we’ve started a trend, long may it continue at future meetings.


I no longer work on protein kinases but when I did, PhosphoGRID is the kind of database that I would have wanted to see. It features:

  • A nice clean interface, with good use of Javascript
  • Useful information returned from a simple search form
  • Data for download in plain text format with no restrictions or requirements for registration

All it lacks is a RESTful API, but nothing is perfect :-)

Published in the little-known but often-useful journal Database:

PhosphoGRID: a database of experimentally verified in vivo protein phosphorylation sites from the budding yeast Saccharomyces cerevisiae.

Clustering GEO samples by title (briefly) revisited

So, we had a brief discussion regarding my previous post and clearly the statement:

The longest key for which values exist classifies your titles

does not hold true for all cases. Not that I ever said that it did! I remind you that this blog is a place for the half-formed ideas that spill out of my head, not an instruction manual.

Let’s look, for example, at GSE19318. This GEO series comprises 2 platforms: one for dog (10 samples) and one for humans (1 sample), with these sample titles:

['Dog-tumor-81705', 'Dog-tumor-78709', 'Dog-tumor-88012', 'Dog-tumor-8888302', 'Dog-tumor-209439', 'Dog-tumor-212227', 'Dog-tumor-48', 'Dog-tumor-125', 'Dog-tumor-394', 'Dog-tumor-896', 'Human-tumor']

Run that through the Ruby code in my last post and we get:

{"Dog-tumor-48"=>["Dog-tumor-48"], "Dog-tumor-81"=>["Dog-tumor-81705"], "Dog-tumor-39"=>["Dog-tumor-394"], "Dog-tumor-20"=>["Dog-tumor-209439"], "Dog-tumor-21"=>["Dog-tumor-212227"], "Dog-tumor-88"=>["Dog-tumor-88012", "Dog-tumor-8888302"], "Dog-tumor-89"=>["Dog-tumor-896"], "Dog-tumor-12"=>["Dog-tumor-125"], "Dog-tumor-78"=>["Dog-tumor-78709"]}

Whoa. That went badly wrong! However, it’s easy to see why. With only one human sample, value.length is not more than one, so that sample disappears altogether. For the dog samples, the longest key is not the key that contains all samples, due to the title naming scheme.

We might try instead to maximize the value length – that is, the array value which contains the most samples:

  # longest value.length
  count = 0
  hash.each_pair do |key,value|
    count = value.length if count < value.length
  # delete unwanted keys
  hash.delete_if { |key,value| value.length != count }
  return hash

Which will give us a choice of “dog sample keys”, but still drops the human sample:

["Dog-tumo", "Do", "Dog-tumor-", "D", "Dog-tum", "Dog-t", "Dog", "Dog-tu", "Dog-tumor", "Dog-"]

Other things we might try:

  • Partition sample titles by platform before trying to partition samples by title
  • Don’t delete any hash keys based on key/value length; just present all options to the user
  • Decide that sample partitioning by title is a poor idea and try a different approach

As ever, life would be much easier if GEO samples were titled or described in some logical, parse-able fashion.

“Thinking algorithmically”: clustering GEO samples by title

Today’s challenge. Take a look at this array, which contains the “title” field for the 6 samples from GSE1323, a series in the GEO microarray database:


Humans are very good at classification. Almost instantly, you’ll see that there are 2 classes, “SW-480″ and “SW-620″, each with 3 samples. How can we write a program to do the same job?
Read the rest…

What I learned this week about: productivity, MongoDB and nginx

It was an excellent week. It was also a week in which I payed much less attention than usual to Twitter and FriendFeed. Something to think about…

I’m now using MongoDB so frequently that I’d like the database server to start up on boot and stay running. There are 2 init scripts for Debian/Ubuntu in this Google Groups thread. I followed the instructions in the second post, with minor modifications to the script:

# I like to keep mongodb in /opt
# you don't need 'run' in these options
DAEMON_OPTS='--dbpath /data/db'
# I prefer lower-case name for the PID file
# I think /data/db should be owned by the mongodb user
sudo chown -R mongodb /data/db
# don't forget to make script executable and update-rc.d
sudo /usr/sbin/update-rc.d mongodb defaults

nginx (and Rails and mod_rails)
This week, I deployed my first working Rails application to a remote server. It’s nothing fancy – just a database front-end that some colleagues need to access. I went from no knowledge to deployment in half a day, thanks largely to this article, describing how to make nginx work with Phusion Passenger.

Along the way, I encountered a common problem: how do you serve multiple applications, each living in a subdirectory?

# You have:
# You want these URLs to work:

Here’s how to do that:

# Make sure the Rails public/ directories are owned by the web server user:
sudo chown -R www-data.www-data /var/www/myapp1/public /var/www/myapp2/public
# Make symbolic links from public/ to the location which will be the web server document root
sudo ln -s /var/www/myapp1/public /var/www/rails/app1
sudo ln -s /var/www/myapp2/public /var/www/rails/app2
# These are the key lines in the server section of nginx.conf:
server {
        listen       8080;
        root /var/www/rails;
        passenger_enabled on;
        passenger_base_uri /app1;
        passenger_base_uri /app2;

That’s me for the week.

A new twist on the identifier mapping problem

Yesterday, Deepak wrote about BridgeDB, a software package to deal with the “identifier mapping problem”. Put simply, biologists can name a biological entity in any way that they like, leading to multiple names for the same object. Easily solved, you might think, by choosing one identifier and sticking to it, but that’s apparently way too much of a challenge.

However, there are times when this situation is forced upon us. Consider this code snippet, which uses the Bioconductor package GEOquery via the RSRuby library to retrieve a sample from the GEO database:

require "rubygems"
require "rsruby"

if ENV['R_HOME'].nil?
  ENV['R_HOME'] = "/usr/lib/R"

r = RSRuby.instance
sample = r.getGEO("GSM434143")
table  = r.Table(sample)
keys   = table.keys
puts keys

All good so far. What if I try to save the data table, which contains entries such as { “DETECTION.P.VALUE” => “0.000146581″ }, to my new favourite database, MongoDB?

key must not contain '.'

So what am I to do, other than modify the key using something like:

newkey = key.gsub(/\./, "_")

VoilĂ , my own personal contribution to the identifier mapping problem.

What’s the solution? Here are some options – rank them in order of silliness if you like:

  • Biological databases should avoid potentially “troublesome” keys
  • Database designers should allow any symbols in keys
  • Database driver writers should include methods to check keys and alter them if necessary
  • End users should create their own maps by storing the original key with the modified version

Samples per series/dataset in the NCBI GEO database

Andrew asks:

I want to get an NCBI GEO report showing the number of samples per series or data set. Short of downloading all of GEO, anyone know how to do this? Is there a table of just metadata hidden somewhere?

At work, we joke that GEO is the only database where data goes in, but it won’t come out. However, there is an alternative: the GEOmetadb package, available from Bioconductor.

The R code first, then some explanation:

# install GEOmetadb

# connect to database
con <- dbConnect(SQLite(), "GEOmetadb.sqlite")

# count samples per GDS
gds.count <- dbGetQuery(con, "select gds,sample_count from gds")
# first 5 results
     gds sample_count
1   GDS5            5
2   GDS6           29
3  GDS10           28
4  GDS12            8
5  GDS15            6
# count samples per GSE
gse <- dbGetQuery(con, "select series_id from gsm")
gse.count <-$series_id))
# first 10 results
                Var1 Freq
1               GSE1   38
2              GSE10    4
3             GSE100    4
4           GSE10000   29
5           GSE10001   12
6           GSE10002    8
7           GSE10003    4
8  GSE10004,GSE10114    3
9           GSE10005   48
10          GSE10006   75

We install GEOmetadb (lines 2-4), then download and unpack the SQLite database (line 7). This generates the file ~/GEOmetadb.sqlite, which is currently a little over 1 GB.

Next, we connect to the database via RSQLite (lines 7-8). The gds table contains GDS dataset accession and sample count, so extracting that information is very easy (line 11).

GSE series are a little different. The gsm table contains GSM sample accession and GSE series accession (in the series_id field). We can count up the samples per series using table(), on line 22. However, this generates some odd-looking results, such as:

          Var1          Freq
15    GSE10011,GSE10026 45
14652 GSE9973,GSE10026   9
14654 GSE9975,GSE10026  36
14656 GSE9977,GSE10026  24

Fear not. In this case, GSE10026 is a super-series comprised from the series GSE10011 (45 samples), GSE9973 (9 samples), GSE9975 (36 samples) and GSE9977 (24 samples), total = 114 samples.

How to: archive data via an API using Ruby and MongoDB

I was going to title this post “How to: archive a FriendFeed feed in MongoDB”. The example code does just that but (a) I fear that this blog suggests a near-obsession with FriendFeed (see tag cloud, right sidebar) and (b) the principles apply to any API that returns JSON. There are rare examples of biological data with JSON output in the wild, e.g. the ArrayExpress Gene Expression Atlas. So I’m still writing a bioinformatics blog ;-)

Let’s go straight to the code:


require "rubygems"
require "mongo"
require "json/pure"
require "open-uri"

# db config
db  ='friendfeed')
col = db.collection('lifesci')

# fetch json
0.step(9900, 100) {|n|
  f = open("{n}&amp;num=100").read
  j = JSON.parse(f)
  break if j['entries'].count == 0
  j['entries'].each do |entry|
    if col.find({:_id =&gt; entry['id']}).count == 0
      entry[:_id] = entry['id']
  puts "Processed entries #{n} - #{n + 99}", "Database contains #{col.count} documents."

puts "No more entries to process. Database contains #{col.count} documents."

Also available as a gist. Fork away.

A quick run-through. Lines 4-6 load the required libraries: mongo (the mongodb ruby driver), json and open-uri. If you don’t have the first two, simply “gem install mongo json_pure”. Of course, you’ll need to download MongoDB and have the mongod server daemon running on your system.

Lines 9-10 connect to the database (assuming a standard database installation). Rename the database and collection as you see fit. Both will be created if they don’t exist.

The guts are lines 12-25. A loop fetches JSON from the FriendFeed API, 100 entries at a time (0-99, 100-199…) up to 9999. That’s an arbitrarily-high number, to ensure that all entries are retrieved. Change “the-life-scientists” in line 14 to the feed of your choice. The JSON is then parsed into a hash structure. In lines 17-23 we loop through each entry and extract the “id” key, a unique identifier for the entry. This is used to create the “_id” field, a unique identifier for the MongoDB document. If a document with _id == id does not exist we create an _id key in the hash, delete the (now superfluous) ‘id’ key and save the document. Otherwise, the entry is skipped.
At some point the API will return no more entries: { “entries” : [] }. When this happens, we exit the block (line 16) and print a summary.

That’s it, more or less. Obviously, the script would benefit from some error checking and more options (such as supplying a feed URL as a command line option). For entries with attached files, the file URL but not the attachment will be saved. A nice improvement would be to fetch the attachment and save it to the database, using GridFS.

Possible uses: a simple archive, a backend for a web application to analyse the feed.

Reblog this post [with Zemanta]