Friday fun projects

May 14, 2011May 20, 2011 / nsaunders / 1 Comment

What’s a “Friday fun project”? It’s a small computing project, perfect for a Friday afternoon, which serves the dual purpose of (1) keeping your programming/data analysis skills sharp and (2) providing a mental break from the grind of your day job. Ideally, the skills learned on the project are useful and transferable to your work projects.

This post describes one of my Friday fun projects: how good is the Sydney weather forecast? There’ll be some Ruby, some R and some MongoDB. Bioinformatics? No, but some thoughts on that at the end.
Read the rest…

Getting “stuff” into MongoDB

February 14, 2011 / nsaunders / 3 Comments

One of the aspects I like most about MongoDB is the “store first, ask questions later” approach. No need to worry about table design, column types or constant migrations as design changes. Provided that your data are in some kind of hash-like structure, you just drop them in.

Ruby is particularly useful for this task, since it has many gems which can parse common formats into a hash. Here are 3 quick examples with relevance to bioinformatics.
Read the rest…

Monitoring PubMed retractions: a Heroku-hosted Sinatra application

December 22, 2010December 22, 2010 / nsaunders / 1 Comment

In a previous post analysing retractions from PubMed, I wrote:

It strikes me that it would be relatively easy to build a web application (Rails, Heroku), which constantly monitors retraction data at PubMed and generates a variety of statistics and charts.

“Relatively easy” it was. Let me introduce you to PMRetract, my first publicly-available web application.
Read the rest…

Connecting to a MongoDB database from R using Java

September 24, 2010May 20, 2011 / nsaunders / 3 Comments

It would be nice if there were an R package, along the lines of RMySQL, for MongoDB. For now there is not – so, how best to get data from a MongoDB database into R?

One option is to retrieve JSON via the MongoDB REST interface and parse it using the rjson package. Assuming, for example, that you have retrieved your CiteULike collection in JSON format from this URL:

http://www.citeulike.org/json/user/neils

– and saved it to a database named citeulike in a collection named articles, you can fetch the first 5 articles into R like so:

library(RCurl)
library(rjson)

db <- "http://localhost:28017/citeulike/articles/?limit=5"
articles <- fromJSON(getURL(db))
articles$rows[[1]]$title
# [1] "A computational genomics pipeline for prokaryotic sequencing projects"

That works, but you may not want to use the MongoDB REST interface: for example, it may be slow for large queries or there might be security concerns.

MongoDB has both C and Java drivers. R has packages that interface with these languages: .C/.Call and rJava, respectively. My only problem is that I can write what I know about C and Java on the back of a postage stamp.

Not to be deterred, I took the approach that has served me well my whole professional life: wing it, using what I could glean from Google searches and the Web. In the end, using Java in R to connect with MongoDB was surprisingly easy. Here’s a basic how-to.
Read the rest…

Backup your CiteULike library using MongoDB and Ruby

August 20, 2010August 20, 2010 / nsaunders / 2 Comments

Well, that was easy.

#!/usr/bin/ruby
require "rubygems"
require "mongo"
require "json/pure"
require "open-uri"

db  = Mongo::Connection.new.db('citeulike')
col = db.collection('articles')
j   = JSON.parse(open("http://www.citeulike.org/json/user/neils").read)

j.each do |article|
  article[:_id] = article['article_id']
  col.save(article)
end

MongoDB, Mongoid and Map/Reduce

August 9, 2010 / nsaunders / 8 Comments

I’ve been experimenting with MongoDB’s map-reduce, called from Ruby, as a replacement for Ruby’s Enumerable methods (map/collect, inject). It’s faster. Much faster.

Next, the details but first – the disclaimers:

My database is not optimised (using, e.g. indices)
My non-map/reduce code is not optimised; I’m sure there are better ways to perform database queries and use Enumerable than those in this post
My map/reduce code is not optimised – these are my first attempts

In short nothing is optimised, my code is probably awful and I’m making it up as I go along. Here we go then!
Read the rest…

MongoDB: post-discussion thoughts

August 8, 2010 / nsaunders / 1 Comment

It’s good to talk. In my previous post, I aired a few issues concerning MongoDB database design. There’s nothing like including a buzzword (“NoSQL”) in your blog posts to bring out comments from the readers. My thoughts are much clearer, thanks in part to the discussion – here are the key points:

It’s OK to be relational
When you read about storing JSON in MongoDB (or similar databases), it’s easy to form the impression that the “correct” approach is always to embed documents and that if you use relational associations, you have somehow “failed”. This is not true. If you feel that your design and subsequent operations benefit from relational association between collections, go for it. That’s why MongoDB provides the option – for flexibility.

Focus on the advantages
Remember why you chose MongoDB in the first place. In my case, a big reason was easy document saves. Even though I’ve chosen two or more collections in my design, the approach still affords advantages. The documents in those collections still contain arrays and hashes, which can be saved “as is”, without parsing. Not to mention that in a Rails environment, doing away with migrations is, in my experience, a huge benefit.

Get comfortable with map-reduce – now!
It’s tempting to pull records out of MongoDB and then “loop through” them, or use Ruby’s Enumerable methods (e.g. collect, inject) to process the results. If this is your approach, stop, right now and read Map-reduce basics by Kyle Banker. Then, start converting all of your old Ruby code to use map-reduce instead.

Before map-reduce, pages in my Rails application were taking seconds – sometimes tens of seconds – to render, when fetching only hundreds or thousands of database records. After: a second or less. That was my map-reduce epiphany and I’ll describe the details in the next blog post.

The “NoSQL” approach: struggling to see the benefits

August 5, 2010 / nsaunders / 12 Comments

Document-oriented data modeling is still young. The fact is, many more applications will need to be built on the document model before we can say anything definitive about best practices.
— MongoDB Data Modeling and Rails

ISMB 2009 feed, entries by date

This quote from the MongoDB website sums up, for me, the key problem in moving to a document-oriented, schema-free database: design. It’s easy to end up with a solution which resembles a relational database to the extent that you begin to wonder – if you should not just use a relational database. I’d like to illustrate by example.
Read the rest…

MongoDB and Ubuntu 10.04

May 6, 2010 / nsaunders

Fans of MongoDB and Ubuntu, rejoice. Installation just got easier, with the appearance of mongodb in the Ubuntu repositories.

However – the latest version in lucid is 1.2.2, whereas you want the very latest, 1.4.2. All the instructions are here. As usual, it’s just a case of adding a line to your sources list, importing a GPG key and then:

sudo apt-get update
sudo apt-get install mongodb-stable

Configuration lives in /etc/mongodb.conf, databases live in /var/lib/mongodb, logging goes to /var/log/mongodb/*.log.

What I learned this week about: productivity, MongoDB and nginx

January 15, 2010 / nsaunders

Productivity
It was an excellent week. It was also a week in which I payed much less attention than usual to Twitter and FriendFeed. Something to think about…

MongoDB
I’m now using MongoDB so frequently that I’d like the database server to start up on boot and stay running. There are 2 init scripts for Debian/Ubuntu in this Google Groups thread. I followed the instructions in the second post, with minor modifications to the script:

# I like to keep mongodb in /opt
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/opt/mongodb/bin
DAEMON=/opt/mongodb/bin/mongod
# you don't need 'run' in these options
DAEMON_OPTS='--dbpath /data/db'
# I prefer lower-case name for the PID file
NAME=mongodb
# I think /data/db should be owned by the mongodb user
sudo chown -R mongodb /data/db
# don't forget to make script executable and update-rc.d
sudo /usr/sbin/update-rc.d mongodb defaults

nginx (and Rails and mod_rails)
This week, I deployed my first working Rails application to a remote server. It’s nothing fancy – just a database front-end that some colleagues need to access. I went from no knowledge to deployment in half a day, thanks largely to this article, describing how to make nginx work with Phusion Passenger.

Along the way, I encountered a common problem: how do you serve multiple applications, each living in a subdirectory?

# You have:
/var/www/rails/myapp1
/var/www/rails/myapp2
# You want these URLs to work:
http://my.server.name:8080/app1
http://my.server.name:8080/app2

Here’s how to do that:

# Make sure the Rails public/ directories are owned by the web server user:
sudo chown -R www-data.www-data /var/www/myapp1/public /var/www/myapp2/public
# Make symbolic links from public/ to the location which will be the web server document root
sudo ln -s /var/www/myapp1/public /var/www/rails/app1
sudo ln -s /var/www/myapp2/public /var/www/rails/app2
# These are the key lines in the server section of nginx.conf:
server {
        listen       8080;
        server_name  my.server.name;
        root /var/www/rails;
        passenger_enabled on;
        passenger_base_uri /app1;
        passenger_base_uri /app2;
}

That’s me for the week.

What You're Doing Is Rather Desperate

Notes from the life of a [data] scientist

mongodb