How to: remember that you once knew how to parse KEGG

Recently, someone asked me if I could generate a list of genes associated with a particular pathway. Sure, I said and hacked together some rather nasty code in R which, given a KEGG pathway identifier, used a combination of the KEGG REST API, DBGET and biomaRt to return HGNC symbols.

Coincidentally, someone asked the same question at Biostar. Pierre recommended the TogoWS REST service, which provides an API to multiple biological data sources. An article describing TogoWS was published in 2010.

An excellent suggestion – and one which, I later discovered, I had bookmarked. Twice. As long ago as 2008. This “rediscovery of things I once knew” happens to me with increasing frequency now, which makes me wonder whether (1) we really are drowning in information, (2) my online curation tools/methods require improvement or (3) my mind is not what it was. Perhaps some combination of all three.

Anyway – using Ruby (1.8.7), a list of HGNC symbols given a KEGG pathway, e.g. MAPK signaling, is as simple as:

require 'rubygems'
require 'open-uri'
require 'json/pure'

j = JSON.parse(open("http://togows.dbcls.jp/entry/pathway/hsa04010/genes.json").read)
g = j.first.values.map {|v| /^(.*?);/.match(v)[1] }
# first 5 genes
g[0..4]
# ["MAP3K14", "FGF17", "FGF6", "DUSP9", "MAP3K6"]

This code parses the JSON returned from TogoWS into an array with one element; the element is a hash with key/value pairs of the form:

"9020"=>"MAP3K14; mitogen-activated protein kinase kinase kinase 14 [KO:K04466] [EC:2.7.11.25]"

Values for all keys that I’ve seen to date begin with the HGNC symbol followed by a semicolon, making extraction quite straightforward with a simple regular expression.

PMRetract: now with rake tasks

Bioinformaticians (and anyone else who programs) love effective automation of mundane tasks. So it may amuse you to learn that I used to update PMRetract, my PubMed retraction notice monitoring application, by manually running the following steps in order:

  1. Run query at PubMed website with term “Retraction of Publication[Publication Type]”
  2. Send results to XML file
  3. Run script to update database with retraction and total publication counts for years 1977 – present
  4. Run script to update database with retraction notices
  5. Run script to update database with retraction timeline
  6. Commit changes to git
  7. Push changes to Github
  8. Dump local database to file
  9. Restore remote database from file
  10. Restart Heroku application

I’ve been meaning to wrap all of that up in a Rakefile for some time. Finally, I have. Along the way, I learned something about using efetch from BioRuby and re-read one of my all-time favourite tutorials, on how to write rake tasks. So now, when I receive an update via RSS, updating should be as simple as:

rake pmretract

In other news: it’s been quiet here, hasn’t it? I recently returned from 4 weeks overseas, packed up my office and moved to a new building. Hope to get back to semi-regular posts before too long.

Monitoring PubMed retractions: updates

chart

PubMed cumulative retractions 1977-present

There’s been a recent flurry of interest in retractions. See for example: Scientific Retractions: A Growth Industry?; summarised also by GenomeWeb in Take That Back; articles in the WSJ and the Pharmalot blog; and academic articles in the Journal of Medical Ethics and Infection & Immunity.

Several of these sources cite data from my humble web application, PMRetract. So now seems like a good time to mention that:

  • The application is still going strong and is updated regularly
  • I’ve added a few enhancements to the UI; you can follow development at GitHub
  • I’ve also added a long-overdue about page with some extra information, including the fact that I wrote it :)

Now I just need to fix up my Git repositories. Currently there’s one which pushes to GitHub and a second, with a copy of the Sinatra code for pushing to Heroku, which isn’t too smart.

BioRuby development: feedback on using Git

Everyone likes constructive feedback. I received a couple of great comments on my previous post, which warrant a brief discussion.

@vlandham points out that when the main BioRuby repository updates, you’ll want to update your local repository. Using git, you do that by adding a remote which points to the original repository, from which you can fetch updates and merge with your local version:

git remote add upstream https://github.com/bioruby/bioruby.git
# fetch/merge only when main repo updates
git fetch upstream
git merge upstream master

This is described at the GitHub help page Fork A Repo.

Michael points to an article titled A successful Git branching model. It suggests that when developing new features you create a feature branch (also called topic branch). This can help with the management of new features and creates a more complete commit history if/when the new feature is merged back into your development repository. The article also suggests a main branch for development named develop, rather than the default master.

I haven’t quite got my head around all the ins-and-outs of the article yet, but it’s well worth a read.

A beginner’s guide to BioRuby development

I’m the “biologist-turned-programmer” type of bioinformatician which makes me a hacker, not a developer. Most of the day-to-day coding that I do goes something like this:

Colleague: Hey Neil, can you write me a script to read data from file X, do Y to it and output a table in file Z?
Me: Sure… (clickety-click, hackety-hack…) …there you go.
Colleague: Great! Thanks.

I’m a big fan of the Bio* projects and have used them for many years, beginning with Bioperl and more recently, BioRuby. And I’ve always wanted to contribute some code to them, but have never got around to doing so. This week, two thoughts popped into my head:

  • How hard can it be?
  • There isn’t much introductory documentation for would-be Bio* developers

The answer to the first question is: given some programming experience, not very hard at all. This blog post is my attempt to address the second thought, by writing a step-by-step guide to developing a simple class for the BioRuby library. When I say “beginner’s guide”, I’m referring to myself as much as anyone else.
Read the rest…

Friday fun projects

What’s a “Friday fun project”? It’s a small computing project, perfect for a Friday afternoon, which serves the dual purpose of (1) keeping your programming/data analysis skills sharp and (2) providing a mental break from the grind of your day job. Ideally, the skills learned on the project are useful and transferable to your work projects.

This post describes one of my Friday fun projects: how good is the Sydney weather forecast? There’ll be some Ruby, some R and some MongoDB. Bioinformatics? No, but some thoughts on that at the end.
Read the rest…

Ruby Version Manager: the absolute basics

Confession: I’m still using Ruby version 1.8.7, from the Ubuntu 10.10 repository. Why? I have a lot of working code, I don’t know how well it works using Ruby 1.9 and I’m worried that migration will break things and make me miserable.

Various people have pointed me to RVM – Ruby Version Manager. As the name suggests, it allows you to manage multiple Ruby versions on your machine. Today, I needed to test an application written for Ruby 1.9.2, so I used RVM for the first time. Here are the absolute basics, for anyone who just wants to test some code using Ruby 1.9, without messing up their existing system:

# install rvm
bash < <( curl http://rvm.beginrescueend.com/releases/rvm-install-head )
# add it to your .bashrc
echo '[[ -s "$HOME/.rvm/scripts/rvm" ]] && source "$HOME/.rvm/scripts/rvm"' >> ~/.bashrc
# add tab completion to your .bashrc
echo '[[ -r $rvm_path/scripts/completion ]] && . $rvm_path/scripts/completion' >> ~/.bashrc
# source the script; check installation
source ~/.rvm/scripts/rvm
rvm -v
# returns (e.g.) rvm 1.2.8 by Wayne E. Seguin (wayneeseguin@gmail.com) [http://rvm.beginrescueend.com/]
# install ruby 1.9.2
rvm install 1.9.2
# use ruby 1.9.2
rvm use 1.9.2
ruby -v
# returns (e.g.) ruby 1.9.2p180 (2011-02-18 revision 30909) [i686-linux]
# do stuff as normal (install gems etc.)
# when you're done, switch back to system ruby
rvm system
ruby -v
# returns (e.g.) ruby 1.8.7 (2010-06-23 patchlevel 299) [i686-linux]

That’s it. The key thing is that RVM sets up Ruby in $HOME/.rvm so for example, when using version 1.9.2 under RVM, gem install GEMNAME will install to $HOME/.rvm/gems/ruby-1.9.2-p180/gems/. Your system files are untouched.

Getting “stuff” into MongoDB

One of the aspects I like most about MongoDB is the “store first, ask questions later” approach. No need to worry about table design, column types or constant migrations as design changes. Provided that your data are in some kind of hash-like structure, you just drop them in.

Ruby is particularly useful for this task, since it has many gems which can parse common formats into a hash. Here are 3 quick examples with relevance to bioinformatics.
Read the rest…