April 22, 2013

How to: remember that you once knew how to parse KEGG

Recently, someone asked me if I could generate a list of genes associated with a particular pathway. “Sure,” I said, and hacked together some rather nasty R code which, given a KEGG pathway identifier, used a combination of the KEGG REST API, DBGET and biomaRt to return HGNC symbols.

Coincidentally, someone asked the same question at Biostar. Pierre recommended the TogoWS REST service, which provides an API to multiple biological data sources. An article describing TogoWS was published in 2010.

An excellent suggestion – and one which, I later discovered, I had bookmarked. Twice. As long ago as 2008. This “rediscovery of things I once knew” happens to me with increasing frequency now, which makes me wonder whether (1) we really are drowning in information, (2) my online curation tools/methods require improvement or (3) my mind is not what it was. Perhaps some combination of all three.

Anyway – using Ruby (1.8.7), getting a list of HGNC symbols for a KEGG pathway, e.g. MAPK signaling, is as simple as:

require 'rubygems'
require 'open-uri'
require 'json/pure'

j = JSON.parse(open("http://togows.dbcls.jp/entry/pathway/hsa04010/genes.json").read)
g = j.first.values.map {|v| /^(.*?);/.match(v)[1] }
# first 5 genes
g[0..4]
# ["MAP3K14", "FGF17", "FGF6", "DUSP9", "MAP3K6"]

This code parses the JSON returned from TogoWS into an array with one element; the element is a hash with key/value pairs of the form:

"9020"=>"MAP3K14; mitogen-activated protein kinase kinase kinase 14 [KO:K04466] [EC:2.7.11.25]"

Values for all keys that I’ve seen to date begin with the HGNC symbol followed by a semicolon, making extraction quite straightforward with a simple regular expression.
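
Since "all keys that I’ve seen to date" is not a guarantee, here is a minimal offline sketch of the same extraction that tolerates a value with no semicolon. The first entry is the example shown above; the second entry, its key and the helper name symbol_from are made up purely for illustration and are not part of TogoWS or KEGG.

```ruby
# Hypothetical helper: return the text before the first semicolon,
# or nil when the value does not contain one.
def symbol_from(value)
  m = /^(.*?);/.match(value)
  m && m[1]
end

entries = {
  # real example entry from the TogoWS response shown above
  "9020" => "MAP3K14; mitogen-activated protein kinase kinase kinase 14 [KO:K04466] [EC:2.7.11.25]",
  # invented entry with no semicolon, to exercise the nil case
  "0000" => "an invented description with no symbol prefix"
}

# compact drops the nil produced by the entry without a semicolon
symbols = entries.values.map { |v| symbol_from(v) }.compact
puts symbols.inspect
```

The one-liner above would raise on the second entry, because match returns nil and nil has no [] method; here that entry is simply skipped. If you run the original against the live service on a current Ruby, note that open-uri now expects URI.open rather than a bare open.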

April 4, 2013

A brief note: R 3.0.0 and bioinformatics

Today marks the release of R 3.0.0. There will be plenty of commentary and useful information at sites such as R-bloggers (for example, Tal’s post).

Version 3.0.0 is great news for bioinformaticians, due to the introduction of long vectors. What does that mean? Well, several months ago, I was using the simpleaffy package from Bioconductor to normalize Affymetrix exon microarrays. I began as usual by reading the CEL files:

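library(simpleaffy)  # also attaches affy, which provides ReadAffy()
# pattern is a regular expression: escape the dots and anchor the end
f <- list.files(path = "data/affyexon", pattern = "\\.CEL\\.gz$", full.names = TRUE, recursive = TRUE)
cel <- ReadAffy(filenames = f)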

When this happened:

Error in read.affybatch(filenames = l$filenames, phenoData = l$phenoData,  : 
  allocMatrix: too many elements specified

I had a relatively large number of samples (337), but figured a 64-bit machine with ~ 100 GB RAM should be able to cope. I was wrong: before 3.0.0, R hard-coded the maximum length of a vector (and so of a matrix) at 2^31 - 1 elements, and my matrix had exceeded that limit regardless of available memory. See this post and this StackOverflow question for the computational details.
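
For a rough sense of scale, here is the arithmetic as a small Ruby sketch. The 2^31 - 1 ceiling is R’s pre-3.0.0 limit on the number of elements in a vector; the helper name fits_old_limit? and the framing as “cells per sample” are mine, for illustration only. It says nothing about the actual size of my CEL files, only how large each of 337 columns could have been before allocMatrix gave up.

```ruby
# Hypothetical helper: would a rows x cols matrix fit in one pre-3.0.0 R vector?
# The default limit is 2^31 - 1, the old maximum vector length.
def fits_old_limit?(rows, cols, limit = 2**31 - 1)
  rows * cols <= limit
end

samples = 337

# Integer division: the most values per sample that still fit with 337 columns
max_cells_per_sample = (2**31 - 1) / samples

puts max_cells_per_sample
puts fits_old_limit?(max_cells_per_sample, samples)
puts fits_old_limit?(max_cells_per_sample + 1, samples)
```

So anything beyond roughly 6.4 million values per sample tips a 337-column matrix over the old limit, however much RAM is installed.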

My solution at the time was to resort to Affymetrix Power Tools. Hopefully, the introduction of long vectors will make Bioconductor even more capable and useful.

March 27, 2013

Git for bioinformaticians at the Bioinformatics FOAM meeting

Last week, I attended the annual Computational and Simulation Sciences and eResearch Conference, hosted by CSIRO in Melbourne. The meeting includes a workshop that we call Bioinformatics FOAM (Focus On Analytical Methods). This year it was run over 2.5 days (up from the previous 1.5 by popular request); one day for internal CSIRO stuff and the rest open to external participants.

I had the pleasure of giving a brief presentation on the use of Git in bioinformatics. Nothing startling; aimed squarely at bioinformaticians who may have heard of version control in general and Git in particular but who are yet to employ either. I’m excited because for once I am free to share, resulting in my first upload to Slideshare in almost 4.5 years. You can view it here, or at the Australian Bioinformatics Network Slideshare, or in the embed below.

See the slides…

March 18, 2013

The end of Google Reader: a scientist’s perspective

Since 2005, I have started almost every working day by using one Web application – an application that occupies a permanent browser tab on my work and home desktop machines. That application is Google Reader.

If you’re reading this, you’re probably aware that Google Reader will cease to exist from July 1, 2013. Others have ranted, railed against the corporate machine and expressed their sadness. I thought I’d try to explain why, for this working scientist at least, RSS and feed readers are incredibly useful tools which I think should be valued highly.

Read the rest…

February 26, 2013

R/ggplot2 tip: aes_string

I’m a big fan of ggplot2. Recently, I ran into a situation which called for a useful feature that I had not used previously: aes_string.
Read the rest…

February 13, 2013

Basic R: rows that contain the maximum value of a variable

File under “I keep forgetting how to do this basic, frequently-required task, so I’m writing it down here.”

Let’s create a data frame which contains five variables, vars, named A – E, each of which appears twice, along with some measurements:

df.orig <- data.frame(vars = rep(LETTERS[1:5], 2), obs1 = 1:10, obs2 = 11:20)
df.orig
#    vars obs1 obs2
# 1     A    1   11
# 2     B    2   12
# 3     C    3   13
# 4     D    4   14
# 5     E    5   15
# 6     A    6   16
# 7     B    7   17
# 8     C    8   18
# 9     D    9   19
# 10    E   10   20

Now, let’s say we want only the rows that contain the maximum values of obs1 for A – E. In bioinformatics, for example, we might be interested in selecting the microarray probeset with the highest sample variance from multiple probesets per gene. The answer is obvious in this trivial example (rows 6 – 10), but one procedure looks like this:
Read the rest…

February 12, 2013

Genes x Samples: please explain

One of my bioinformatics pet peeves involves statements like this one, from the CNAmet user guide:

Inputs to CNAmet are three m x n matrices, where m is the number of genes and n the number samples

What we’re looking at here is the hot, but poorly-defined topic of data integration, in which biological measurements from two or more different platforms are somehow combined in a way that provides more information than each platform separately. Read any paper on this topic, download the software and you’ll find example datasets containing two or more matched matrices, with rows where measurements have been summarized to a “gene”. What you won’t find, typically, is a detailed explanation of the summarization procedure that you could implement yourself.

Read the rest…

February 6, 2013

Lots of “open goodness” in the AU/NZ region

January/February are exciting months for open [data|research|science|access] proponents in our region – by which I mean Australia and New Zealand.

First, we’ve enjoyed a speaking tour by Sir Tim Berners-Lee, during which he discussed the benefits of open data several times. I was able to attend two events in Sydney in person and a third, linux.conf.au, by video stream. The events were the work of many people but in particular, Pia Waugh. Go follow her on Twitter, now.

Next – I wish I had been able to get to this one – the Open Research Conference on February 6-7, University of Auckland. I’m enjoying the high-quality live stream right now. Flying the flag for Sydney are Mat and Alex.

Not strictly under the “open” umbrella but worth a mention anyway: Software Carpentry is in town, February 7-8, just up the road from me at Macquarie University. Looking forward to hearing some reports from that.

January 31, 2013

It’s #overlyhonestmethods come to life!

Retraction Watch reports a study of microarray data sharing. The article, published in Clinical Chemistry, is itself behind a paywall despite trumpeting the virtues of open data. So straight to the Open Access Irony Award group at CiteULike it goes.

I was not surprised to learn that the rate of public deposition of data is low, nor that most deposited data ignores standards and much of it is of low quality. What did catch my eye, though, was a retraction notice for one of the articles from the study, in which the authors explain the reason for retraction.
Read the rest…

January 11, 2013

The future of science publishing from 1996

Floating by in the Twitter stream, this from @leonidkruglyak. It leads to a light-hearted opinion(ated) piece by Sydney Brenner in Current Biology, 1996.

In 1996, you may recall, the Web was just a few years old. Amusingly (sadly?), it seems that Brenner predicted many of the topics in science publishing that we’re still discussing in 2013. It’s just that he thought they would be implemented in no time at all.

For example, open refereeing:

It is incidents such as this that have led me to question whether the anonymity of referees needs to be guarded so closely

Self-publishing/archiving and post-publication peer review:

The electronic pre-print with open discussion (not refereeing) will soon become commonplace; in fact, labs could go into the publication business by themselves

Demise of the journal impact factor, publishing economics and altmetrics:

We will need something to substitute for the present ratings given to papers appearing in ‘superior, peer-reviewed publications’ (and commercial publishers will find ways of making people pay for this)

Perhaps we should have a readership index; it should not be beyond the wit of man to devise a way of recording whenever a paper is read, hard-copied or cited

As Ethan said:
