Archive for ‘bioinformatics’

April 22, 2013

How to: remember that you once knew how to parse KEGG

Recently, someone asked me if I could generate a list of genes associated with a particular pathway. Sure, I said, and hacked together some rather nasty code in R which, given a KEGG pathway identifier, used a combination of the KEGG REST API, DBGET and biomaRt to return HGNC symbols.

Coincidentally, someone asked the same question at Biostar. Pierre recommended the TogoWS REST service, which provides an API to multiple biological data sources. An article describing TogoWS was published in 2010.

An excellent suggestion – and one which, I later discovered, I had bookmarked. Twice. As long ago as 2008. This “rediscovery of things I once knew” happens to me with increasing frequency now, which makes me wonder whether (1) we really are drowning in information, (2) my online curation tools/methods require improvement or (3) my mind is not what it was. Perhaps some combination of all three.

Anyway – using Ruby (1.8.7), a list of HGNC symbols given a KEGG pathway, e.g. MAPK signaling, is as simple as:

require 'rubygems'
require 'open-uri'
require 'json/pure'

j = JSON.parse(open("http://togows.dbcls.jp/entry/pathway/hsa04010/genes.json").read)
g = j.first.values.map {|v| /^(.*?);/.match(v)[1] }
# first 5 genes
g[0..4]
# ["MAP3K14", "FGF17", "FGF6", "DUSP9", "MAP3K6"]

This code parses the JSON returned from TogoWS into an array with one element; the element is a hash with key/value pairs of the form:

"9020"=>"MAP3K14; mitogen-activated protein kinase kinase kinase 14 [KO:K04466] [EC:2.7.11.25]"

Values for all keys that I’ve seen to date begin with the HGNC symbol followed by a semicolon, making extraction quite straightforward with a simple regular expression.
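Since "all keys that I've seen to date" is not quite a guarantee, here is a slightly more defensive sketch of the same extraction. It is written for a current Ruby rather than 1.8.7 and runs offline against a small stand-in for the TogoWS response. The first entry is the one shown above; the second entry, the `SAMPLE` constant and the helper names `symbol_from` and `symbols` are made up for illustration.

```ruby
require 'json'

# Stand-in for the TogoWS response: an array holding one hash of
# gene ID => description. The first entry is copied from the post;
# the second is a HYPOTHETICAL entry with no semicolon, to exercise the fallback.
SAMPLE = <<~JSON
  [{"9020": "MAP3K14; mitogen-activated protein kinase kinase kinase 14 [KO:K04466] [EC:2.7.11.25]",
    "0000": "hypothetical entry without a symbol"}]
JSON

# Return the text before the first semicolon, or nil when there is none.
def symbol_from(description)
  m = /\A(.*?);/.match(description)
  m && m[1].strip
end

# Parse the JSON, take the single hash inside the array and collect the
# symbols, silently dropping any value that did not match.
def symbols(json_text)
  JSON.parse(json_text).first.values.map { |v| symbol_from(v) }.compact
end

p symbols(SAMPLE)
```

Dropping non-matching values is only one possible choice; raising an error instead would make a change in the response format much harder to miss.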

April 4, 2013

A brief note: R 3.0.0 and bioinformatics

Today marks the release of R 3.0.0. There will be plenty of commentary and useful information at sites such as R-bloggers (for example, Tal’s post).

Version 3.0.0 is great news for bioinformaticians, due to the introduction of long vectors. What does that mean? Well, several months ago, I was using the simpleaffy package from Bioconductor to normalize Affymetrix exon microarrays. I began as usual by reading the CEL files:

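library(simpleaffy)  # loads affy, which provides ReadAffy()
# pattern is a regular expression, so escape the dots and anchor the suffix
f <- list.files(path = "data/affyexon", pattern = "\\.CEL\\.gz$", full.names = TRUE, recursive = TRUE)
cel <- ReadAffy(filenames = f)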

When this happened:

Error in read.affybatch(filenames = l$filenames, phenoData = l$phenoData,  : 
  allocMatrix: too many elements specified

I had a relatively large number of samples (337), but figured a 64-bit machine with ~ 100 GB RAM should be able to cope. I was wrong: before version 3.0.0, R vectors (and therefore matrices) were limited to 2^31 - 1 elements, so my matrix had become too large regardless of available memory. See this post and this StackOverflow question for the computational details.
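A back-of-envelope check makes the problem concrete. The sketch below is in Ruby purely for the arithmetic. It assumes each exon array CEL file holds a 2560 x 2560 grid of cells and that the intensity matrix stores one value per cell per sample; neither figure is stated above, so treat both as assumptions. The constant and function names are mine.

```ruby
# Largest number of elements an R vector could hold before 3.0.0: 2^31 - 1.
R_MAX_ELEMENTS = 2**31 - 1

# ASSUMPTION: 2560 x 2560 intensity cells per CEL file (not stated in the post).
CELLS_PER_ARRAY = 2560 * 2560

# Number of samples, taken from the post.
SAMPLES = 337

# Elements in a cells x samples intensity matrix.
def matrix_elements(cells, samples)
  cells * samples
end

# Largest sample count whose matrix still fits under the limit
# (integer division rounds down).
def max_samples(cells, limit = R_MAX_ELEMENTS)
  limit / cells
end

total = matrix_elements(CELLS_PER_ARRAY, SAMPLES)
puts "elements needed: #{total}"
puts "limit:           #{R_MAX_ELEMENTS}"
puts "fits?            #{total <= R_MAX_ELEMENTS}"
puts "max samples:     #{max_samples(CELLS_PER_ARRAY)}"
```

Under those assumptions, 337 samples need 2,208,563,200 elements against a limit of 2,147,483,647, and the cut-off falls at 327 samples, however much RAM is installed.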

My solution at the time was to resort to Affymetrix Power Tools. Hopefully, the introduction of long vectors will make Bioconductor even more capable and useful.

March 27, 2013

Git for bioinformaticians at the Bioinformatics FOAM meeting

Last week, I attended the annual Computational and Simulation Sciences and eResearch Conference, hosted by CSIRO in Melbourne. The meeting includes a workshop that we call Bioinformatics FOAM (Focus On Analytical Methods). This year it was run over 2.5 days (up from the previous 1.5 by popular request); one day for internal CSIRO stuff and the rest open to external participants.

I had the pleasure of giving a brief presentation on the use of Git in bioinformatics. Nothing startling; aimed squarely at bioinformaticians who may have heard of version control in general and Git in particular but who are yet to employ either. I’m excited because for once I am free to share, resulting in my first upload to Slideshare in almost 4.5 years. You can view it here, or at the Australian Bioinformatics Network Slideshare, or in the embed below.

See the slides…

February 12, 2013

Genes x Samples: please explain

One of my bioinformatics pet peeves involves statements like this one, from the CNAmet user guide:

Inputs to CNAmet are three m x n matrices, where m is the number of genes and n the number samples

What we’re looking at here is the hot but poorly defined topic of data integration, in which biological measurements from two or more different platforms are somehow combined in a way that provides more information than each platform separately. Read any paper on this topic, download the software and you’ll find example datasets containing two or more matched matrices, with rows where measurements have been summarized to a “gene”. What you won’t find, typically, is a detailed explanation of the summarization procedure that you could implement yourself.

Read the rest…

January 31, 2013

It’s #overlyhonestmethods come to life!

Retraction Watch reports a study of microarray data sharing. The article, published in Clinical Chemistry, is itself behind a paywall despite trumpeting the virtues of open data. So straight to the Open Access Irony Award group at CiteULike it goes.

I was not surprised to learn that the rate of public deposition of data is low, nor that most deposited data ignores standards and much of it is low quality. What did catch my eye, though, was a retraction notice for one of the articles from the study, in which the authors explain the reason for retraction.
Read the rest…

October 22, 2012

Gene name errors and Excel: lessons not learned

June 23, 2004. BMC Bioinformatics publishes “Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics”. We roll our eyes. Do people really do that? Is it really worthy of publication? However, we admit that if it happens then it’s good that people know about it.

October 17, 2012. A colleague on our internal Yammer network writes:
Read the rest…

August 16, 2012

Twitter coverage of the ISMB 2012 meeting: some statistics

OK, let’s do this: some statistics and visualization of the tweets for ISMB 2012.

Read the rest…

August 13, 2012

ISMB 2012 on Twitter: here today, gone tomorrow

In previous years, when FriendFeed was used as the micro-blogging platform for the annual ISMB meeting, I’ve written a post describing some statistical analysis of the conference coverage. Here’s my post from last year.

This year, it appears that the majority of the conference coverage happened at Twitter, using the #ISMB hashtag. Here’s what happened on July 18th when I used the R package twitteR to retrieve ISMB-related tweets for July 13/14:

library(twitteR)
ismb1 <- searchTwitter("#ISMB", since = "2012-07-13", until = "2012-07-14")
length(ismb1)
# [1] 383

383 tweets. Here’s what happened when I ran the same query today:

library(twitteR)
ismb1 <- searchTwitter("#ISMB", since = "2012-07-13", until = "2012-07-14")
length(ismb1)
# [1] 0

Zero tweets. Indeed, run the same query via the Twitter web interface and you’ll see only a handful of tweets, along with the message “Older Tweet results for #ismb are unavailable.”

So far as Twitter is concerned, ISMB 2012 never happened. Or if it did, the data are buried away in a data centre, inaccessible to the likes of you and me. Did you ever hear anything more about that plan to archive every Tweet at the Library of Congress? Neither did I. I very much doubt that it’s going to happen.

I think Twitter is great – for broadcasting short pieces of information, such as useful URLs, in near real-time. For conference coverage which benefits from threaded conversation, longer comments and archiving, I think it’s rubbish.

On July 18 I did manage to retrieve 3162 Tweets for ISMB 2012, created between July 13 and July 17. I’ll write about them in a forthcoming post. All I’ll say for now is – lucky I was able to grab them when I did.

July 23, 2012

We really don’t care what statistical method you used

Update: as pointed out in the comments, the amusing error in this article has been “corrected” (or at least, “edited away”). Thanks for your interest.
Update: I note that this article is now “Highly Accessed” ;)

An integrative analysis of DNA methylation and RNA-Seq data for human heart, kidney and liver
BMC Systems Biology 2011, 5(Suppl 3):S4

(insert statistical method here). No, really.

With thanks to Simon J Greenhill and Dave Winter.

July 18, 2012

Fixed that for you

A comprehensive framework for prioritizing variants in exome sequencing studies of Mendelian diseases.

F1.fixed

Why not also highlight the key genes by colouring cells in red?

You’re welcome.
