Where N is an arbitrarily large fraction approaching one

What’s N? It’s the fraction of time that bioinformaticians spend obtaining, formatting and getting raw data ready to use, as opposed to analysing it.

There’ll be a longer post on this topic soon. Suffice it to say, I’ve spent the last month evaluating the performance of 5 predictive tools that are available on the web. To do this, a test dataset of 200 or so sequences had to be submitted to each one. Each tool generates a score for particular residues in the sequence. The final output, which is what I require to do some statistical analysis, looks something like this:

P08153  114     method     61.74   0
P08153  522     method     82.10   1

where we have a UniProt accession number, a sequence position, the name of the tool used (method), a score, and either 1 (a positive instance) or 0 (a negative instance).
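
Once the data is in that shape, of course, it’s trivial to consume. A minimal sketch (the filename and the assumption of whitespace-delimited fields are mine) that collects the scores and labels per method, ready for the statistics:

use strict;
use warnings;

# Minimal sketch: read the delimited output above (filename is hypothetical)
# and collect (score, label) pairs per method for the downstream statistics
my %by_method;
open my $fh, '<', 'scores.txt' or die "scores.txt: $!";
while (my $line = <$fh>) {
    chomp $line;
    my ($acc, $pos, $method, $score, $label) = split /\s+/, $line;
    push @{ $by_method{$method} }, [ $score, $label ];
}
close $fh;

printf "%s: %d scores\n", $_, scalar @{ $by_method{$_} } for sort keys %by_method;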

Doesn’t look too hard, does it? Except that:

  • None of the web servers provide web services or APIs
  • None of them provide standalone software for download
  • Most of them don’t generate easily-parsed output (delimited plain text)
  • Most of them have limited batch upload and processing capabilities

The solution, as always, is to hack together several hundred lines of poorly-written Perl (using HTML::Form in my case) to send each sequence to the server, grab the HTML that comes back, parse it and write out text files in the format shown above.
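
The guts of each of those scripts look something like the sketch below. The URL, the form field name and the regular expression for the results table are all invented for illustration, since every server needs its own variation, but the shape is always the same: fetch the form, fill it in, click, scrape.

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::Form;

# Everything server-specific below (URL, form field, results markup) is
# made up for illustration; each real web server needs its own version
my $url = 'http://predictor.example.org/submit';
my $acc = 'P08153';
my $seq = 'MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ';    # toy sequence

my $ua   = LWP::UserAgent->new;
my $page = $ua->get($url);
die $page->status_line, "\n" unless $page->is_success;

# Fill in the submission form and press the button
my ($form) = HTML::Form->parse($page->decoded_content, $page->base);
$form->value('sequence', $seq);
my $result = $ua->request($form->click);

# Scrape position/score pairs out of the returned HTML - the ugly,
# server-specific part; the 0/1 label gets added later from the test set
for my $line (split /\n/, $result->decoded_content) {
    next unless $line =~ m{<td>(\d+)</td>\s*<td>([\d.]+)</td>};
    print join("\t", $acc, $1, 'method', $2), "\n";
}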

That’s 3-4 weeks and 500 lines of throwaway code just to get the raw data into the right state for analysis.

When I started out in bioinformatics, I used to joke that at least 50% of my time was spent just obtaining raw data and formatting files. Over the years, I’ve revised my estimate. It’s currently at around 80-90% and I’m not sure that it’s still a joke.

Why is the trend moving in the wrong direction? When does it become untenable? I’m starting to think that my job title should be “data munger”, not “research officer”. I wouldn’t mind if data munging were perceived as a skill in academia, but when funding is results-based, it will only ever be seen as the means to an end. Which it is, of course.

14 thoughts on “Where N is an arbitrarily large fraction approaching one”

  1. Thank you for writing this. It’s nice to know that when I spend long days and nights parsing data from one format to another, I’m not alone.

    So what’s the solution? Better standards? It seems like every bioinformatics project out there comes up with a new schema. Is it possible to define (yet another) flexible data format that will cover many applications? Would people use it even if we could? Will the entry of big tech companies like Google and Microsoft impose some order on the biomedical field?

  2. I usually say that parsing data is the bioinformatics equivalent of pipetting and preparing buffers. In our case, it would be much easier if databases and services were set up right.

    Can we play guess the method? pSTY :)

  3. That’s a nice analogy. And yes, pSTY comes into it. This cautious blogging about unpublished stuff really isn’t my style. Oh, for an open science world.

  4. Pingback: » Learning from tech - Better APIs »

  5. Absolutely agree. The vast, vast majority of my time as a bioinformaticist revolves around the letter P. Parsing and porting with Perl, PHP and Python. 90% of my time spent doing soul-destroying gruntwork sometimes seems like an optimistic estimate.

  6. I would not use algorithms that are only available as web servers. If there is no source code, I’d rather ignore the algorithm. Creating your own predictor might be more worthwhile than hacking away with HTML::Form. Advantages:

    You will have a fast program that you can run on huge datasets for a background estimation

    You will understand the algorithm well

    The data format will be perfect for your needs

    It all depends, of course, on whether you would be able to do this in 4 weeks… :-)

    Max

  7. Max, I’d normally agree with you. In this case, though, the web server comparison was requested by referees for a paper :( Clearly they’ve never tried it themselves.

  8. Pingback: More on data munging… « Open.nfo

  9. There is a simple solution and I would love to discuss this in more detail on nodalpoint: Every damn author should be obliged to put the source code as supplementary information. A lot of trouble could be spared and the whole field would benefit from this kind of openness. I frankly don’t care as much about open access as about accessible code complete with all the data, in one big zipfile, such that everyone can reproduce these results and possibly find the bug that is responsible for them. Or everyone could at least copy a part and build something new instead of re-writing all the time the ideas of others just because their code is unavailable/not explained / lacking the data from the paper.

  10. @Max: “…should be obliged to put the source code as supplementary information…instead of re-writing all the time the ideas of others just because their code is unavailable/not explained / lacking the data from the paper.”

    Maybe we could take over all or part of the SIGMOD/PODS (top database conferences) Experimental Repeatability Requirements:
    http://www.sigmod08.org/sigmod_research.shtml
    The aim is: “To help published papers achieve an impact and stand as reliable reference-able works for future research, the SIGMOD 2008 reviewing process includes an assessment of the extent to which the presented experiments are repeatable by someone with access to all the required hardware, software, and test data. Thus, we attempt to establish that the code developed by the authors exists, runs correctly on well-defined inputs, and performs in a manner compatible with that presented in the paper.”

    It won’t solve the data munging problem (good software development and probably a few standards might), but it might at least make it less cumbersome.

  11. I’m glad I’m not the only one that feels this way. Most of my time is spent cleaning up data, and then serving it up to others in the group to analyze. Managing/munging the data is a job in itself. I wish I had the time to think about the biology.

  12. Pingback: Bioinformatics Zen » How to avoid errors when processing CSV files

  13. Pingback: Software portability and virtual appliances « Freelancing science
