Where N is an arbitrarily large fraction approaching one

What’s N? It’s the fraction of time that bioinformaticians spend obtaining, formatting and getting raw data ready to use, as opposed to analysing it.

There’ll be a longer post on this topic soon. Suffice to say, I’ve spent the last month evaluating the performance of 5 predictive tools that are available on the web. To do this, a test dataset of 200 or so sequences had to be submitted to each one. Each tool generates a score for particular residues in the sequence. The final output, which is what I require to do some statistical analysis, looks something like this:

P08153  114     method     61.74   0
P08153  522     method     82.10   1

where we have a sequence UniProt accession number, a sequence position, the name of the tool used (method), a score and either 1 (a positive instance) or 0 (a negative instance).
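As a rough illustration of why this target format is convenient, here is a minimal Python sketch that reads such lines into typed records. The `Prediction` class and `parse_line` function are hypothetical names made up for this example, and the sketch assumes whitespace-delimited columns and a method name that contains no spaces.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    """One row of the five-column output (hypothetical record type)."""
    accession: str  # UniProt accession number
    position: int   # residue position in the sequence
    method: str     # name of the predictive tool
    score: float    # score reported by the tool
    label: int      # 1 = positive instance, 0 = negative instance


def parse_line(line: str) -> Prediction:
    # Assumes exactly five whitespace-separated fields per line.
    accession, position, method, score, label = line.split()
    return Prediction(accession, int(position), method, float(score), int(label))


sample = """P08153  114     method     61.74   0
P08153  522     method     82.10   1"""

# Skip blank lines, parse everything else.
records = [parse_line(line) for line in sample.splitlines() if line.strip()]
```

Once the data is in this shape, the statistics are the easy part; getting it there is the problem.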

Doesn’t look too hard, does it? Except that:

  • None of the web servers provide web services or APIs
  • None of them provide standalone software for download
  • Most of them don’t generate easily-parsed output (delimited plain text)
  • Most of them have limited batch upload and processing capabilities

The solution, as always, is to hack together several hundred lines of poorly-written Perl (using HTML::Form in my case) to send each sequence to the server, grab the HTML that comes back, parse it and write out text files in the format shown above.
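The parsing half of that job might look something like the following sketch, written in Python rather than Perl and using only the standard library. Everything about the HTML here is an assumption: the example pretends a server returns a simple two-column table of position and score, which real servers rarely do. The names `TableCellParser` and `html_to_rows` are hypothetical, and the set of known positive positions is assumed to come from the test dataset rather than from the server.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collects the text of <td> cells, one list per <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:  # rows with no <td> cells (e.g. headers) are skipped
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell and self._row:
            self._row[-1] += data.strip()


def html_to_rows(html, accession, method, positives):
    """Turn the assumed position/score table into tab-delimited lines."""
    parser = TableCellParser()
    parser.feed(html)
    lines = []
    for position, score in parser.rows:
        label = 1 if int(position) in positives else 0
        lines.append(
            f"{accession}\t{int(position)}\t{method}\t{float(score):.2f}\t{label}"
        )
    return lines


# Made-up server response, for illustration only.
example_html = """<table>
<tr><th>Position</th><th>Score</th></tr>
<tr><td>114</td><td>61.74</td></tr>
<tr><td>522</td><td>82.10</td></tr>
</table>"""

rows = html_to_rows(example_html, "P08153", "method", positives={522})
```

In practice each server needs its own variant of this, plus the form submission step, which is where the several hundred lines come from.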

That’s 3-4 weeks and 500 lines of throwaway code just to get the raw data in the right state for analysis.

When I started out in bioinformatics, I used to joke that at least 50% of my time was spent just obtaining raw data and formatting files. Over the years, I’ve revised my estimate. It’s currently at around 80-90% and I’m not sure that it’s still a joke.

Why is this trend in the wrong direction? When does it become untenable? I’m starting to think that my job title should be “data munger”, not “research officer”. I wouldn’t mind if data munging was perceived as a skill in academia but when funding is results-based, it will only ever be seen as the means to an end. Which it is, of course.

14 thoughts on “Where N is an arbitrarily large fraction approaching one”

  1. Thank you for writing this. It’s nice to know that when I spend long days and nights parsing data from one format to another, I’m not alone.

    So what’s the solution? Better standards? It seems like every bioinformatics project out there comes up with a new schema. Is it possible to define (yet another) flexible data format that will cover many applications? Would people use it even if we could? Will the entry of big tech companies like Google and Microsoft impose some order on the biomedical field?

  2. I usually say that parsing data is the bioinformatics equivalent of pipetting and preparing buffers. In our case it would be much easier if databases and services were set up right.

    Can we play guess the method ? pSTY :)

  3. That’s a nice analogy. And yes, pSTY comes into it. This cautious blogging about unpublished stuff really isn’t my style. Oh, for an open science world.

  4. Pingback: » Learning from tech - Better APIs »

  5. Absolutely agree. The vast, vast majority of my time as a bioinformaticist revolves around the letter P. Parsing and porting with Perl, PHP and Python. 90% of my time spent doing soul-destroying gruntwork sometimes seems like an optimistic estimate.

  6. I would not use algorithms that are only available as web servers. If there is no source code, I’d rather ignore the algorithm. Creating your own predictor might be more worthwhile than hacking away with HTML::Form. Advantages:

    You will have a fast program that you can run on huge datasets for a background estimation

    You will understand the algorithm well

    The data format will be perfect for your needs

    It all depends, of course, on whether you would be able to do this in 4 weeks… :-)


  7. Max, I’d normally agree with you. In this case though, the webserver comparison was requested by referees for a paper :( Clearly they’ve never tried it themselves.

  8. Pingback: More on data munging… « Open.nfo

  9. There is a simple solution and I would love to discuss this in more detail on nodalpoint: Every damn author should be obliged to put the source code as supplementary information. A lot of trouble could be spared and the whole field would benefit from this kind of openness. I frankly don’t care as much about open access as about accessible code complete with all the data, one big zipfile such that everyone can reproduce these results and possibly find the bug that is responsible for them. Or everyone could at least copy a part and build something new instead of re-writing all the time the ideas of others just because their code is unavailable/not explained / lacking the data from the paper.

  10. @Max: “…should be obliged to put the source code as supplementary information…instead of re-writing all the time the ideas of others just because their code is unavailable/not explained / lacking the data from the paper.”

    Maybe by taking over all or part of the SIGMOD/PODS (top database conferences) Experimental Repeatability Requirements.
    The aim: to help published papers achieve an impact and stand as reliable reference-able works for future research, the SIGMOD 2008 reviewing process includes an assessment of the extent to which the presented experiments are repeatable by someone with access to all the required hardware, software, and test data. Thus, we attempt to establish that the code developed by the authors exists, runs correctly on well-defined inputs, and performs in a manner compatible with that presented in the paper.

    it won’t solve the data munging (good software development and probably a few standards may), but might make it at least less cumbersome.

  11. I’m glad I’m not the only one that feels this way. Most of my time is spent cleaning up data, and then serving it up to others in the group to analyze. Managing/munging the data is a job in itself. I wish I had the time to think about the biology.

  12. Pingback: Bioinformatics Zen » How to avoid errors when processing CSV files

  13. Pingback: Software portability and virtual appliances « Freelancing science
