Where N is an arbitrarily large fraction approaching one

What’s N? It’s the fraction of time that bioinformaticians spend obtaining, formatting and getting raw data ready to use, as opposed to analysing it.

There’ll be a longer post on this topic soon. Suffice to say, I’ve spent the last month evaluating the performance of 5 predictive tools that are available on the web. To do this, a test dataset of 200 or so sequences had to be submitted to each one. Each tool generates a score for particular residues in the sequence. The final output, which is what I require to do some statistical analysis, looks something like this:

P08153  114     method     61.74   0
P08153  522     method     82.10   1

where we have a sequence UniProt accession number, a sequence position, the name of the tool used (method), a score and either 1 (a positive instance) or 0 (a negative instance).

Doesn’t look too hard, does it? Except that:

  • None of the web servers provide web services or APIs
  • None of them provide standalone software for download
  • Most of them don’t generate easily-parsed output (delimited plain text)
  • Most of them have limited batch upload and processing capabilities

The solution, as always, is to hack together several hundred lines of poorly-written Perl (using HTML::Form in my case) to send each sequence to the server, grab the HTML that comes back, parse it and write out text files in the format shown above.

That’s 3-4 weeks and 500 lines of throwaway code just to get the raw data in the right state for analysis

When I started out in bioinformatics, I used to joke that at least 50% of my time was spent just obtaining raw data and formatting files. Over the years, I’ve revised my estimate. It’s currently at around 80-90% and I’m not sure that it’s still a joke.

Why is this trend in the wrong direction? When does it become untenable? I’m starting to think that my job title should be “data munger”, not “research officer”. I wouldn’t mind if data munging was perceived as a skill in academia but when funding is results-based, it will only ever be seen as the means to an end. Which it is, of course.