Why bioinformaticians hate the “traditional journal article”

This bioinformatician, at least. Hate is a strong word. Perhaps “dislike” is better.

Short answer: because you can’t get data out of them easily, if at all. Longer answer:

If I still had time for fun “side projects”[1], I’d be interested in the newly-sequenced genomes of Pandoraviruses. In particular, I’d be somewhat suspicious regarding the very high proportion of hypothetical and often short ORFs in their genomes. I learn from the publication that 210 putative proteins from P. salinus were validated using mass spectrometry. I wonder: which of those correspond to the short, hypothetical ORF products?

Note – I don’t mean to single out this article for particular criticism. It just provides a good, recent example of the issues that affect almost every journal article, in the eyes of people who care about data.

Problem 1: the supplementary data
For reasons which elude me, the favoured form of supplementary data is still a huge PDF file. PDFs were designed to be printed out and read by humans. Why anyone still believes that they’re suitable as containers for raw data is one of the great mysteries of science.

This article is no exception. That said, it does at least describe the computational methods used in reasonable detail.

I struggle with the entire notion of data as “supplemental”. Presumably, it still means “not considered sufficiently important to occupy page space.” To a bioinformatician of course, the data are far more important than the prosaic description in the journal article. As for page space – hello? The Web? Anyone?

Problem 2: the arbitrary nature of what data are included
Proteins identified using mass spectrometry are listed in table S2 of the supplementary data. Except where they are not. The authors have chosen to list only those proteins (42/210) which also have a significant BLAST match in the NR database. So the answer to my question – which of the validated proteins correspond to short ORFs annotated as “hypothetical”? – remains “I don’t know.”

I’d argue that if data are cited in the main article:

The ion-mass data were interpreted in reference to a database that includes the A. castellanii (27) and the P. salinus predicted protein sequences. A total of 266 proteins were identified on the basis of at least two different peptides. Fifty-six of them corresponded to A. castellanii proteins presumably associated with the P. salinus particles, and 210 corresponded to predicted P. salinus CDSs.

then that same data – in this case, the complete lists of identified proteins – needs to be made available.

Problem 3: naming things
Staying with table S2, we see one of the great problems in computation – naming things. Column 2 is headed “ORF #” and contains an integer: 400, 258, 356…

What are these numbers? Are they supposed to help us retrieve the ORF in question? We can grab the Genbank file of P. salinus ORFs from the NCBI and look at the annotations. How about RNA polymerase Rpb5, which table S2 tells us is “ORF # 650”?

     CDS             627452..628276
                     /product="RNA polymerase Rpb5"

A few other examples match up too: looks like “ORF #” has become “old_locus_tag” in the database deposition.
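If “ORF #” really has become old_locus_tag, the lookup can at least be scripted. Here’s a minimal sketch in plain Python (regex rather than Biopython), run against a toy one-feature snippet standing in for the real flat file; the /old_locus_tag value of "650" is my assumption based on the apparent correspondence, not something quoted in the article:

```python
import re

# Toy stand-in for the NC_022098 (P. salinus) GenBank feature table.
# The /old_locus_tag value is assumed from the "ORF # 650" match above.
genbank_snippet = """\
     CDS             627452..628276
                     /old_locus_tag="650"
                     /product="RNA polymerase Rpb5"
"""

def map_old_locus_to_product(text):
    """Map old_locus_tag -> product from GenBank flat-file feature text."""
    mapping = {}
    tag = product = None
    for line in text.splitlines():
        # A new CDS feature starts a new qualifier group
        if re.match(r"\s+CDS\s", line):
            if tag and product:
                mapping[tag] = product
            tag = product = None
        m = re.search(r'/old_locus_tag="([^"]+)"', line)
        if m:
            tag = m.group(1)
        m = re.search(r'/product="([^"]+)"', line)
        if m:
            product = m.group(1)
    if tag and product:  # flush the final feature
        mapping[tag] = product
    return mapping

print(map_old_locus_to_product(genbank_snippet))
# {'650': 'RNA polymerase Rpb5'}
```

With a mapping like this in hand, you could join table S2 back to the deposited annotations – assuming, of course, that the tags really do line up.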

Great! Unless we download the ORFs in FASTA format, where old_locus_tag is not found in the headers:

>lcl|NC_022098.1_cdsid_YP_008437064.1 [gene=O182_gp0647] [protein=RNA polymerase Rpb5] [protein_id=YP_008437064.1] [location=627452..628276]

Or the proteins in FASTA format…where it seems that headers contain the old locus tag for proteins annotated as hypothetical, but no tag at all where proteins are named. And old locus tags apparently run in reverse order through the file:

>gi|531035435|ref|YP_008437065.1| hypothetical protein ps_651 [Pandoravirus salinus]

>gi|531035434|ref|YP_008437064.1| RNA polymerase Rpb5 [Pandoravirus salinus]

>gi|531035433|ref|YP_008437063.1| hypothetical protein ps_649 [Pandoravirus salinus]

The problem here is that database deposition and annotation are completely separate processes from article writing and submission. Were this not the case, confusing inconsistencies between what you read and what you see in the data could be avoided.

If you publish a genome, I should be able to derive all of the information that you derived. Ideally, I would be able to do so using raw data deposited alongside your publication, rather than digging around in databases and piecing together what you did after the fact.

And that is why I – and I’d say most bioinformaticians – hate the traditional journal article.

[1] I’m a parent of a 2 year-old

10 thoughts on “Why bioinformaticians hate the ‘traditional journal article’”

  1. I always thought it was a bit disingenuous to not provide the data and analysis workflow. It’s slowly starting to turn, especially with tools like IPython and knitr.

  2. It’s not all doom and gloom though.

    A few newer journals are starting to publish proper data-integrated research publications where the data and its re-usability are strongly considered. These journals such as Biodiversity Data Journal (described in GigaScience here: http://www.gigasciencejournal.com/content/2/1/14/ ) are still few and far between but I’m glad they exist. I guess it might take a while for this technical excellence to spread to the trad journals though…?

    • Thanks for the mention Ross, but on top of the example where we worked with BDJ on a new generation of data-rich species description, for the last year GigaScience has been following exactly the integrated data/analysis/workflow approach discussed here. A nice example of this is our SOAPdenovo2 publication (http://www.gigasciencejournal.com/content/1/1/18), where the ~80GB of supporting data (http://dx.doi.org/10.5524/100038) and pipelines containing all of the necessary shell scripts and tools (http://dx.doi.org/10.5524/100044) are integrated into the paper with cited DOIs and hosted in our GigaDB server, and the workflows are also implemented in our Galaxy server http://galaxy.cbiit.cuhk.edu.hk/. There are a number of other data-focussed journals launching at the moment that link out to external data repositories, but it’s nice to see that the included hosting and mirroring we offer is potentially very useful.

  3. Pingback: Why bioinformaticians hate the "traditiona...

  4. It’s easy to say “Hello, the Web!” in regard to journal space (and I’m one of those people who when being on a Nature paper generally find my figures and data in the supplemental section, and so I don’t really like the concept either), but how could the supplemental section be eliminated in journals like Nature where there is still a print copy? Of course the idea of a “supplemental section” is absurd in web-only journals (not that this stops all of them from mimicking the idea out of habit!)

    • This will sound flippant and unrealistic, but…I honestly feel that the answer is obvious – there should not be a print copy. Or if there is, it should contain brief summaries of articles with instructions to visit the web site. From where, perhaps, users could generate customised versions to print.

  5. Pingback: Somewhere else, part 89 | Freakonometrics

  6. Pingback: Links 11/9/13 | Mike the Mad Biologist

  7. Nice to know more people who look at the data! As you seem well informed, I ask you: is there any journal that checks, before publishing, that the paper is reproducible? In the bioinformatics field it is easy to store the initial data, the programs, and the tools used, so it is easy to check (as you did with this article) how the paper was done, but I haven’t heard of any journal doing so.
    From my point of view it would be easy to check in peer review: ask whether the reviewer could reproduce the paper with the information provided; if so, accept; if not, deny publication until it is possible. (The cons I see are when web tools outside the researcher’s control are used.)
