This bioinformatician, at least. Hate is a strong word. Perhaps “dislike” is better.
Short answer: because you can’t get data out of them easily, if at all. Longer answer:
If I still had time for fun “side projects”, I’d be interested in the newly-sequenced genomes of Pandoraviruses. In particular, I’d be somewhat suspicious regarding the very high proportion of hypothetical and often short ORFs in their genomes. I learn from the publication that 210 putative proteins from P. salinus were validated using mass spectrometry. I wonder: which of those correspond to the short, hypothetical ORF products?
Note – I don’t mean to single out this article for particular criticism. It just provides a good, recent example of the issues that affect almost every journal article, in the eyes of people who care about data.
Problem 1: the supplementary data
For reasons which elude me, the favoured form of supplementary data is still a huge PDF file. PDFs were designed to be printed out and read by humans. Why anyone still believes that they’re suitable as containers for raw data is one of the great mysteries of science.
This article is no exception. That said, it does at least describe the computational methods used in reasonable detail.
I struggle with the entire notion of data as “supplemental”. Presumably, it still means “not considered sufficiently important to occupy page space.” To a bioinformatician of course, the data are far more important than the prosaic description in the journal article. As for page space – hello? The Web? Anyone?
Problem 2: the arbitrary nature of what data are included
Proteins identified using mass spectrometry are listed in table S2 of the supplementary data. Except where they are not. The authors have chosen to list only those proteins (42/210) which also have a significant BLAST match in the NR database. So the answer to my question (which of the validated proteins correspond to short ORFs annotated as “hypothetical”?) remains “I don’t know.”
I’d argue that if data are cited in the main article:
The ion-mass data were interpreted in reference to a database that includes the A. castellanii (27) and the P. salinus predicted protein sequences. A total of 266 proteins were identified on the basis of at least two different peptides. Fifty-six of them corresponded to A. castellanii proteins presumably associated with the P. salinus particles, and 210 corresponded to predicted P. salinus CDSs.
then those same data – in this case, the complete lists of identified proteins – need to be made available.
Problem 3: naming things
Staying with table S2, we see one of the great problems in computation – naming things. Column 2 is headed “ORF #” and contains an integer: 400, 258, 356…
What are these numbers? Are they supposed to help us retrieve the ORF in question? We can grab the GenBank file of P. salinus ORFs from the NCBI and look at the annotations. How about RNA polymerase Rpb5, which table S2 tells us is “ORF # 650”?
CDS             627452..628276
                /locus_tag="O182_gp0647"
                /old_locus_tag="ps_650"
                /codon_start=1
                /product="RNA polymerase Rpb5"
                /protein_id="YP_008437064.1"
                /db_xref="GI:531035434"
                /db_xref="GeneID:16605784"
A few other examples match up too: looks like “ORF #” has become “old_locus_tag” in the database deposition.
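For anyone who wants to check that correspondence beyond a few examples, here is a minimal sketch in Python. It assumes that the “ORF #” in table S2 is simply the integer after the “ps_” prefix of old_locus_tag, which is my guess from the examples and not something the article states. The function names are mine, and the regular expression is a rough stand-in for a real GenBank parser.

```python
import re

# The CDS feature quoted above, laid out as in a GenBank flat file.
FEATURE = """\
     CDS             627452..628276
                     /locus_tag="O182_gp0647"
                     /old_locus_tag="ps_650"
                     /codon_start=1
                     /product="RNA polymerase Rpb5"
                     /protein_id="YP_008437064.1"
                     /db_xref="GI:531035434"
                     /db_xref="GeneID:16605784"
"""

# Matches /name="quoted value" or /name=bare_value.
QUALIFIER = re.compile(r'/(\w+)=(?:"([^"]*)"|(\S+))')


def parse_qualifiers(feature_text):
    """Return {qualifier name: [values]} for one feature (hypothetical helper)."""
    qualifiers = {}
    for name, quoted, bare in QUALIFIER.findall(feature_text):
        qualifiers.setdefault(name, []).append(quoted if quoted else bare)
    return qualifiers


def orf_number(qualifiers, prefix="ps_"):
    """Guess the table S2 'ORF #' from old_locus_tag; None if it is absent.

    Assumption: 'ORF #' equals the integer following the 'ps_' prefix.
    """
    for tag in qualifiers.get("old_locus_tag", []):
        suffix = tag[len(prefix):]
        if tag.startswith(prefix) and suffix.isdigit():
            return int(suffix)
    return None


quals = parse_qualifiers(FEATURE)
print(orf_number(quals), quals["locus_tag"][0], quals["protein_id"][0])
# → 650 O182_gp0647 YP_008437064.1
```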
Great! Unless we download the ORFs in FASTA format, where old_locus_tag is not found in the headers:
>lcl|NC_022098.1_cdsid_YP_008437064.1 [gene=O182_gp0647] [protein=RNA polymerase Rpb5] [protein_id=YP_008437064.1] [location=627452..628276]
Or the proteins in FASTA format…where it seems that headers contain the old locus tag for proteins annotated as hypothetical, but no tag at all where proteins are named. And old locus tags apparently run in reverse order through the file:
>gi|531035435|ref|YP_008437065.1| hypothetical protein ps_651 [Pandoravirus salinus]
>gi|531035434|ref|YP_008437064.1| RNA polymerase Rpb5 [Pandoravirus salinus]
>gi|531035433|ref|YP_008437063.1| hypothetical protein ps_649 [Pandoravirus salinus]
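To make the inconsistency concrete, here is a rough Python sketch that pulls the accession and, where present, the old locus tag out of the three protein headers above. The header pattern is inferred from these three lines only, so treat it as an assumption and not a general parser for NCBI FASTA headers; the function name is mine.

```python
import re

HEADERS = [
    ">gi|531035435|ref|YP_008437065.1| hypothetical protein ps_651 [Pandoravirus salinus]",
    ">gi|531035434|ref|YP_008437064.1| RNA polymerase Rpb5 [Pandoravirus salinus]",
    ">gi|531035433|ref|YP_008437063.1| hypothetical protein ps_649 [Pandoravirus salinus]",
]

# Assumed layout: >gi|<number>|ref|<accession>| <description> [<organism>]
HEADER = re.compile(r"^>gi\|(\d+)\|ref\|([^|]+)\|\s*(.*?)\s*\[([^\]]+)\]\s*$")
OLD_TAG = re.compile(r"\bps_(\d+)\b")


def parse_header(line):
    """Split one protein FASTA header into fields (hypothetical helper)."""
    match = HEADER.match(line)
    if match is None:
        raise ValueError(f"unrecognised header: {line!r}")
    gi, accession, description, organism = match.groups()
    tag = OLD_TAG.search(description)
    return {
        "gi": int(gi),
        "protein_id": accession,
        "description": description,
        "organism": organism,
        # None when the description carries no ps_ tag (named proteins).
        "orf": int(tag.group(1)) if tag else None,
    }


rows = [parse_header(h) for h in HEADERS]
for row in rows:
    print(row["protein_id"], row["orf"], row["description"])
```

The named protein comes back with no ORF number at all. The one identifier that appears in all three formats quoted here is the protein_id (YP_008437064.1 for Rpb5), so that is probably the safest key for joining them, even though it is not the one used in table S2.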
The problem here is that database deposition and annotation are completely separate processes from article writing and submission. Were this not the case, confusing inconsistencies between what you read in the article and what you see in the data could be avoided.
If you publish a genome, I should be able to derive all of the information that you derived. Ideally, I would be able to do so using raw data deposited alongside your publication, rather than digging around in databases and piecing together what you did after the fact.
And that is why I – and I’d say most bioinformaticians – hate the traditional journal article.