A new low in “databases”: the PDF

I’ve had a half-formed, but not very interesting, blog post in my head for some months now. It’s about a conversation I had with a PhD student around 10 years ago, after she went to a bioinformatics talk titled “Excel is not a database”, and how she laughed because I’d been telling her that “for years already”. That’s basically the post so, as I say, not that interesting, except as an illustration that we’ve been talking about this stuff for a long time (and little has changed).

HEp-2 or not HEp2?

Anyway, we have something better. I was exploring PubMed Commons, which is becoming a very good resource. The top-featured comment looks very interesting (see image, right).

Intrigued, I went to investigate the Database of Cross-contaminated or Misidentified Cell Lines, hovered over the download link and saw that it’s – wait for it – a PDF. I’ll say that again. The “database” is a PDF.

The sad thing is that this looks like very useful, interesting information which I’m sure would be used widely if presented in an appropriate (open) format and better-publicised. Please, biological science, stop embarrassing yourself. If you don’t know how to do data properly, talk to someone who does.

11 thoughts on “A new low in “databases”: the PDF”

  1. I am curious: in what format would you offer the data? I am not sure which is best. My thoughts so far:
    – The set contains ~500 records. The majority of users would just look up ‘their’ cell lines.
    – A ‘real’ database would be overkill, I suppose, and too hard to handle for most biologists. Sure, it would be nice to have the data in a computer-accessible form, but I think it would not fit the use case for most researchers.
    – Some structured flat file: you can write a parser with relatively little effort, but non-CS people will have trouble finding their information.
    – A PDF file: NOT a database. But it fits the use case: every researcher can open it and search for their cell lines. Of course, it’s a real pain, if not impossible, to get the data out.
    – An Excel/spreadsheet file: this is not a database either. At least it gives some tabular structure to the data. It is widely accepted as a file format for data and almost everyone can open/handle it, but it’s a proprietary file format (at least MS Office Excel). I’m not happy with it, but it looks like a good trade-off.
    – Other alternatives, e.g. a web server? Not worth the effort for only 500 records.
    My conclusion so far: PDF is not a bad option. And if you offer the data in a second, structured format (some text, XML, CSV, JSON, …), everyone will be happy. Of course, technically you shouldn’t call it a ‘database’ then. But since the dataset is fairly small, I doubt that a database would be the best option.

    • What Andrés said :) My preference is always delimited plain text (CSV, TSV). That allows users many options: open in text editor, process using shell tools, import to a spreadsheet or a database table. There are 2 issues here really. One is that PDF is not a good way to distribute usable data. The other is that when you claim to provide a database, the last thing any sane person expects to find is a PDF.
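To illustrate the point about delimited text, here is a minimal sketch of the two uses mentioned in the reply: a scripted lookup and an import into a database table. The column names and sample rows are hypothetical placeholders, not taken from the actual cell line register, and the helper names (`load_records`, `lookup`, `to_sqlite`) are invented for this example.

```python
import csv
import io
import sqlite3

# Hypothetical sample rows; column names and values are illustrative only,
# not taken from the real register.
SAMPLE_CSV = """cell_line,claimed_origin,actual_origin
LINE-A,human liver,human cervix
LINE-B,human kidney,mouse fibroblast
LINE-C,human lung,human cervix
"""


def load_records(text):
    """Parse delimited text into a dict keyed by upper-cased cell line name."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["cell_line"].upper(): row for row in reader}


def lookup(records, names):
    """Return the matching record (or None) for each queried name."""
    return {name: records.get(name.upper()) for name in names}


def to_sqlite(records):
    """Load the same records into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cell_lines ("
        "cell_line TEXT PRIMARY KEY, claimed_origin TEXT, actual_origin TEXT)"
    )
    conn.executemany(
        "INSERT INTO cell_lines VALUES (:cell_line, :claimed_origin, :actual_origin)",
        list(records.values()),
    )
    return conn


records = load_records(SAMPLE_CSV)

# Case-insensitive lookup of several names at once.
hits = lookup(records, ["line-b", "LINE-Z"])

# The same data queried as a database table.
conn = to_sqlite(records)
count = conn.execute(
    "SELECT COUNT(*) FROM cell_lines WHERE actual_origin = ?",
    ("human cervix",),
).fetchone()[0]
```

With a real file, `io.StringIO(text)` would be replaced by an open file handle. None of this is possible against a PDF without first scraping the table out of it.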

  2. Plain text and/or CSV should be the basic format available. Anyone who can look up a couple of cell lines in a PDF file is also able to do it in a plain text file (in Notepad or whatever their system uses to open txt files) or, even better, to import it into any spreadsheet. OK, there are only 500 records, but your query could be huge, and allowing some automation is crucial. Of course, offering an additional PDF file for those old-school guys who need to print on paper to find their lines is OK. But I agree with the fundamentals of the post: biologists need to think harder about how our data is going to be used.
    PS. Nice to have found this blog!

  3. I am also amazed at the number of authors that call their web application or a static website a “web service”. Let’s make this clear: if users can’t access your data programmatically it’s not a service.

    • Best not to expect too much from Nature, Science et al. My all-time favourite is the Neanderthal genome paper in Science, which features a supplemental PDF running to 175 pages.

  4. Sorry to say, but life sciences [*] in general have a bunch of very fundamental problems which make it next to impossible to deal with data properly.

    You all know it. We need to publish fast, and don’t have time to care about the data. Actually, with increasing technical literacy, some groups have developed databases, but in general, every PhD student has his own little stack of data tables. Which are quite probably lost if they leave academia. Some journals now encourage that you transmit your data to Dryad, or suchalike, but since competition is hard (other groups also need to publish fast…), you would probably never transmit your whole database – just an excerpt, maybe as a large table.

    Also, there is rarely a dedicated biometrician, and even less frequently a database admin (i.e. somebody with a technical background in databases), in any of the working groups I have come to know more closely. We teach ourselves MS Access, and probably some MySQL (I gave that up after a short try, because no one else was interested and I had other things to do). We teach ourselves R. We teach ourselves everything, basically. On the job. Some people excel at the technical background, but they are rarely the high-impact, high-frequency publishing ones. The ones who care about databases etc. often leave academia because, well, there are no positions for people who spend so much time on databases and not on publishing.

    Some research institutions understood that problem already and have created support positions. But most European universities still can’t, because, you know: funding.

    I don’t see much of a change, and I’ve been in academia since the late 1990s.

    [*] Please mind the slight difference of “life sciences” to “biological sciences”, as medical sciences are probably something quite distinct from most disciplines in the field of biology, both in matters of )

    Having said all this, it is still an abomination that in 2014, people are publishing a PDF as a “database”.

    • All true and terribly depressing. Well said. In my darker moments I do fear that we are living in an age where due to poor practices, much of life science is just garbage.

      • Amen. In addition to the problems outlined above, I encountered further problems when working with existing “manually curated” biological databases.

        I worked with a major database (ChEMBL) which was supposed to contain meaningful information on compounds, targets, and associated bioactivities. What I found was a complete hodgepodge of data. Targets are vaguely defined and can mean anything from single molecules to whole organisms. Bioactivities have units in % or, my personal favorite, the single letter (i.e., “t”).

        Even when biological data is “manually curated” the quality can be extremely poor and I fear that this is the rule, rather than the exception.
        It’s quite an unfortunate situation.
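A basic sanity check for the kind of unit problem described above might look like the following sketch. The whitelist, field names, and sample records are all hypothetical and are not drawn from ChEMBL or its schema; the only point is that such checks are cheap to automate once the data is in a structured form.

```python
# Hypothetical whitelist of acceptable unit strings; a real one would have
# to come from the database curators.
ALLOWED_UNITS = {"nM", "uM", "mM"}


def flag_suspect_units(records, allowed=ALLOWED_UNITS):
    """Return records whose 'units' field is missing, empty, or not whitelisted."""
    return [r for r in records if (r.get("units") or "").strip() not in allowed]


# Illustrative records only; the field names are assumptions.
sample = [
    {"compound": "cmpd-1", "value": 12.0, "units": "nM"},
    {"compound": "cmpd-2", "value": 40.0, "units": "t"},
    {"compound": "cmpd-3", "value": 3.5, "units": ""},
]

suspect = flag_suspect_units(sample)
suspect_names = [r["compound"] for r in suspect]
```

Here the records with the unit `t` and with no unit at all are the ones flagged for a curator to review.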
