Why can’t PubMed or academic journals get the basics right?

A recent question at BioStar asked “Is the NAR database list available in a computer readable format?” The short answer is “no” and Pierre has done some excellent preliminary work to address the issue.

I’ve been working on a database and web application to check the associated URLs but quite frankly, this is tedious, a waste of everyone’s time and could be entirely avoided if the publishing industry did a better job. All that’s required is that either NAR or PubMed provide structured data – XML, Medline format, I don’t care what – containing a field that looks something like this:

URL    http://a.valid.url.goes.here

That way, we could all avoid writing regular expressions to detect URLs in abstracts. No wait – to detect broken URLs in abstracts. You would not believe how many of them look like this:

URL    http://www.amaze.ulb. ac.be/
                            ^
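
For anyone stuck doing the same thing, here is a minimal sketch of the kind of regular expression work involved. It is a heuristic, not a complete solution: the function name `find_urls` and both patterns are my own illustration. It assumes a "broken" URL is one that ends in a dot, followed by whitespace and then more host-like text, as in the example above.

```python
import re

# Matches http/https URLs up to the first whitespace character.
URL_RE = re.compile(r"https?://\S+")

# Heuristic for a URL split by a stray space after a dot,
# e.g. "http://www.amaze.ulb. ac.be/": group 1 is the part before
# the gap, group 2 is the host-like remainder after it.
SPLIT_RE = re.compile(
    r"(https?://\S+\.)\s+([a-z0-9-]+(?:\.[a-z0-9-]+)+/?\S*)",
    re.IGNORECASE,
)

def find_urls(text):
    """Return (url, was_split) pairs for each URL found in text.

    URLs that look split by a stray space are rejoined and flagged.
    """
    found = []
    for m in URL_RE.finditer(text):
        split = SPLIT_RE.match(text, m.start())
        if split:
            found.append((split.group(1) + split.group(2), True))
        else:
            # Drop trailing sentence punctuation from intact URLs.
            found.append((m.group(0).rstrip(".,;)"), False))
    return found

abstract = "The database is available at http://www.amaze.ulb. ac.be/ for all users."
print(find_urls(abstract))
```

This will produce false positives (a sentence that ends with a URL and is followed by another bare hostname, for instance), so each rejoined URL still needs checking by hand or by an HTTP request.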

Someone helpfully informed me via Twitter that this is “often a result of typesetting.” Thanks for that.

6 thoughts on “Why can’t PubMed or academic journals get the basics right?”

  1. Pingback: Twenty million papers in PubMed: a triumph or a tragedy? « O'Really?

    1. nsaunders Post author

      I’m working on it, but figuring out which URLs are invalid is so incredibly tedious that I’m tempted to abandon the PubMed approach altogether. There’s a reason why this stuff does not get analysed and sorted out – it makes people so mad, they just can’t continue…

  2. Jonathan Badger

    It’s because they don’t get the whole idea of computer-readable data, period, and imagine that humans are the only readers of papers. It’s the same with data included in papers – how many times have you downloaded the data only to find that it is a PDF and not a tab-delimited text file or similar? Sure, you can copy and paste from the PDF and try to reconstruct something useable, but why?

    1. nsaunders Post author

      Yes, you have summarised the issue very nicely – this is the heart of the problem. I struggle to see how (biological) science can ever embrace “big data” if we don’t make the data available in a form that can be parsed.
