Why can’t PubMed or academic journals get the basics right?

A recent question at BioStar asked “Is the NAR database list available in a computer readable format?” The short answer is “no” and Pierre has done some excellent preliminary work to address the issue.

I’ve been working on a database and web application to check the associated URLs but quite frankly, this is tedious, a waste of everyone’s time and could be entirely avoided if the publishing industry did a better job. All that’s required is that either NAR or PubMed provide structured data – XML, Medline format, I don’t care what – containing a field that looks something like this:

URL    http://a.valid.url.goes.here

That way, we could all avoid writing regular expressions to detect URLs in abstracts. No wait – to detect broken URLs in abstracts. You would not believe how many of them look like this:

URL    http://www.amaze.ulb. ac.be/

Someone helpfully informed me via Twitter that this is “often a result of typesetting.” Thanks for that.
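To give a flavour of the workaround, here is a minimal Python sketch of the kind of regular-expression check described above. It is not the code behind the web application mentioned earlier. The names `URL_RE`, `SPLIT_RE` and `find_urls` are made up for illustration, and the rule for spotting a URL split by typesetting (a URL ending in a dot, then whitespace, then something that looks like the rest of a hostname with a path) is an assumed heuristic that will miss cases and may occasionally misfire.

```python
import re

# Matches http(s)/ftp URLs up to the first whitespace character.
URL_RE = re.compile(r"(?:https?|ftp)://\S+", re.IGNORECASE)

# Assumed heuristic, not a standard: a URL fragment ending in a dot, followed
# by whitespace and a hostname-like remainder containing a slash, was probably
# one URL before typesetting split it (e.g. "http://www.amaze.ulb. ac.be/").
SPLIT_RE = re.compile(
    r"((?:https?|ftp)://\S*\.)\s+([a-z0-9-]+(?:\.[a-z0-9-]+)*/\S*)",
    re.IGNORECASE,
)


def find_urls(abstract):
    """Return a list of (url, was_repaired) pairs found in an abstract."""
    results = []
    repaired_spans = []

    # First pass: rejoin URLs that look as if they were split by a stray space.
    for match in SPLIT_RE.finditer(abstract):
        results.append((match.group(1) + match.group(2), True))
        repaired_spans.append(match.span())

    # Second pass: collect ordinary URLs, skipping fragments already repaired.
    for match in URL_RE.finditer(abstract):
        if any(start <= match.start() < end for start, end in repaired_spans):
            continue
        # Drop sentence punctuation that the greedy \S+ picked up.
        results.append((match.group(0).rstrip(".,;)"), False))

    return results


if __name__ == "__main__":
    # The second URL is a placeholder, not a real resource.
    text = ("Available at http://www.amaze.ulb. ac.be/ and "
            "mirrored at http://example.org/db.")
    for url, was_repaired in find_urls(text):
        print(url, "(repaired)" if was_repaired else "")
```

A repaired URL is only a guess, so it would still need to be requested to confirm that it resolves. None of this would be necessary if the URL arrived in its own field.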

6 thoughts on “Why can’t PubMed or academic journals get the basics right?”

    • I’m working on it, but figuring out which URLs are invalid is so incredibly tedious that I’m tempted to abandon the PubMed approach altogether. There’s a reason why this stuff does not get analysed and sorted out – it makes people so mad, they just can’t continue…

  2. It’s because they don’t get the whole idea of computer-readable data, period, and imagine that humans are the only readers of papers. It’s the same with data included in papers — how many times have you downloaded the data only to find that it is a PDF and not a tab-delimited text file or similar? Sure, you can copy and paste from the PDF and try to reconstruct something usable, but why?

    • Yes, you have summarised the issue very nicely – this is the heart of the problem. I struggle to see how (biological) science can ever embrace “big data” if we don’t make the data available in a form that can be parsed.

