A recent question at BioStar asked “Is the NAR database list available in a computer readable format?” The short answer is “no” and Pierre has done some excellent preliminary work to address the issue.
I’ve been working on a database and web application to check the associated URLs but quite frankly, this is tedious, a waste of everyone’s time and could be entirely avoided if the publishing industry did a better job. All that’s required is that either NAR or PubMed provide structured data – XML, Medline format, I don’t care what – containing a field that looks something like this:
URL http://a.valid.url.goes.here
That way, we could all avoid writing regular expressions to detect URLs in abstracts. No wait – to detect broken URLs in abstracts. You would not believe how many of them look like this:
URL http://www.amaze.ulb. ac.be/ ^
Someone helpfully informed me via Twitter that this is “often a result of typesetting.” Thanks for that.
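For what it's worth, here is a minimal sketch in Python of the kind of regular expression work described above. It is an illustration under stated assumptions, not the code behind the actual application: the names `URL_RE`, `CONTINUATION_RE` and `find_urls` are hypothetical, and the "broken URL" test is only a rough heuristic (a URL that ends in a dot or hyphen and is followed by a lowercase fragment that looks like the rest of a hostname or path).

```python
import re

# Assumption: URLs of interest start with http, https or ftp and run
# until the next whitespace, quote or angle bracket.
URL_RE = re.compile(r"(?:https?|ftp)://[^\s<>\"']+", re.IGNORECASE)

# Heuristic (an assumption, not a standard): whitespace followed by a
# lowercase token containing "." or "/" looks like the tail of a URL
# that was split during typesetting.
CONTINUATION_RE = re.compile(r"\s+([a-z0-9][^\s]*[./][^\s]*)")


def find_urls(text):
    """Return (url, suspect) pairs; suspect=True means 'probably split'."""
    results = []
    for m in URL_RE.finditer(text):
        raw = m.group(0)
        # Look at what immediately follows the matched URL.
        cont = CONTINUATION_RE.match(text, m.end())
        suspect = raw[-1] in ".-" and cont is not None
        # For URLs that look intact, drop trailing sentence punctuation.
        url = raw if suspect else raw.rstrip(".,;:)")
        results.append((url, suspect))
    return results


abstract = ("The database is at http://www.amaze.ulb. ac.be/ "
            "and a mirror is at http://example.org/db.")
for url, suspect in find_urls(abstract):
    print(url, "(possibly broken)" if suspect else "")
```

A heuristic like this will miss plenty of cases and flag some false positives, which is rather the point: a dedicated URL field in the record would make all of it unnecessary.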
Hi.
Could you publish stats on the number of badly formatted URLs found in PubMed abstracts? That could also be informative.
P.
I’m working on it, but figuring out which URLs are invalid is so incredibly tedious that I’m tempted to abandon the PubMed approach altogether. There’s a reason why this stuff does not get analysed and sorted out – it makes people so mad, they just can’t continue…
The blank in the URL is located where there was a carriage return in the PDF!!
…
It’s because they don’t get the whole idea of computer-readable data period, and imagine that humans are the only readers of papers. It’s the same with data included in papers — how many times have you downloaded the data only to find that it is a PDF and not a tab-delimited text file or similar? Sure you can copy and paste from the PDF and try to reconstruct something useable, but why?
Yes, you have summarised the issue very nicely – this is the heart of the problem. I struggle to see how (biological) science can ever embrace “big data” if we don’t make the data available in a form that can be parsed.