Lists of URLs are so 1990s

Subtitle: “Why some projects are not worth your valuable time and skills.”

Let’s wrap up this exploration of how to extract URLs associated with NAR Database articles. I’m tempted to start with the summary: don’t bother – just Google it. If you want that, skip to the end.

First: forget PubMed. This query:

"Nucleic Acids Res"[JOUR] "Database issue"[ISS]

is all well and good, except that between 1998 and 2004 the NAR Database Issue was not named “Database issue”; it just had a volume number, like the other issues.
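If you want to see the size of the gap, something like this rough sketch using BioRuby’s Bio::PubMed.esearch will count the hits for the query above. The Bio::NCBI.default_email setting and the retmax option are assumptions about your BioRuby and NCBI setup, and the email address is a placeholder:

#!/usr/bin/ruby
# Rough sketch: count PubMed records matching the query above via BioRuby's
# E-utilities wrapper. It only finds issues actually tagged "Database issue",
# so 1998-2004 is invisible to it.

require "rubygems"
require "bio"

Bio::NCBI.default_email = "you@example.com"  # placeholder; NCBI wants a real address

query = '"Nucleic Acids Res"[JOUR] "Database issue"[ISS]'
pmids = Bio::PubMed.esearch(query, "retmax" => 5000)  # returns an array of PMIDs
puts "#{pmids.count} records tagged as Database issue"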

If you insist on trying to extract URLs from PubMed abstracts, be prepared for frustration. The URLs appear in every valid form imaginable (with or without a leading “http://”) and in plenty of invalid ones (broken up by spaces, missing forward slashes). About the most reliable regular expression is:

/(\w+\.){1,}(\w+)/

This says “find words connected by periods”; after that, you can throw away “http://S.cerevisae”, “http://i.e.”, “http://v2.1”… you get the picture.
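For illustration, here is a quick sketch of what that regex pulls out of some invented abstract text. The sample sentence is mine, and scan uses a non-capturing version of the regex so that it returns whole matches:

#!/usr/bin/ruby
# Illustration only: run the "words connected by periods" regex over some
# invented abstract text and look at what comes back.

abstract = "Freely available at www.example.org/mydb (i.e. release v2.1), " +
           "the database stores S.cerevisiae ORFs."

candidates = abstract.scan(/(?:\w+\.)+\w+/)
puts candidates.inspect
# => ["www.example.org", "i.e", "v2.1", "S.cerevisiae"]
# one genuine hostname, three things to throw away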

No, I’m afraid your best starting point is NAR’s own summary page, which links to alphabetical, category and “by paper” lists. Downloadable as a file? Tell him he’s dreaming. You, my friend, are going to scrape web pages. Yes, you heard that right – web scraping is the better option. Bring on the code:

#!/usr/bin/ruby

require "rubygems"
require "mechanize"
require "bio"
require "logger"
require "net/http"  # used by parse_response, shown below
require "uri"

# note: parse_date and parse_response (shown further down) need to be defined
# before this loop runs if everything lives in one script

agent = Mechanize.new
base  = "http://www.oxfordjournals.org"
url   = base + "/nar/database/cap/"
log   = Logger.new('urls.log')
page  = agent.get(url)

# each database entry links to its own summary page
page.links_with(:href => /\/nar\/database\/summary\//).each do |link|
  title        = link.text
  summary      = base + link.uri.to_s
  summary_page = agent.get(summary)
  # the database URL is the first link in the summary page's body text
  db_url       = summary_page.search("//div[@class='bodytext']/a").first.values.first
  date         = parse_date(agent, summary_page)
  resp         = parse_response(db_url)
  log.debug "#{title},#{db_url},#{date}," + resp
end

The parse_response method needs to handle all manner of errors, from Net::HTTP and elsewhere:

# Fetch the URL and return "code,message"; if the request blows up, return the
# exception class and message instead. NoMethodError covers the entries that are
# really FTP sites or email addresses, which Net::HTTP cannot fetch.
def parse_response(url)
  begin
    r = Net::HTTP.get_response(URI.parse(url))
    m = "#{r.code},#{r.message}"
  rescue NoMethodError, SocketError, URI::InvalidURIError, TimeoutError,
    Net::HTTPBadResponse, Errno::ETIMEDOUT, Errno::ECONNRESET,
    Errno::ECONNREFUSED, Errno::ENETUNREACH, Errno::EHOSTUNREACH, EOFError => e
    m = "#{e.class.to_s},#{e.message}"
  end
  return m
end

Getting the year and the PMID? Sometimes the year appears on the NAR abstract page, sometimes not. Sometimes the NAR abstract page does not even exist. Best just to follow the PMID to PubMed and pull the publication year from the Medline record:

# Follow the summary page's link to the NAR abstract, find the PubMed link on
# that page, extract the PMID and pull the publication year from the MEDLINE
# record. Returns "year,pmid"; either field may be empty.
def parse_date(agent, page)
  date = ""
  pmid = ""
  abstract = page.links_with(:href => /\/cgi\/content\/abstract\//)
  if abstract.count > 0
    begin
      doc  = agent.get(abstract.first.uri.to_s)
      pm   = doc.links_with(:href => /\/external-ref\?access_num=(\d+)&link_type=PUBMED/)
      if pm.count > 0
        if pm.first.uri.to_s =~ /access_num=(\d+)&/
          pmid = $1
          med  = Bio::MEDLINE.new(Bio::PubMed.pmfetch(pmid))
          date = med.year
        end
      end
    rescue Mechanize::ResponseCodeError => e
      # some abstract pages simply don't exist; return whatever we have
      return "#{date},#{pmid}"
    end
  end
  return "#{date},#{pmid}"
end

And when that’s done, several hours later, you may, if you’re lucky, be able to pull output from the log file that looks like this:

DDBJ - DNA Data Bank of Japan,http://www.ddbj.nig.ac.jp,2011,21062814,200,OK
EBI patent sequences,http://www.ebi.ac.uk/patentdata/nr/,2010,19884134,200,OK
European Nucleotide Archive,http://www.ebi.ac.uk/ena/,2011,20972220,200,OK
GenBank®,http://www.ncbi.nlm.nih.gov/,2011,21071399,200,OK
ACLAME - A Classification of Mobile genetic Elements,http://aclame.ulb.ac.be/,2010,19933762,200,OK
AREsite,http://rna.tbi.univie.ac.at/AREsite,2011,21071424,301,Moved Permanently
...

Except that some of the titles will contain commas or quotes, and three of the “URLs” will return “NoMethodError”, because they are either FTP sites or email addresses. Once you’re done with the manual editing, you can make your CSV file publicly available.
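In hindsight, much of that manual editing could be avoided by letting Ruby’s standard CSV library do the quoting as each record is written, rather than joining fields with commas by hand. A rough sketch, assuming the year, PMID, response code and message are available as separate variables rather than the pre-joined strings used in the loop above:

require "csv"

# inside the scraping loop, instead of log.debug "#{title},#{url},..."
CSV.open("urls.csv", "a") do |csv|
  # CSV quotes as needed, so titles containing commas or quotes survive intact
  csv << [title, url, year, pmid, code, message]
end

Either way, once you have a CSV file you could do some statistics in R: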

urls <- read.table("urls.csv", header = T, stringsAsFactors = F, sep = ",", comment.char = "", quote = "")
as.data.frame(table(urls$code))
                   Var1 Freq
1                   200  941
2                   220    1
3                   301  271
4                   302  185
5                   307    4
6                   400    5
7                   403    3
8                   404   68
9                   500    2
10                  502    1
11                  503    4
12                  504    3
13          ENETUNREACH    1
14             EOFError    2
15  Errno::ECONNREFUSED    8
16    Errno::ECONNRESET    1
17  Errno::EHOSTUNREACH   16
18     Errno::ETIMEDOUT   52
19 Net::HTTPBadResponse    5
20          SocketError   33
21       Timeout::Error    7
22 URI::InvalidURIError    1

Updated statistics? You’ll have to go through all that again.

Next you might create a database, throw a web interface on top and release your application to the world. That was my original intention, until I had an epiphany:

None of this is really very useful at all
I tweeted, in frustration:

So very sick of the “NAR Database Issue URLs project”. Know what? Search Google with keywords+”database” instead. You’ll get there quicker.

And when the frustration wore off, I realised that, half-joking though I was, I was right.

With all this effort, what do we have? A big list of URLs: many of which work, some of which don’t, and most of which are probably irrelevant to whatever interests you right now. Tomorrow, those that currently work may not, and vice versa. This is the nature of the Web: a dynamic ecosystem where resources come and go, and may not be of much use even when they are accessible.

You want a database? Use search, with appropriate keywords + “database”.

Big lists of URLs are for dummies. Use search instead.