Web scraping using Mechanize: PMID to PMCID/NIHMSID

Web services are great. Pass them a URL. Structured data comes back. Parse it, analyse it, visualise it. Done.

Web scraping – interacting programmatically with a web page – is not so great. It requires more code and when the web page changes, the code breaks. However, in the absence of a web service, scraping is better than nothing. It can even be rather satisfying. Early in my bioinformatics career the realisation that code, rather than humans, can automate the process of submitting forms and reading the results was quite a revelation.

In this post: how to interact with a web page at the NCBI using the Mechanize library.

A Biostar question:

I would like to know if I can convert PMID to NIHMSID, if there is no PMCID associated with that PMID and retrieve both of them if PMCID exists. I want to do this programmatically may be using eutils if possible. I know about this link http://www.ncbi.nlm.nih.gov/pmc/pmctopmid/ that will do the work but its not programmatically. I also tried the services by PMC http://www.ncbi.nlm.nih.gov/pmc/tools/oai/#examples but not much of success.

Some definitions. PMID = PubMed ID, an identifier for articles in the PubMed database. PMCID = PubMed Central ID, a similar identifier for articles deposited in PubMed Central. NIHMSID = NIH manuscript submission ID; articles in PubMed Central have one of these identifiers if they were submitted via the NIH manuscript submission system.

The NCBI provide the web page mentioned in the question (http://www.ncbi.nlm.nih.gov/pmc/pmctopmid/), but no documented web service, for converting PMID to PMCID/NIHMSID or PMCID to PMID.

Enter Mechanize. I know of libraries for Perl, Python and Ruby – there may be others. Here’s how to use the Ruby version to submit a form and retrieve the results.

First, load the library and fetch the web page:

require 'mechanize'

agent = Mechanize.new
page  = agent.get("http://www.ncbi.nlm.nih.gov/pmc/pmctopmid/")

If you view the page on the Web, you’ll see that it contains 2 forms: one for a general search at NCBI, the other for ID conversion. You can verify this by searching the Mechanize::Page object for forms and pretty-printing the second one:

puts page.search("form").count
# 2
pp page.search("form")[1]
# output not shown

Next, we want to set values in the form. Let’s get the second form and examine the fields:

form = page.forms[1]
pp form
#<Mechanize::Form
 {name nil}
 {method "POST"}
 {action "/pmc/pmctopmid/"}
 {fields
  [hidden:0xfa4150 type: hidden name: p$a value: ]
  [hidden:0xfa3fe8 type: hidden name: p$l value: PAFAppLayout]
  [hidden:0xfa3e80 type: hidden name: p$st value: pmc]
  [hidden:0xfa3d2c type: hidden name: SessionId value: F4FC1F49237BDA11_0186SID]
  [hidden:0xfa3bc4 type: hidden name: Snapshot value: /projects/PMC/PMCStatic@2.60]
  [textarea:0xd543dc type:  name: PAFAppLayout.AppController.Page.PMCToPmidC.MainPortlet.Ids value: ]}
 {radiobuttons
  [radiobutton:0xfa4a10 type: radio name: PAFAppLayout.AppController.Page.PMCToPmidC.MainPortlet.from_db value: from_pmid]
  [radiobutton:0xfa4880 type: radio name: PAFAppLayout.AppController.Page.PMCToPmidC.MainPortlet.from_db value: from_pmcid]}
 {checkboxes
  [checkbox:0xfa459c type: checkbox name: PAFAppLayout.AppController.Page.PMCToPmidC.MainPortlet.ToFile value: false]}
 {file_uploads}
 {buttons
  [submit:0xfa4704 type: submit name: Clipboard value: Get IDs from PubMed clipboard]
  [submit:0xfa4420 type: submit name: ConvertButton value: Convert]
  [submit:0xfa42cc type: submit name: ClearButton value: Clear]}>

Compare that with the web page or its source code. We want to do the following:

  • check the radio button “PMID to PMCID (or NIHMSID)”
  • enter PMIDs in the text area
  • check the box for CSV file download

There are various ways to do all of those things:

# select the radio button by value
form.radiobutton_with(:value => "from_pmid").check
# select the text area by name and set values
form["PAFAppLayout.AppController.Page.PMCToPmidC.MainPortlet.Ids"] = "21707345\n23482678"
# select the CSV checkbox by name
form.checkbox_with(:name => "PAFAppLayout.AppController.Page.PMCToPmidC.MainPortlet.ToFile").check

Here the PMIDs are separated by a newline, but you could also use commas, spaces, semicolons or vertical bars.
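If the PMIDs start out in a Ruby array, the field value can be built with a join. This is a minimal sketch: pmid_field_value is a hypothetical helper name, not part of Mechanize, and the digits-only check assumes that every PMID is a plain integer.

```ruby
# Build the text-area value from a list of PMIDs.
# pmid_field_value is a hypothetical helper, not a Mechanize method.
def pmid_field_value(pmids, separator = "\n")
  # accept integers or strings, ignoring surrounding whitespace
  ids = pmids.map { |id| id.to_s.strip }
  # assumption: a valid PMID consists of digits only
  bad = ids.reject { |id| id.match?(/\A\d+\z/) }
  raise ArgumentError, "not a PMID: #{bad.join(', ')}" unless bad.empty?
  # drop duplicates, then join with the chosen separator
  ids.uniq.join(separator)
end

puts pmid_field_value([21707345, "23482678"], ",")
# 21707345,23482678
```

The return value can be assigned to the text area in place of the literal string used above.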

Finally, submit the form. There are 3 buttons in the form; we want the second one, which is named ConvertButton and has the value Convert:

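# submit using the second button (ConvertButton); the response is the CSV download
f = agent.submit(form, form.buttons[1])
# write the response body to a local file
f.save "ids.txt"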

Result – a file named ids.txt, with these contents:

PMID,PMCID,NIHMSID
21707345,-,NIHMSID331689
23482678,PMC3592971,NIHMSID392341
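The CSV is easy to work with using Ruby's standard csv library. This is a minimal sketch, assuming the output looks exactly like the file above and that a "-" means no ID was found; parse_id_table is a hypothetical helper name.

```ruby
require 'csv'

# Parse the converter's CSV output into an array of hashes keyed by column name.
# Assumption: "-" marks a missing ID, so it is converted to nil.
def parse_id_table(csv_text)
  CSV.parse(csv_text, headers: true).map do |row|
    row.to_h.transform_values { |v| v == "-" ? nil : v }
  end
end

# the same text as ids.txt; File.read("ids.txt") would work equally well
csv_text = <<~TEXT
  PMID,PMCID,NIHMSID
  21707345,-,NIHMSID331689
  23482678,PMC3592971,NIHMSID392341
TEXT

parse_id_table(csv_text).each do |r|
  # list the PMCID and NIHMSID when both exist, otherwise whichever is present
  ids = [r["PMCID"], r["NIHMSID"]].compact.join(", ")
  puts "#{r['PMID']}: #{ids}"
end
# 21707345: NIHMSID331689
# 23482678: PMC3592971, NIHMSID392341
```

This gives what the original question asked for: the NIHMSID alone when there is no PMCID, and both identifiers when a PMCID exists.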

My tip: study the web page and its source code carefully, making note of the names and values for the elements of interest. Those elements will then be much easier to find and alter when you come to work with the Mechanize object representation.
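Following that tip, here is a sketch that gathers the steps above into one method and finds the form and the button by attribute rather than by position, so that a reordered page is less likely to break it. It assumes the page still has the structure shown earlier, and I have not tested it beyond that. The names CONVERTER_URL, FIELD_PREFIX, field_name and convert_pmids are my own, chosen for illustration; form_with and button_with are the Mechanize finders that correspond to the radiobutton_with and checkbox_with calls used above.

```ruby
# Illustrative names, taken from the form dump above
CONVERTER_URL = "http://www.ncbi.nlm.nih.gov/pmc/pmctopmid/"
FIELD_PREFIX  = "PAFAppLayout.AppController.Page.PMCToPmidC.MainPortlet"

# Build a full field name such as "...MainPortlet.Ids"
def field_name(suffix)
  "#{FIELD_PREFIX}.#{suffix}"
end

# Submit a list of PMIDs and save the CSV response to outfile.
def convert_pmids(pmids, outfile)
  require 'mechanize' # loaded here so field_name can be used without the gem

  agent = Mechanize.new
  page  = agent.get(CONVERTER_URL)

  # find the conversion form by its action instead of page.forms[1]
  form = page.form_with(:action => "/pmc/pmctopmid/")
  raise "conversion form not found; has the page changed?" if form.nil?

  form.radiobutton_with(:value => "from_pmid").check
  form[field_name("Ids")] = pmids.join("\n")
  form.checkbox_with(:name => field_name("ToFile")).check

  # find the submit button by name instead of form.buttons[1]
  button = form.button_with(:name => "ConvertButton")
  agent.submit(form, button).save(outfile)
end

# Example call (needs network access and the mechanize gem):
# convert_pmids(["21707345", "23482678"], "ids.txt")
```

Raising when the form is missing makes a change to the page fail loudly, instead of surfacing later as a confusing nil error.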

4 thoughts on “Web scraping using Mechanize: PMID to PMCID/NIHMSID”

  1. Chris Maloney

    Hi, interesting post!
    We have a web service that backs up that web page, here: http://www.pubmedcentral.nih.gov/utils/idconv.cgi. Unfortunately it wasn’t documented anywhere, so I’m not surprised you weren’t able to find it. We’re in the process of upgrading it, and should have a new version out in a couple of weeks, and will have documentation and links in plain sight. If you’re interested, you can subscribe to our utils-announce mailing list (very low volume): http://www.ncbi.nlm.nih.gov/mailman/listinfo/pmc-utils-announce

    1. nsaunders Post author

      Thanks Chris. I did not investigate at all, since I was really just answering a question from someone else; will pass your comment along to them.

    2. Sammed Mandape

      Hi Chris and Neil,

      Neil – I would like to thank you again for this post. I was trying to duplicate your work in Python with this library and was struggling with some errors. But now I think I don’t have to. I greatly appreciate your help Neil. Thanks a ton.

      Chris – I was trying to do web scraping using your JS code on the PMC website to do this conversion automatically. I had been struggling for the past week to get this done. This is really important for my institution, to keep track of NIHMS IDs for NIH grants. And you worked wonders by letting me know about this eutils. Thanks Chris. I greatly appreciate your time and assistance. I will definitely subscribe to the mailing list.

      A quick question for you, though. Is there a limit on the number of IDs that can be given as input? I am sure this will be mentioned in the documentation when it is out, but I would like to know so that I can play around for now.

      Thank you, again, both of you for this magic! :)
