Web scraping using Mechanize: PMID to PMCID/NIHMSID

Web services are great. Pass them a URL. Structured data comes back. Parse it, analyse it, visualise it. Done.

Web scraping – interacting programmatically with a web page – is not so great. It requires more code and when the web page changes, the code breaks. However, in the absence of a web service, scraping is better than nothing. It can even be rather satisfying. Early in my bioinformatics career the realisation that code, rather than humans, can automate the process of submitting forms and reading the results was quite a revelation.

In this post: how to interact with a web page at the NCBI using the Mechanize library.


How to: remember that you once knew how to parse KEGG

Recently, someone asked me if I could generate a list of genes associated with a particular pathway. Sure, I said, and hacked together some rather nasty code in R which, given a KEGG pathway identifier, used a combination of the KEGG REST API, DBGET and biomaRt to return HGNC symbols.

Coincidentally, someone asked the same question at Biostar. Pierre recommended the TogoWS REST service, which provides an API to multiple biological data sources. An article describing TogoWS was published in 2010.

An excellent suggestion – and one which, I later discovered, I had bookmarked. Twice. As long ago as 2008. This “rediscovery of things I once knew” happens to me with increasing frequency now, which makes me wonder whether (1) we really are drowning in information, (2) my online curation tools/methods require improvement or (3) my mind is not what it was. Perhaps some combination of all three.

Anyway – using Ruby (1.8.7), getting a list of HGNC symbols for a KEGG pathway, e.g. MAPK signaling, is as simple as:

require 'rubygems'
require 'open-uri'
require 'json/pure'

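# fetch the genes for KEGG pathway hsa04010 (MAPK signaling) as JSON
j = JSON.parse(open("http://togows.dbcls.jp/entry/pathway/hsa04010/genes.json").read)
# each value starts with the HGNC symbol, terminated by a semicolon
g = j.first.values.map {|v| /^(.*?);/.match(v)[1] }
# first 5 genes
g.first(5)
# ["MAP3K14", "FGF17", "FGF6", "DUSP9", "MAP3K6"]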

This code parses the JSON returned from TogoWS into an array with one element; the element is a hash with key/value pairs of the form:

"9020"=>"MAP3K14; mitogen-activated protein kinase kinase kinase 14 [KO:K04466] [EC:]"

Values for all keys that I’ve seen to date begin with the HGNC symbol followed by a semicolon, making extraction quite straightforward with a simple regular expression.
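
The extraction step can be tried without a network call. The sketch below is Ruby and uses only the single key/value pair shown above as sample data. The helper name hgnc_symbol is made up for illustration, and the nil guard is an addition: the one-liner would raise an error on any value that lacks a semicolon.

```ruby
# Sample data: the one key/value pair quoted above; a real response holds many more
genes = {
  "9020" => "MAP3K14; mitogen-activated protein kinase kinase kinase 14 [KO:K04466] [EC:]"
}

# Hypothetical helper: return the text before the first semicolon,
# or nil when the value contains no semicolon
def hgnc_symbol(description)
  m = /\A(.*?);/.match(description)
  m ? m[1].strip : nil
end

# Drop any values that did not match rather than failing on them
symbols = genes.values.map { |v| hgnc_symbol(v) }.compact
puts symbols.inspect
# ["MAP3K14"]
```

Whether silently skipping non-matching values is the right choice depends on the use case; raising an error may be preferable if every entry is expected to carry a symbol.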

A brief note: R 3.0.0 and bioinformatics

Today marks the release of R 3.0.0. There will be plenty of commentary and useful information at sites such as R-bloggers (for example, Tal’s post).

Version 3.0.0 is great news for bioinformaticians, due to the introduction of long vectors. What does that mean? Well, several months ago, I was using the simpleaffy package from Bioconductor to normalize Affymetrix exon microarrays. I began as usual by reading the CEL files:

library(simpleaffy)
f <- list.files(path = "data/affyexon", pattern = "\\.CEL\\.gz$", full.names = TRUE, recursive = TRUE)
cel <- ReadAffy(filenames = f)

When this happened:

Error in read.affybatch(filenames = l$filenames, phenoData = l$phenoData,  : 
  allocMatrix: too many elements specified

I had a relatively large number of samples (337), but figured a 64-bit machine with ~ 100 GB RAM should be able to cope. I was wrong: before version 3.0.0, R limited vectors (and therefore matrices) to 2^31 - 1 elements, so my matrix had become too large regardless of available memory. See this post and this StackOverflow question for the computational details.
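
A rough check of the numbers, sketched in Ruby. The chip type is an assumption (Human Exon 1.0 ST arrays, 2560 x 2560 cells per CEL file), since the sample count is the only figure given here; the limit of 2^31 - 1 elements is the maximum vector length in R before 3.0.0.

```ruby
# Assumption: Human Exon 1.0 ST arrays, 2560 x 2560 cells per CEL file
cells_per_array = 2560 * 2560   # 6,553,600 intensities per sample
samples         = 337           # sample count from the post
max_elements    = 2**31 - 1     # vector length limit in R before 3.0.0

# The intensity matrix has one row per cell and one column per sample
total = cells_per_array * samples

puts "matrix elements: #{total}"
puts "limit:           #{max_elements}"
puts "too large?       #{total > max_elements}"
puts "largest sample count that fits: #{max_elements / cells_per_array}"
```

Under that assumption the matrix needs 2,208,563,200 elements, just over the limit of 2,147,483,647, and anything beyond 327 samples would fail in the same way however much RAM is installed.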

My solution at the time was to resort to Affymetrix Power Tools. Hopefully, the introduction of long vectors will make Bioconductor even more capable and useful.

Basic R: rows that contain the maximum value of a variable

File under “I keep forgetting how to do this basic, frequently-required task, so I’m writing it down here.”

Let’s create a data frame with a column, vars, that holds five variables named A – E, each of which appears twice, along with some measurements:

df.orig <- data.frame(vars = rep(LETTERS[1:5], 2), obs1 = c(1:10), obs2 = c(11:20))
#    vars obs1 obs2
# 1     A    1   11
# 2     B    2   12
# 3     C    3   13
# 4     D    4   14
# 5     E    5   15
# 6     A    6   16
# 7     B    7   17
# 8     C    8   18
# 9     D    9   19
# 10    E   10   20

Now, let’s say we want only the rows that contain the maximum values of obs1 for A – E. In bioinformatics, for example, we might be interested in selecting the microarray probeset with the highest sample variance from multiple probesets per gene. The answer is obvious in this trivial example (rows 6 – 10), but one procedure looks like this:
Read the rest…

Addendum to yesterday’s post on custom CSS and R Markdown

Updates from RStudio support:
(1) “Thanks for reporting and I was able to reproduce this as well. I’ve filed a bug and we’ll take a look.”
(2) “Taking a further look, this is actually a bug in the Markdown package and we’ve asked the maintainer (Jeffrey Horner) to look into it.”

As juejung points out in a comment on my previous post, applying custom CSS to R Markdown by sourcing the custom rendering function breaks the rendering of inline equations.

I’ve opened an issue with RStudio support and will update here with their response. In the meantime, one solution to this problem is:

  1. Do not create the files custom.css or style.R, as described yesterday
  2. Instead, just put the custom CSS at the top of your R Markdown file using style tags, as shown below
<style type="text/css">
table {
   max-width: 95%;
   border: 1px solid #ccc;
}

th {
  background-color: #000000;
  color: #ffffff;
}

td {
  background-color: #dcdcdc;
}
</style>

Custom CSS for HTML generated using RStudio

Update August 5 2014: I noticed this post is getting some hits. Please note that it is an old post; it’s probably outdated and there’s likely to be a better solution by now.

People have been telling me for a while that the latest version of RStudio, the IDE for R, is a great way to generate reports. I finally got around to trying it out and for once, the hype is justified. Start with this excellent tutorial from Jeremy Anglim.

Briefly: the process is not so different to Sweave, except that (1) instead of embedding R code in LaTeX, we embed R code in a document written using R Markdown; (2) instead of Sweave, we use the knitr package; (3) the focus is on generating HTML documents for publishing to the Web (see e.g. RPubs), although knitr can also generate PDF documents, just like Sweave.

It took me a little while to figure out a couple of things. First, how best to generate HTML tables, ideally using the xtable package. Second, how to override the default RStudio/R Markdown style. I’ve documented those tasks in this post.

PMRetract: now with rake tasks

Bioinformaticians (and anyone else who programs) love effective automation of mundane tasks. So it may amuse you to learn that I used to update PMRetract, my PubMed retraction notice monitoring application, by manually running the following steps in order:

  1. Run query at PubMed website with term “Retraction of Publication[Publication Type]”
  2. Send results to XML file
  3. Run script to update database with retraction and total publication counts for years 1977 – present
  4. Run script to update database with retraction notices
  5. Run script to update database with retraction timeline
  6. Commit changes to git
  7. Push changes to GitHub
  8. Dump local database to file
  9. Restore remote database from file
  10. Restart Heroku application

I’ve been meaning to wrap all of that up in a Rakefile for some time. Finally, I have. Along the way, I learned something about using efetch from BioRuby and re-read one of my all-time favourite tutorials, on how to write rake tasks. So now, when I receive an update via RSS, updating should be as simple as:

rake pmretract
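
To give a flavour of how steps like those listed above can be chained, here is a minimal sketch using the Rake DSL. It is not the actual PMRetract Rakefile: the task names are hypothetical, and each task only records a label instead of running a real command. It is written as a plain Ruby script (hence the explicit require and invoke) so that it runs on its own.

```ruby
require 'rake'
extend Rake::DSL   # makes task/desc available at the top level of a plain script

# Records each step as it runs; stands in for real scripts and shell commands
STEPS_RUN = []

# Hypothetical task names, each covering a group of the manual steps
task :fetch do
  STEPS_RUN << "fetch PubMed results as XML"
end

# A task's prerequisites run before the task itself
task :update_db => :fetch do
  STEPS_RUN << "update database"
end

task :commit => :update_db do
  STEPS_RUN << "commit and push"
end

task :deploy => :commit do
  STEPS_RUN << "dump, restore and restart"
end

desc "Run the whole update in order"
task :pmretract => :deploy

# In a real Rakefile this line is not needed; 'rake pmretract' does the same job
Rake::Task[:pmretract].invoke
puts STEPS_RUN
```

Because each task names the previous one as a prerequisite, invoking the last task runs the whole chain in order, and Rake runs each task at most once per invocation.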

In other news: it’s been quiet here, hasn’t it? I recently returned from 4 weeks overseas, packed up my office and moved to a new building. Hope to get back to semi-regular posts before too long.

R gotcha for the week

I use the biomaRt package from Bioconductor in almost every R session. So I thought I’d load the library and set up a mart instance in my ~/.Rprofile:

library(biomaRt)
mart.hs <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")

On starting R, I was somewhat perplexed to see this error message:

Error in bmVersion(mart, verbose = verbose) : 
  could not find function "read.table"

Twitter to the rescue. @hadleywickham told me to load utils first and @vsbuffalo explained that normally, .Rprofile is read before the utils package is loaded. Seems rather odd to me; I’d have thought that biomaRt should load utils if required, but there you go.

So this works in ~/.Rprofile:

library(utils)
library(biomaRt)
mart.hs <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")