PMRetract: PubMed retraction reporting rewritten as an interactive RMarkdown document

Back in 2010, I wrote a web application called PMRetract to monitor retraction notices in the PubMed database. It was written primarily as a way for me to explore some technologies: the Ruby web framework Sinatra, MongoDB (hosted at MongoHQ, now Compose) and Heroku, where the app was hosted.

I automated the update process using Rake and the whole thing ran pretty smoothly, in a “set and forget” kind of way for four years or so. However, the first era of PMRetract is over. Heroku have shut down git pushes to their “Bamboo Stack” – which runs applications using Ruby version 1.8.7 – and will shut down the stack on June 16 2015. Currently, I don’t have the time either to update my code for a newer Ruby version or to figure out the (frankly, near-unintelligible) instructions for migration to the newer Cedar stack.

So I figured now was a good time to learn some new skills, deal with a few issues and relaunch PMRetract as something easier to maintain and more portable. Here it is. As all the code is “out there” for viewing, I’ll just add a few notes here regarding this latest incarnation.

  1. Writing in RMarkdown has several advantages:
    • There are the usual advantages of literate documents – seeing the code together with the results, reproducibility.
    • Parsing PubMed XML files directly using R is an easier, more “lightweight” process than storage, retrieval and visualisation via a dedicated database.
    • The output is a single HTML file which is easy to distribute or host: for example, here at GitHub and here, published to RPubs using RStudio. Grab it yourself, use it however you like.
  2. There are a couple of slow procedures (several minutes each) that are better run from separate R scripts than from the RMarkdown document, for debugging purposes. These are (a) downloading PubMed XML and (b) retrieving total articles per year across five decades. Those scripts are here at GitHub. The RMarkdown document then reads their output.
  3. This project allowed me to explore the rCharts package. I had long wondered why, given the excellent plotting capabilities of R, anyone would want to provide a wrapper to JavaScript plotting libraries. The answer of course is that with tools such as RMarkdown, we can generate documents in HTML format where interactive JavaScript shines.
  4. Highcharts is still my library of choice. I know the cool kids use D3 but (a) I know Highcharts better and (b) I find the transformation between data and its graphical representation most intuitive in Highcharts. That’s just how my brain works, not a reflection of the other libraries.
  5. The publishing procedure is not quite as fully automated as it was using Rake; this shell script is my best attempt so far. However, it’s easy enough to compile and publish the document using RStudio whenever the notification feed updates.
  6. A couple of enhancements:
    • The clunky, confusing zoomable timeline showing retractions on specific dates has been replaced by a non-zoomable version showing retraction counts per year.
    • There’s always been some confusion as to whether we’re looking at data for retracted articles or their associated retraction notices – so now both types of data are shown, in separate clearly-labelled and coloured plots.
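
To give a feel for the “lightweight” parsing and the per-year counts mentioned above, here is a minimal sketch. It is written in Python rather than the R used in the actual report, and it is not the code behind PMRetract. The sample XML is made up and heavily trimmed, the function name count_by_year is hypothetical, and using the PubDate/Year element as “the year” is an assumption for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up, heavily trimmed records in the general shape of PubMed XML.
# The PMIDs and dates are placeholders, not real data.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000001</PMID>
      <Article><Journal><JournalIssue>
        <PubDate><Year>2009</Year></PubDate>
      </JournalIssue></Journal></Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000002</PMID>
      <Article><Journal><JournalIssue>
        <PubDate><Year>2009</Year></PubDate>
      </JournalIssue></Journal></Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000003</PMID>
      <Article><Journal><JournalIssue>
        <PubDate><Year>2010</Year></PubDate>
      </JournalIssue></Journal></Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000004</PMID>
      <Article><Journal><JournalIssue>
        <PubDate><MedlineDate>2010 Jan-Feb</MedlineDate></PubDate>
      </JournalIssue></Journal></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def count_by_year(xml_text):
    """Count records per publication year (hypothetical helper).

    Simplification: records that carry only a MedlineDate, with no
    Year element, are skipped rather than parsed.
    """
    root = ET.fromstring(xml_text)
    years = Counter()
    for article in root.iter("PubmedArticle"):
        year = article.findtext(".//PubDate/Year")
        if year is not None:
            years[int(year)] += 1
    # Return a plain dict ordered by year, ready for plotting.
    return dict(sorted(years.items()))


print(count_by_year(SAMPLE_XML))  # {2009: 2, 2010: 1}
```

A real run would read the downloaded XML file from disk instead of an inline string, and would need to decide what to do with the records that are skipped here.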

That’s it, more or less. Enjoy and let me know what you think.

4 thoughts on “PMRetract: PubMed retraction reporting rewritten as an interactive RMarkdown document”

  1. sjroyle7

    This is really nice and very useful. Looking at Section 4, it would be good to be able to see number of retractions at each journal normalised to the number of papers they publish. I was just thinking: JBC is out in front, but then it publishes a huge number of papers every week. I guess it is not so simple to do, because the volume of papers changes over time…

    1. nsaunders Post author

      Great suggestion, I’ve thought about that before. Problem is that number of retractions is very small relative to total publications so you end up with tiny numbers that are probably meaningless. I should add a note in the report to justify section 4.

      EDIT: OK, I need to think about this some more. Obviously dividing retracted by total is not sensible, but something like per 100K total as in the other sections could work.

      EDIT: Done! see the new HTML file and upcoming blog post…
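
A rough sketch of the “per 100K total” idea, for anyone following along. It is in Python rather than the R used in the report, the journal names and counts are made up purely to show the arithmetic, and per_100k is a hypothetical helper, not a function from the report.

```python
def per_100k(retracted, total):
    """Retractions per 100,000 total publications, keyed by journal.

    Journals with a zero total are left out to avoid dividing by zero.
    """
    return {
        journal: 100_000 * retracted.get(journal, 0) / total[journal]
        for journal in total
        if total[journal] > 0
    }


# Made-up counts, not real PubMed data.
retracted = {"Journal A": 50, "Journal B": 5}
total = {"Journal A": 100_000, "Journal B": 2_000}

rates = per_100k(retracted, total)
print(rates)  # {'Journal A': 50.0, 'Journal B': 250.0}
```

In this toy case the journal with ten times more retractions has the lower rate once its much larger output is taken into account, which is the point of normalising.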

  2. Richard Van Noorden

    Neil, can you try to separate retraction notices per journal by date bracket? When I wrote a feature about retractions I noted that the big journals (Nature, Science) seemed to be retracting approx the same number during 2006-10 as they were in 2000-2005. The reason for the sudden rise seemed to be that all kinds of new, smaller journals that had rarely retracted articles in 2000-2005 upped their game in 2006-10. It would be interesting to see this mapped out. (Except that we are only talking about PubMed, not all retraction notices, but still useful).
