Monitoring PubMed retractions: a Heroku-hosted Sinatra application

In a previous post analysing retractions from PubMed, I wrote:

It strikes me that it would be relatively easy to build a web application (Rails, Heroku), which constantly monitors retraction data at PubMed and generates a variety of statistics and charts.

“Relatively easy” it was. Let me introduce you to PMRetract, my first publicly-available web application.

1. The application
First, here is the application. It’s hosted at Heroku, with a MongoDB database provided by MongoHQ; more on those later. I’m using the free plan in each case. To date, the application has held up pretty well, but you may get the occasional timeout, and more than occasional if thousands of people start to hit the application at once. Which I doubt they will.

2. How it works
The application is written using Sinatra and provides five views – currently, they look quite rough but they’re functional – named Timeline, Cumulative, By Year, Journals and Date.

Timeline is the most dynamic of the views. Watch the video at right (full-screen or at YouTube is best) for a demonstration. The initial view is a timeline for all retractions in the database, starting from 1977. The user clicks and drags to select a rough date range, repeating the process until they see a record of interest. Clicking on the date under the record opens a new browser window (or tab) with the Date view, which shows links to the retraction notice and, where available, the retracted publications. A link on the chart resets the zoom.

The Cumulative view is a dual-axis chart (I know, bad practice). It shows the cumulative sum of PubMed articles since 1977 (green columns) and the cumulative proportion of retractions, expressed as retractions per 100 000 articles (blue line). Note that articles prior to 1977 (the year of the first retraction) are not included, hence there are around 16 000 000 articles, as opposed to about 20 000 000 in all of PubMed.

PubMed retractions, cumulative data, 1977-

The “By Year” view is very similar to the cumulative view. The difference is that the numbers are for each year in turn, rather than the cumulative sum of each year plus previous years. For example, hovering over the year 2000 shows that there were a total of 527 085 articles that year, of which about 4.36 per 100 000 were retracted: in fact, a fall compared with the previous two years.

PubMed retractions by year, 1977-
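The Cumulative and By Year views differ only in how the counts are aggregated before the rate is calculated. Here is a minimal sketch of that arithmetic in plain Ruby, not the code from main.rb. It assumes records shaped like the ecount collection; the method names and all rows except 1977 are made up for illustration.

```ruby
# Records shaped like the ecount collection. Only the 1977 row is taken from
# the sample record in this post; the other rows are hypothetical values.
ECOUNT = [
  { "year" => 1977, "retracted" => 3, "total" => 260_167 },
  { "year" => 1978, "retracted" => 2, "total" => 270_000 }, # hypothetical
  { "year" => 1979, "retracted" => 5, "total" => 280_000 }  # hypothetical
].freeze

# Retractions per 100 000 articles, rounded to two decimal places.
def rate_per_100k(retracted, total)
  (retracted * 100_000.0 / total).round(2)
end

# "By Year": each year is treated on its own.
def by_year(records)
  records.map do |r|
    { "year" => r["year"], "total" => r["total"],
      "rate" => rate_per_100k(r["retracted"], r["total"]) }
  end
end

# "Cumulative": running sums of both counts, up to and including each year.
def cumulative(records)
  retracted = 0
  total = 0
  records.sort_by { |r| r["year"] }.map do |r|
    retracted += r["retracted"]
    total += r["total"]
    { "year" => r["year"], "total" => total,
      "rate" => rate_per_100k(retracted, total) }
  end
end
```

Run in reverse, the same formula says that 4.36 per 100 000 of 527 085 articles in the year 2000 corresponds to roughly 23 retractions.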

Finally, the Journals view shows the top twenty journals, ordered with respect to total number of retraction notices. Hovering over a bar shows the journal abbreviation and count.

PubMed retractions: top 20 journals
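The ranking behind this view amounts to counting retraction notices per journal abbreviation and keeping the twenty largest counts. The sketch below shows one way to do that in plain Ruby. It assumes each record carries the abbreviation at MedlineCitation → Article → Journal → ISOAbbreviation, as in the sample entries record further down; the method names and sample journals are hypothetical, and the real application may do this differently.

```ruby
# Count retraction notices per journal abbreviation and return the top
# `limit` journals as [abbreviation, count] pairs.
def top_journals(entries, limit = 20)
  counts = Hash.new(0)
  entries.each do |entry|
    abbrev = entry.dig("MedlineCitation", "Article", "Journal", "ISOAbbreviation")
    counts[abbrev] += 1 if abbrev # skip records without an abbreviation
  end
  # Highest count first; ties are broken alphabetically for a stable order.
  counts.sort_by { |abbrev, count| [-count, abbrev] }.first(limit)
end

# Hypothetical helper that builds a minimal record for the example.
def entry_for(abbrev)
  { "MedlineCitation" => { "Article" => { "Journal" => { "ISOAbbreviation" => abbrev } } } }
end

# Made-up journal names, used only to exercise the method.
sample = ["J Example A", "J Example B", "J Example B", "J Example C",
          "J Example B", "J Example C"].map { |a| entry_for(a) }

top_two = top_journals(sample, 2)
```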

3. Implementation and deployment
Code and data are available in my PubMed Github repository; the Sinatra application code is in this directory, so I won’t go into too much detail here.

3.1 Database
The MongoDB database, pubmed, contains 3 collections: ecount, entries and timeline. These are generated and updated by running 3 corresponding Ruby scripts sequentially: ecnt2mongo.rb, xml2mongo.rb and timeline.rb. The timeline script/collection pre-calculates the data for the Timeline view, to speed up page rendering. Sample records from each collection look like this:

# db.ecount.findOne()
{ "_id" : 1977, "retracted" : 3, "total" : 260167, "year" : 1977 }
# db.entries.findOne()
{"_id"=>"21089224", "PubmedData"=>{"PublicationStatus"=>"ppublish", "ArticleIdList"=>{"ArticleId"=>"21089224"}, "History"=>{"PubMedPubDate"=>[{"Minute"=>"0", "Month"=>"11", "PubStatus"=>"entrez", "Day"=>"20", "Hour"=>"6", "Year"=>"2010"}, {"Minute"=>"0", "Month"=>"11", "PubStatus"=>"pubmed", "Day"=>"26", "Hour"=>"6", "Year"=>"2010"}, {"Minute"=>"0", "Month"=>"11", "PubStatus"=>"medline", "Day"=>"26", "Hour"=>"6", "Year"=>"2010"}]}}, "MedlineCitation"=>{"Status"=>"In-Process", "CitationSubset"=>"IM", "Owner"=>"NLM", "Article"=>{"PubModel"=>"Print", "ArticleTitle"=>"Retraction.", "Pagination"=>{"MedlinePgn"=>"894"}, "Language"=>"eng", "PublicationTypeList"=>{"PublicationType"=>"Retraction of Publication"}, "Journal"=>{"ISOAbbreviation"=>"Echocardiography", "JournalIssue"=>{"Issue"=>"7", "CitedMedium"=>"Internet", "PubDate"=>{"Month"=>"Aug", "Year"=>"2010"}, "Volume"=>"27"}, "Title"=>"Echocardiography (Mount Kisco, N.Y.)", "ISSN"=>"1540-8175"}}, "CommentsCorrectionsList"=>{"CommentsCorrections"=>{"PMID"=>"19490015", "RefSource"=>"Suma V, Makaryus AN, Rascon M, Doddamani S, Fan D, Boxt LM. Echocardiography. 2009 Jul;26(6):732-5", "RefType"=>"RetractionOf"}}, "PMID"=>"21089224", "DateCreated"=>{"Month"=>"11", "Day"=>"16", "Year"=>"2010"}, "MedlineJournalInfo"=>{"Country"=>"United States", "ISSNLinking"=>"0742-2822", "MedlineTA"=>"Echocardiography", "NlmUniqueID"=>"8511187"}}}
# db.timeline.findOne()
{
	"_id" : "Date_UTC(1977,7,12)",
	"date" : "Date.UTC(1977,7,12)",
	"count" : 1
}
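
To illustrate what pre-calculating the timeline involves, here is a rough sketch in plain Ruby that groups records by date and emits documents shaped like the timeline sample above. It is not the code from timeline.rb. Two things are assumptions: that the date comes from the DateCreated field, and that the month is stored zero-based, which is what JavaScript's Date.UTC expects.

```ruby
# Build timeline documents: one per calendar date, holding the number of
# retraction notices recorded for that date.
def timeline_docs(entries)
  counts = Hash.new(0)
  entries.each do |entry|
    d = entry.dig("MedlineCitation", "DateCreated") # assumed date field
    next unless d
    counts[[d["Year"].to_i, d["Month"].to_i, d["Day"].to_i]] += 1
  end

  # Sort chronologically, then format each date as a Date.UTC(...) string.
  counts.sort.map do |(year, month, day), count|
    js_month = month - 1 # assumption: zero-based month, as Date.UTC expects
    { "_id"   => "Date_UTC(#{year},#{js_month},#{day})",
      "date"  => "Date.UTC(#{year},#{js_month},#{day})",
      "count" => count }
  end
end

# Hypothetical helper that builds a minimal record for the example.
def created_on(year, month, day)
  { "MedlineCitation" => { "DateCreated" =>
    { "Year" => year.to_s, "Month" => month.to_s, "Day" => day.to_s } } }
end

docs = timeline_docs([created_on(2010, 11, 16), created_on(1977, 8, 12),
                      created_on(2010, 11, 16)])
```

Storing the date as a ready-made string means the view can drop it straight into the chart's JavaScript without any further work at render time.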

3.2 Sinatra
The Sinatra code uses one main application file, main.rb, which connects to the database, does calculations on the collections and specifies the views. A second file, statistics.rb, provides methods for additional formatting. Each view is a HAML file containing a :javascript directive and the code to generate a chart, using the Highcharts library. Here’s the timeline view code as an example.

3.3 Heroku and MongoHQ
Deployment to Heroku was not quite so straightforward as git push heroku master but was not far off; the main issue was just debugging a few files before deployment.

Adding a MongoHQ-hosted MongoDB database was quite straightforward, using these instructions. Tips for working directly with your MongoHQ database can be found in this support forum discussion.

Currently, I update a local MongoDB database first, just to check that there are no issues. I then dump the data and upload it to MongoHQ as follows:

mongodump -d pubmed -o dump/
mongorestore -h HOST:PORT -d DBNAME -u USERNAME -pPASSWORD --drop dump/pubmed

I had early issues with 30-second timeouts at Heroku, when each page refresh resulted in a new call to the database. This seems to be fixed (for now), by moving all database calls to the configure…do…end block in main.rb. In this way, the database calls are performed only once, at application startup. Of course, this means that the application should be restarted if the database is updated.
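
The pattern is simply "query once at startup, serve from memory afterwards". The sketch below shows it in plain Ruby rather than Sinatra, so it runs without any gems. FakeCollection and RetractionData are hypothetical stand-ins for the MongoDB collection and for whatever the configure block stores; they are not taken from main.rb.

```ruby
# Stand-in for a database collection that counts how often it is queried.
class FakeCollection
  attr_reader :queries

  def initialize(docs)
    @docs = docs
    @queries = 0
  end

  # Each call represents one round trip to the database.
  def find
    @queries += 1
    @docs.dup
  end
end

# Holds data loaded once at startup; views read from memory afterwards.
class RetractionData
  attr_reader :ecount

  def initialize(collection)
    @collection = collection
    reload!
  end

  # Re-query the database; the equivalent of restarting the application.
  def reload!
    @ecount = @collection.find.freeze
  end
end

collection = FakeCollection.new([{ "year" => 1977, "retracted" => 3, "total" => 260_167 }])
data = RetractionData.new(collection)

# Three simulated page views: none of them touches the database.
3.times { data.ecount }
queries_after_views = collection.queries

# After a database update, the data has to be reloaded explicitly.
data.reload!
queries_after_reload = collection.queries
```

The trade-off is the one noted above: page views are cheap, but the in-memory copy goes stale until the data is reloaded.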

4. Bugs, updates and improvements
There are sure to be bugs; in particular, some retraction records may be edge cases on which the record formatting code fails. There are also no tests just now. The Sinatra code is pretty ugly, but it works. I’ve reached the point where things work, mostly, as expected for me, and if you don’t go public at that point, you never will.

A notably-absent function is search. This is in part intentional; the application was designed as a timeline browser. However, it would be useful if titles/abstracts could be searched and matching records returned; this may or may not be added at a later date.

The process of updating the database is currently rather cumbersome. I’m working on a rake task to automate updates and work directly with the MongoHQ-hosted database. Until then, I try to update once a week or so.

Of course, there’s always room for additional features. For example, it would be useful if the Journals view linked to retraction records from those particular journals. Feel free to make suggestions and don’t be offended if I don’t respond to them. I have a day job, you know :-)

5. Summary
This little project has been a useful learning exercise and, I hope, provides a useful application. I’m quite impressed with the free Heroku and MongoHQ plans (16 MB database storage for the latter), which are more than adequate for small, fun projects. Hopefully, this will be the first in a long line of such experiments.
