PubMed searching: experiments in Javascript

Andrew asks:

…anyone know of a tool that will take a pubmed query and plot the number of articles by year?

I figured that this was a good excuse to improve my lowly Javascript skills by building a toy web application.

First, video proof that I did get something to work, up to a point:

Next, some code. I created a Sinatra application, with the directory structure shown. It’s fairly simple: one main file, app.rb, a “spinner” graphic to indicate loading operations, the jQuery and Highcharts Javascript libraries and one view, wrapped in a layout.

Sinatra pubmed application file tree

Next, the code in app.rb. It’s about as simple as it gets:

require "rubygems"
require "sinatra"
require "haml"

get "/" do
  haml :index
end

Layout, controlled by layout.haml, simply loads the javascripts and creates a DIV element for content:

!!! XML
!!!

%html
  %head
    %title PubMed terms by year
    %script{:type => "text/javascript", :src => "/javascripts/jquery.js"}
    %script{:type => "text/javascript", :src => "/javascripts/highcharts.js"}
    %script{:type => "text/javascript", :src => "/javascripts/exporting.js"}
  %body
    %div
      = yield

The action happens in index.haml. First, the elements for content:

%div
  %span{:id => "myform"}
    %input{:type => "text",   :name  => "terms",  :id => "terms"}
    %input{:type => "submit", :value => "Search", :id => "button"}
  %span{:id => "loader"}
    %img{:src => "images/spinner.gif"}

%div{:id => "container"}

Since there is no server-side (Ruby) processing, I didn’t want to mess around with form submission, so I simply included input and button elements without wrapping them in a form. This is probably very bad practice with regard to valid HTML but this is just a toy application, so I don’t care very much.

Next, the Javascripts. Ideally, these should be saved in public/javascripts and loaded by the layout but for testing purposes, I wrote them inline. Please bear with me, I’m not a great Javascript programmer.

The first one simply hides the content (the animated spinner GIF) of the element with ID loader, except when an AJAX process is running. Thanks to nickf at StackOverflow for that tip.

:javascript
  $('#loader')
      .hide()  // hide it initially
      .ajaxStart(function() {
          $(this).show();
      })
      .ajaxStop(function() {
          $(this).hide();
      });

The second runs the PubMed query, parses the results and plots a chart of publications by year.

:javascript
  $("#button").click(function() {
    var terms = $("#terms").val();
    var dates = [];
    var d = [];
    var args  = {'apikey' : 'YOUR ENTREZ-AJAX API KEY',
                 'db'     : 'pubmed',
                 'term'   :  terms,
                 'retmax' : 5000,          // maximum number of results from Esearch
                 'max'    : 5000,          // maximum number of results passed to Esummary
                 'start'  : 0
                 };
    $.getJSON('http://entrezajax.appspot.com/esearch+esummary?callback=?', args, function(data) {
      if(data.entrezajax.error == true) {
        $("#container").html('<p>' + 'Sorry - EntrezAjax failed with error ' + data.entrezajax.error_message + '</p>');
        return;
      }
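      $.each(data.result, function(i, item) {
        var year = /^\d{4}/.exec(item.PubDate);   // null when PubDate has no leading 4-digit year
        if(year) dates.push(year[0]);             // keep the year string; skip records without one
      });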
      // count by year
      var count = {};
      for(var i = 0; i < dates.length; i++)
        if(count[dates[i]]) {
          count[dates[i]]++;
          }
        else {
          count[dates[i]] = 1;
        }
      // create data array
      for(var y in count)
        d.push([Date.UTC(y, 0, 1), count[y]]);
      // build chart
      var options = {
        chart: {
          renderTo: 'container',
          defaultSeriesType: 'column',
          width : 900
        },
        title : {
          text : terms + ' - ' + dates.length + ' total'
        },
        legend : {
          enabled : false
        },
        credits : {
          enabled : false
        },
        tooltip : {
          formatter: function() {
            return Highcharts.dateFormat('%Y', this.x) + ' : ' + this.y + ' entries';
          }
        },
        xAxis : {
          type : 'datetime',
          dateTimeLabelFormats : {
            year : '%Y'
            }
        },
        yAxis : {
          title : { text : 'Entries' }
        },
        series: [{
          data: d
        }]
      };
      var chart = new Highcharts.Chart(options);
    });
  });

Let’s work through that. Searching happens in lines 2-17. When the button is clicked, the search terms are passed to Entrez-AJAX and a result is returned. If there was an error, the error message is displayed in the DIV with ID = ‘container’ and the program stops. See the Entrez-AJAX API documentation for the details.

Results are parsed in lines 18-33. The value of PubDate is extracted for each record. If it begins with 4 digits (the year), it’s pushed onto an array (dates); otherwise, the record is not counted. Next, we step through the array and build an object, count, where the keys are years and the values the sum of publications for that year. Finally, we step through the object and build an array (d) with elements of the form “[Date.UTC(YYYY, 0, 1), N]”. Here, YYYY is the year and N the count of publications for that year. The month (0) and day (1) are arbitrary; they could also (and perhaps should) be 11 (Dec) and 31.
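
As a standalone illustration of that counting step, here is a minimal sketch that runs without jQuery or Highcharts. The function names (countByYear, toSeries) and the sample PubDate strings are made up for this example; they are not part of the application above.

```javascript
// Count publications per year from PubDate strings such as "2009 Jun 15".
// Entries that do not begin with four digits are skipped.
function countByYear(pubDates) {
  var counts = {};
  pubDates.forEach(function (pubDate) {
    var match = /^\d{4}/.exec(pubDate || "");
    if (match) {
      counts[match[0]] = (counts[match[0]] || 0) + 1;
    }
  });
  return counts;
}

// Convert {year: n} into [[timestamp, n], ...] sorted by year,
// using 1 January of each year as the (arbitrary) point on the x-axis.
function toSeries(counts) {
  return Object.keys(counts).sort().map(function (year) {
    return [Date.UTC(Number(year), 0, 1), counts[year]];
  });
}

// Hypothetical sample input: two records from 2009, one from 2010,
// and two without a leading year that should be ignored.
var sample = ["2009 Jun 15", "2010", "2009 Dec", "Winter 2008", ""];
var series = toSeries(countByYear(sample));
console.log(series);
```

Sorting the keys before building the array keeps the points in chronological order, which is probably what a chart with a datetime axis wants.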

The chart is built in lines 35-68. First we create the options, then add them to a new Highcharts.Chart object, which is rendered in DIV ID = ‘container’. Refer to the Highcharts documentation for the details.

In the video clip, you’ll note that the final query fails with “ApplicationError : 1”. This seems to happen when the query returns too many results. In fact, only queries that return fewer than a couple of hundred results seem to work with Entrez-AJAX. This prompted me to tweet:

yes, these fancy-pants ajaxified web apps are all very well but to do real work, you need to download full datasets

Don’t get me wrong: I’m not criticising Entrez-AJAX, which is a great piece of work. It’s just not designed to retrieve a large number of results, and neither is any other web application; it would simply take too long. I could, for example, write a function in app.rb to run Esearch and Esummary, then fetch and parse the XML results. However, for queries returning a large number of results, a user would be staring at a blank web page for a very long time. Life would be easier if NCBI improved their API so that only specified fields (such as dates) were returned, but that seems unlikely at present.

The take-home message: if you want publications by year for a particular topic, don’t expect a web application to fetch the data on the fly. You’ll need either a local database or some R code that you can leave running in the background while you do something else.

7 thoughts on “PubMed searching: experiments in Javascript”

  1. Jonathan Badger

    I love Sinatra. Some day I will actually get around to learning Rails, but Sinatra works for the sort of simple Web apps I need, and I rather like the idea of being a nearly 100% Ruby person who is not a Rails programmer, as so many people equate the two.

    1. nsaunders Post author

      I absolutely agree on all counts. Sinatra is just great for throwing together a toy app in no time at all. When you’re “more Ruby than Rails”, you realise the extent to which all those useful Rails helper methods rely on polluting the Ruby namespace. That said, I like Rails too for those projects that warrant the investment.

  2. alf

    You could also do a separate query for each year (the DP field in PubMed) and use the matched articles count from the ESearch response, avoiding having to fetch all the data.

  3. Nick Loman

    Neil, unfortunately EntrezAJAX is bound by several limitations which make life a bit harder than it needs to be, particularly those enforced by Google App Engine: the length of time allowed for retrieving the results (10 seconds) and the size of the data returned (10 MB). But both of these limits are reasonable for a standard application where all the results are being returned to the user (rather than an aggregation, as in this case).

    I agree with Alf’s suggestion to create a search for each year of interest. If you do this you can actually trim down the retmax/max count parameters to 20 (or even 0) and use the ‘count’ field in the returned dictionary for the data point. This should work a lot better.

    Ideally Entrez would supply a more appropriate method for this kind of data-mining exercise which I could then provide access to via EntrezAJAX.

    1. nsaunders Post author

      Thanks Nick, for the comment and useful tips. As you say, we are bound by GAE and the limitations of the Entrez API when it comes to larger datasets.
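
The per-year approach that alf and Nick describe above could be sketched roughly as follows. This only builds the query arguments, one set per year, and makes no network calls. The helper name buildYearQueries is hypothetical, and the exact location of the 'count' field in the Entrez-AJAX response is an assumption that I have not checked, so reading it is left out.

```javascript
// Build one set of Entrez-AJAX arguments per publication year.
// Assumptions: the [DP] tag restricts a PubMed term to the date of
// publication, and retmax/max of 0 returns the match count without
// any records, as suggested in the comments.
function buildYearQueries(terms, firstYear, lastYear, apiKey) {
  var queries = [];
  for (var year = firstYear; year <= lastYear; year++) {
    queries.push({
      year: year,
      args: {
        apikey: apiKey,
        db: "pubmed",
        term: "(" + terms + ") AND " + year + "[DP]",
        retmax: 0,
        max: 0,
        start: 0
      }
    });
  }
  return queries;
}

// Hypothetical usage: three years of queries for one search term.
var queries = buildYearQueries("hox genes", 2008, 2010, "YOUR ENTREZ-AJAX API KEY");
queries.forEach(function (q) {
  console.log(q.year + ": " + q.args.term);
});
```

Each args object could then be passed to $.getJSON in the same way as in the post, with the returned count used as the data point for that year. That means one small request per year instead of one very large one.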
