Let’s (briefly) revisit the Nobel API

It’s always nice when 12-month-old code runs without a hitch. I’m not sure why this did not become a GitHub repo the first time around, but now it is: my RMarkdown code to generate a report using data from the Nobel Prize API.

Now you too can generate a “gee, it’s all old white men” chart, as seen in The Economist (Greying of the Nobel laureates), BBC News (Why are Nobel Prize winners getting older?) and, no doubt, many other outlets every year, including my own version, updated from 2015. As for myself, perhaps I should be offering my services to news outlets instead of publishing on blogs and obscure web platforms :)

R and the Nobel Prize API

The Nobel Prizes. Love them? Hate them? Are they still relevant, meaningful? Go on, admit it: you always imagined you would win one day.

Whatever you think of them, the 2015 results are in. What’s more, the good people of the Nobel Foundation offer us free access to data via an API. I’ve published a document showing some of the ways to access and analyse their data using R. Just to get you started:

library(jsonlite)
u     <- "http://api.nobelprize.org/v1/laureate.json"
nobel <- fromJSON(u)
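
As a quick first look at the data (and a hint at where the “old white men” chart comes from), you can tabulate laureates by gender. This is a minimal sketch, assuming the parsed JSON contains a laureates data frame with a gender column; field names may differ in other versions of the API.

# sketch: count laureates by gender
# assumes nobel$laureates is a data frame with a 'gender' column
laureates <- nobel$laureates
table(laureates$gender)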

In this post, just the highlights. Click the images for larger versions.

When is db=all not db=all? When you use Entrez ELink.

Just a brief technical note.

I figured that for a given compound in PubChem, it would be interesting to know whether that compound had been used in a high-throughput experiment, which you might find in GEO. Very easy using the E-utilities, as implemented in the R package rentrez:

library(rentrez)
links <- entrez_link(dbfrom = "pccompound", db = "gds", id = "62857")
length(links$pccompound_gds)
# [1] 741

Browsing the rentrez documentation, I note that db can take the value “all”. Sounds useful!

links <- entrez_link(dbfrom = "pccompound", db = "all", id = "62857")
length(links$pccompound_gds)
# [1] 0

That’s odd. In fact, this query does not even link pccompound to gds:

length(names(links))
# [1] 39
which(names(links) == "pccompound_gds")
# integer(0)

It’s not a rentrez issue, since the same result occurs using the E-utilities URL.
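
For reference, here is a sketch of the equivalent requests sent directly to the ELink endpoint, using the standard E-utilities base URL and the same parameters passed to rentrez:

# sketch: direct ELink requests, bypassing rentrez
base <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"
u1   <- paste0(base, "?dbfrom=pccompound&db=gds&id=62857")
u2   <- paste0(base, "?dbfrom=pccompound&db=all&id=62857")
# inspect the raw XML; at the time of writing, only the first
# response included a pccompound_gds link set
xml1 <- readLines(u1)
xml2 <- readLines(u2)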

The good people at rOpenSci have opened an issue and are contacting NCBI for clarification. We’ll keep you posted.

Farewell FriendFeed. It’s been fun.

I’ve been a strong proponent of FriendFeed since its launch. Its technology, clean interface and “data first, then conversations” approach have made it a highly successful experiment in social networking for scientists (and other groups). So you may be surprised to hear that, from today, I will no longer be importing items into FriendFeed, or participating in the conversations at other feeds.

Here’s a brief explanation and some thoughts on my online activity in the coming months.

APIs have let me down part 2/2: FriendFeed

In part 1, I described some frustrations arising out of a work project using the ArrayExpress API. I find that one way to deal mentally with these situations is to spend some time on a fun project, using similar programming techniques. A potential downside of this approach is that if your fun project goes bad, you’re really frustrated. That’s when it’s time to abandon the digital world, go outside and enjoy nature.

Here then, is why I decided to build another small project around FriendFeed, how its failure has led me to question the value of FriendFeed for the first time and why my time as a FriendFeed user might be up.

APIs have let me down part 1/2: ArrayExpress

The API – Application Programming Interface – is, in principle, a wonderful thing. You make a request to a server using a URL and back come lovely, structured data, ready to parse and analyse. We’ve begun to demand that all online data sources offer an API and lament the fact that so few online biological databases do so.

Is it better, though, to have no API at all than one which is poorly implemented and leads to frustration? I’m beginning to think so, after recent experiences with both a work project and one of my “fun side projects”. Let’s start with the work project: an attempt to mine a subset of the ArrayExpress microarray database.

How to: archive data via an API using Ruby and MongoDB

I was going to title this post “How to: archive a FriendFeed feed in MongoDB”. The example code does just that but (a) I fear that this blog suggests a near-obsession with FriendFeed (see tag cloud, right sidebar) and (b) the principles apply to any API that returns JSON. There are rare examples of biological data with JSON output in the wild, e.g. the ArrayExpress Gene Expression Atlas. So I’m still writing a bioinformatics blog ;-)

Let’s go straight to the code:

#!/usr/bin/ruby

require "rubygems"
require "mongo"
require "json/pure"
require "open-uri"

# db config
db  = Mongo::Connection.new.db('friendfeed')
col = db.collection('lifesci')

# fetch json
0.step(9900, 100) {|n|
  f = open("http://friendfeed-api.com/v2/feed/the-life-scientists?start=#{n}&num=100").read
  j = JSON.parse(f)
  break if j['entries'].count == 0
  j['entries'].each do |entry|
    if col.find({:_id => entry['id']}).count == 0
      entry[:_id] = entry['id']
      entry.delete('id')
      col.save(entry)
    end
  end
  puts "Processed entries #{n} - #{n + 99}", "Database contains #{col.count} documents."
}

puts "No more entries to process. Database contains #{col.count} documents."

Also available as a gist. Fork away.

A quick run-through. Lines 4-6 load the required libraries: mongo (the MongoDB Ruby driver), json and open-uri. If you don’t have the first two, simply “gem install mongo json_pure”. Of course, you’ll need to download MongoDB and have the mongod server daemon running on your system.

Lines 9-10 connect to the database (assuming a standard database installation). Rename the database and collection as you see fit. Both will be created if they don’t exist.

The guts are lines 12-25. A loop fetches JSON from the FriendFeed API, 100 entries at a time (0-99, 100-199…) up to 9999. That’s an arbitrarily high number, to ensure that all entries are retrieved. Change “the-life-scientists” in line 14 to the feed of your choice. The JSON is then parsed into a hash structure. In lines 17-23 we loop through each entry and extract the “id” key, a unique identifier for the entry. This is used to create the “_id” field, a unique identifier for the MongoDB document. If a document with _id == id does not exist, we create an _id key in the hash, delete the (now superfluous) ‘id’ key and save the document. Otherwise, the entry is skipped.
At some point the API will return no more entries: { “entries” : [] }. When this happens, we exit the block (line 16) and print a summary.

That’s it, more or less. Obviously, the script would benefit from some error checking and more options (such as supplying a feed URL as a command line option). For entries with attached files, the file URL but not the attachment will be saved. A nice improvement would be to fetch the attachment and save it to the database, using GridFS.

Possible uses: a simple archive, a backend for a web application to analyse the feed.
