APIs: I wish the life sciences would learn from social networks

I was prompted by a thread on the apparent decline of FriendFeed to look for evidence of declining participation in my networks.

First, a quick and dirty Ruby script, tls.rb to grab the Life Scientists feed and count the likes and comments:


require 'rubygems'
require 'json/pure'
require 'open-uri'

# convert "2009-11-30T12:34:56Z" to "2009-11-30,12:34:56"
def format_date(d)
  if d =~ /(\d{4}-\d{2}-\d{2})T(\d{2}:\d{2}:\d{2})Z/
    "#{$1},#{$2}"
  else
    d
  end
end

# likes/comments can be absent from an entry
def count_items(i)
  if i.nil?
    0
  else
    i.count
  end
end

n = ARGV[0]
u = "http://friendfeed-api.com/v2/feed/the-life-scientists?start=#{n}"
f = open(u).read
j = JSON.parse(f)

j.each_pair do |k,v|
  if k == "entries"
    v.each do |entry|
      date     = format_date(entry['date'])
      likes    = count_items(entry['likes'])
      comments = count_items(entry['comments'])
      puts "#{entry['id']},#{date},#{likes},#{comments}"
    end
  end
end

By default, the API call returns the last 30 items, starting at zero. You can move back in time by running the script with an offset, for example “tls.rb 30”. Really, there should be a check that ARGV[0] is an integer, but in fact the argument can be absent or invalid and will simply be ignored. I did say quick and dirty.
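For the record, the missing check is only a few lines. A minimal sketch (the method name start_offset is my own, not part of the script above): fall back to zero when the argument is absent or not an integer.

```ruby
# Return the argument as an integer, or 0 when it is missing or invalid.
def start_offset(arg)
  Integer(arg)
rescue ArgumentError, TypeError
  0
end

n = start_offset(ARGV[0])
```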

The script prints CSV with entry ID, date, time, likes count and comments count, one line per entry.
One big drawback of the FriendFeed API is that you cannot retrieve entries by date, or a range of dates. By experimenting with values of “?start=N” in the URL, it seemed that N=3600 retrieved entries from late 2008 onwards. And so:

for i in `seq 0 30 3600`; do
  ./tls.rb $i >> ffdata-raw.csv
done

Be aware that this will not retrieve all posts for 2009 and there will also be duplicate entries – which we can filter out by entry ID. To remove duplicates and 2008 entries:

sort -u ffdata-raw.csv | grep ",2009-" > ffdata-filtered.csv

We’re not quite there yet. We have unique records, but several can share the same date, so we need to sum the likes and comments for each date. We should really have done that in the Ruby script, but awk can sum the likes for us as follows:

awk -F"," '{OFS=",";cnt1[$2]+=$4}END{for (x in cnt1){print x,cnt1[x]}}' ffdata-filtered.csv > ffdata-likes.csv

Just substitute $5 to sum the comments.
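Had we done the summing in Ruby as suggested, it would look something like this sketch (the method name sum_by_date is mine; it assumes lines of the form id,YYYY-MM-DD,HH:MM:SS,likes,comments, with a zero-based column index):

```ruby
# Sum one numeric column per date. Column 1 is the date;
# column 3 holds likes, column 4 holds comments.
def sum_by_date(lines, column)
  totals = Hash.new(0)
  lines.each do |line|
    fields = line.chomp.split(",")
    totals[fields[1]] += fields[column].to_i
  end
  totals
end
```

Called as sum_by_date(File.readlines("ffdata-filtered.csv"), 3) for likes, or with 4 for comments.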

Last step: read the file into R, download Paul Bleicher’s calendarHeat.R code and generate plots:

source("calendarHeat.R")
fflikes <- read.csv("ffdata-likes.csv", check.names=F, header=F)
png(filename="tls-likes.png", type="cairo", width=640)
calendarHeat(fflikes$V1, fflikes$V2, varname="Likes", color="r2b")
dev.off()

That was quick, relatively easy and most of all, fun.
In contrast, I’ve been trying to mine microarray data from the NCBI GEO database for the best part of 8 months now.
There’s an API of sorts but getting the results that I want is not quick, easy and most certainly not fun.

Is it any wonder that all the cool kids want to be web developers, not data scientists?

8 thoughts on “APIs: I wish the life sciences would learn from social networks”

  1. I agree, the Entrez Utilities, while useful, are often difficult to figure out. I generally find the web services at EBI (http://www.ebi.ac.uk/Tools/webservices/) easier to use.

    I suppose the nature of these services has this effect. The Entrez Utilities are generic, so the complexity tends to be in figuring out how to make the right query and then filter the response. The EBI services, in contrast, tend to be more specific, which makes it easier to ask the question and filter the result (if required), but the mechanisms used to access one service cannot be applied to other resources.

    Since you are looking at GEO, have you tried the ArrayExpress services (http://www.ebi.ac.uk/microarray/doc/help/programmatic_access.html)?

  2. Pingback: The Life Scientists at FriendFeed: 2009 summary « What You’re Doing Is Rather Desperate

  3. Pingback: APIs are powerful platforms

Comments are closed.