Category Archives: web resources

I’d be more than happy with the unlinked data web

Visit the URL in the code below and you’ll find a perfectly formatted CSV file containing information about recent earthquakes. A nice feature of R is the ability to slurp such a URL straight into a data frame:

quakes <- read.csv("http://neic.usgs.gov/neis/gis/qed.asc", header = TRUE)
colnames(quakes)
# [1] "Date"      "TimeUTC"   "Latitude"  "Longitude" "Magnitude" "Depth"
# number of recent quakes
nrow(quakes)
# [1] 3135
# biggest recent quake
subset(quakes, Magnitude == max(Magnitude, na.rm = TRUE))
#            Date    TimeUTC Latitude Longitude Magnitude Depth
# 2060 2010/02/27 06:34:14.0  -35.993   -72.828       8.8    35
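
For comparison, here is a rough sketch of the same idea in Ruby, the language used for the scripts later on this page. It parses an inline sample rather than the live URL, so it runs offline. The sample rows after the first, and the function name biggest_quake, are made up for illustration; only the column layout is taken from the output above.

```ruby
require "csv"

# Hypothetical sample in the same column layout as the USGS file above.
# The first data row is the quake shown above; the other two are invented.
SAMPLE = <<~CSV
  Date,TimeUTC,Latitude,Longitude,Magnitude,Depth
  2010/02/27,06:34:14.0,-35.993,-72.828,8.8,35
  2010/03/01,12:00:00.0,10.000,20.000,,10
  2010/03/02,01:02:03.0,-30.000,-70.000,5.1,40
CSV

# Return the row with the largest Magnitude, skipping rows where the
# value is missing (the counterpart of na.rm = TRUE in the R call).
def biggest_quake(csv_text)
  rows = CSV.parse(csv_text, headers: true)
  rows.reject { |r| r["Magnitude"].to_s.strip.empty? }
      .max_by { |r| r["Magnitude"].to_f }
end

quake = biggest_quake(SAMPLE)
puts "#{quake["Date"]} #{quake["TimeUTC"]} M#{quake["Magnitude"]}"
```

With a live source, the text passed to biggest_quake would presumably come from open-uri instead of the inline sample.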

I hear a lot about the “web of data” and the “linked data web” but honestly, I’ll be happy the day people start posting data as delimited, plain text instead of HTML and PDF files.

PhosphoGRID

I no longer work on protein kinases but when I did, PhosphoGRID is the kind of database that I would have wanted to see. It features:

  • A nice clean interface, with good use of JavaScript
  • Useful information returned from a simple search form
  • Data for download in plain text format with no restrictions or requirements for registration

All it lacks is a RESTful API, but nothing is perfect :-)

Published in the little-known but often-useful journal Database:

PhosphoGRID: a database of experimentally verified in vivo protein phosphorylation sites from the budding yeast Saccharomyces cerevisiae.
doi:10.1093/database/bap026.

How to: archive data via an API using Ruby and MongoDB

I was going to title this post “How to: archive a FriendFeed feed in MongoDB”. The example code does just that but (a) I fear that this blog suggests a near-obsession with FriendFeed (see tag cloud, right sidebar) and (b) the principles apply to any API that returns JSON. There are rare examples of biological data with JSON output in the wild, e.g. the ArrayExpress Gene Expression Atlas. So I’m still writing a bioinformatics blog ;-)

Let’s go straight to the code:

#!/usr/bin/ruby

require "rubygems"
require "mongo"
require "json/pure"
require "open-uri"

# db config
db  = Mongo::Connection.new.db('friendfeed')
col = db.collection('lifesci')

# fetch json
0.step(9900, 100) {|n|
  f = open("http://friendfeed-api.com/v2/feed/the-life-scientists?start=#{n}&num=100").read
  j = JSON.parse(f)
  break if j['entries'].count == 0
  j['entries'].each do |entry|
    if col.find({:_id => entry['id']}).count == 0
      entry[:_id] = entry['id']
      entry.delete('id')
      col.save(entry)
    end
  end
  puts "Processed entries #{n} - #{n + 99}", "Database contains #{col.count} documents."
}

puts "No more entries to process. Database contains #{col.count} documents."

Also available as a gist. Fork away.

A quick run-through. Lines 4-6 load the required libraries: mongo (the mongodb ruby driver), json and open-uri. If you don’t have the first two, simply “gem install mongo json_pure”. Of course, you’ll need to download MongoDB and have the mongod server daemon running on your system.

Lines 9-10 connect to the database (assuming a standard database installation). Rename the database and collection as you see fit. Both will be created if they don’t exist.

The guts are lines 12-25. A loop fetches JSON from the FriendFeed API, 100 entries at a time (0-99, 100-199…) up to 9999. That’s an arbitrarily-high number, to ensure that all entries are retrieved. Change “the-life-scientists” in line 14 to the feed of your choice. The JSON is then parsed into a hash structure. In lines 17-23 we loop through each entry and extract the “id” key, a unique identifier for the entry. This is used to create the “_id” field, a unique identifier for the MongoDB document. If a document with _id == id does not exist we create an _id key in the hash, delete the (now superfluous) ‘id’ key and save the document. Otherwise, the entry is skipped.
At some point the API will return no more entries: { “entries” : [] }. When this happens, we exit the block (line 16) and print a summary.
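
The page-and-skip logic described above can be tried without a network connection or a running mongod. The sketch below is not the script itself: FAKE_FETCH is a hypothetical stand-in for the HTTP request plus JSON.parse, a plain Hash stands in for the MongoDB collection, and the 250-entry feed is invented.

```ruby
# Page through a feed and store each entry once, keyed on its id.
# `fetch_page` is a hypothetical stand-in for the API call;
# `store` is a plain Hash standing in for the MongoDB collection.
def archive(fetch_page, store, page_size: 100, max_start: 9900)
  0.step(max_start, page_size) do |start|
    entries = fetch_page.call(start, page_size)["entries"]
    break if entries.empty?            # the API has run out of entries

    entries.each do |entry|
      id = entry["id"]
      next if store.key?(id)           # already archived, so skip it
      doc = entry.reject { |k, _| k == "id" }
      store[id] = doc.merge("_id" => id)
    end
  end
  store
end

# Invented feed of 250 entries, served in pages as the real API would.
FEED = (1..250).map { |i| { "id" => "e/#{i}", "body" => "entry #{i}" } }
FAKE_FETCH = ->(start, num) { { "entries" => FEED[start, num] || [] } }

store = {}
archive(FAKE_FETCH, store)
archive(FAKE_FETCH, store)   # a second run finds nothing new
puts "Database contains #{store.size} documents."
```

Running the archive twice leaves the document count unchanged, which is the point of checking for an existing _id before saving.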

That’s it, more or less. Obviously, the script would benefit from some error checking and more options (such as supplying a feed URL as a command line option). For entries with attached files, the file URL but not the attachment will be saved. A nice improvement would be to fetch the attachment and save it to the database, using GridFS.
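
As a starting point for those two improvements, here is a minimal sketch of command-line options and a retry wrapper using only the Ruby standard library. The option names, the script name archive.rb and the helper names parse_options and with_retries are all hypothetical; none of them appear in the original script.

```ruby
require "optparse"

# Parse command-line options; the defaults mirror the values
# hard-coded in the script above.
def parse_options(argv)
  opts = { feed: "the-life-scientists", num: 100 }
  OptionParser.new do |o|
    o.banner = "Usage: archive.rb [options]"
    o.on("-f", "--feed NAME", "feed to archive") { |v| opts[:feed] = v }
    o.on("-n", "--num N", Integer, "entries per request") { |v| opts[:num] = v }
  end.parse!(argv.dup)
  opts
end

# Run the block, retrying up to `tries` times if it raises.
def with_retries(tries: 3, wait: 0)
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue StandardError => e
    raise if attempt >= tries          # out of attempts: re-raise
    warn "attempt #{attempt} failed (#{e.message}), retrying"
    sleep wait
    retry
  end
end

opts = parse_options(%w[--feed some-other-feed -n 50])
puts "Archiving #{opts[:feed]}, #{opts[:num]} entries per request"

# Simulate a request that fails twice before succeeding.
result = with_retries(tries: 3) do |attempt|
  raise "temporary failure" if attempt < 3
  "ok after #{attempt} attempts"
end
puts result
```

In the real script the block passed to with_retries would wrap the open(...).read call, so that a single failed request does not abort the whole run.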

Possible uses: a simple archive, a backend for a web application to analyse the feed.
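
As a taste of the analysis side, here is a small sketch that ranks archived entries by activity. It works on plain hashes, so no database is needed. The "comments" and "likes" keys are assumptions about the layout of an entry, the sample entries are invented, and top_entries is a hypothetical helper name.

```ruby
# Rank entries by activity (comments + likes), busiest first.
# The "comments" and "likes" keys are assumed, not confirmed, field names.
def top_entries(entries, n = 10)
  entries.sort_by { |e| -(Array(e["comments"]).size + Array(e["likes"]).size) }
         .first(n)
end

# Invented sample documents in the shape the archive script would store.
SAMPLE_ENTRIES = [
  { "_id" => "e/1", "body" => "quiet item" },
  { "_id" => "e/2", "body" => "busy item", "comments" => [{}, {}, {}], "likes" => [{}] },
  { "_id" => "e/3", "body" => "liked item", "likes" => [{}, {}] }
]

top_entries(SAMPLE_ENTRIES, 2).each { |e| puts e["body"] }
```

Entries with neither key count as zero activity, so they sort to the bottom rather than raising an error.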


The Life Scientists at FriendFeed: 2009 summary

The Life Scientists 2009


It’s Christmas Eve tomorrow and so I declare the year over. My Christmas gift to you is a summary of activity in 2009 at the FriendFeed Life Scientists group. It’s crafted using R + Ruby, with raw data and some code snippets available. If you want to see the most popular items from the group this year, head down to the bottom of this post.

(Note: this post is a work in progress)

A brief survey of R web interfaces

I’m looking at ways to provide access to R via a web application. First rule: see what’s available before you reinvent the wheel. It’s not pretty.

From the R Web Interfaces FAQ:

Software            Brief notes
Rweb                Page last updated 1999. Of the 3 example links on the page one ran very slowly, the second not at all and the third is broken.
R-Online            Or rather, not online. Unless this CGI form is the same thing. I tried Example 1, it returned a server error.
Rcgi                Links to several CGI forms, none of which worked for me.
CGI-based R access  Link did not load.
CGIwithR            Package now maintained at Omegahat. Did not attempt installation. Last updated 2005.
Rpad                I could not connect to this URL.
RApache             The pick of the bunch. Provides server-side access to R through an Apache module. I was able to install RApache on 32-bit (but not 64-bit) Ubuntu 9.10 and get it running. Could use more documentation.
Rserve              Serves R via TCP/IP. Last updated 2006.
OpenStatServer      Broken link. No longer exists, so far as I can tell.
R PHP Online        Link out of date (but you can follow it to the newer page). Last updated 2003, so unlikely to be much use.
R-php               Last updated 2006; the example that I tried gave a server error.
webbioc             A Bioconductor package. Did not investigate further.
Rwui                An application to create R web interfaces. My browser hung at “waiting for cache”. I gave up.

So, aside from RApache and some very old-fashioned and/or broken CGI scripts, I conclude that there is little interest in writing beautiful, modern statistical web applications (notable exception). Not so much a case of “reinventing” as “inventing”.

Turn Emacs into an IDE

Update: I should have said Rails IDE – but I’m sure similar plugins are available for other languages

I fired up NetBeans at work today, tried to open a Rails project and, inexplicably, it crashed. All is well at home, so I’m blaming as-yet-unknown setup issues on the work machine (though I suspect they involve the letters “ATI”).

It got me thinking that, as much as I like NetBeans, it is still just a memory-eating, CPU-hogging, bloated Java-based GUI. For some time I’ve wanted to convert my favourite editor, Emacs, to something more like an IDE.

It's Emacs, but not as we know it


The WyeWorks Blog to the rescue. Install emacs-23 and a couple of Ruby gems, clone their github repository of Emacs plugins, copy to your ~/.emacs.d/ and voilà – marvel at your new, shiny editing environment. I also replaced my ~/.emacs with their init.el file.

The key plugins include ECB, textmate.el, Rinari and yasnippet, plus a bunch of modes for syntax highlighting. If you’ve only tried cursory Emacs customisation in the past the results are a little alarming at first, but you’ll be back to coding (and saying “Ooh! Aah!”) in no time at all.

FriendFeed Life Scientists: 14-day summary

Since I haven’t posted for 14 days, what better (and lazier) way to post something than to surf over to a 14-day summary from the Life Scientists Group and link to the top ten items!

  1. Review process files in the EMBO Journal – but why only for “the majority of papers”?
  2. How XML threatens Big Data. Or not. How JSON might be an alternative – or not.
  3. Solve any computer problem – with this classic XKCD flowchart.
  4. Science reviews the revolution in ‘strategic scientific reading’ – are they way behind the curve, or providing a useful summary for the uninitiated?
  5. Best practice in microbial genome annotation – spirited discussion on the nature of best bioinformatics practice.
  6. FriendFeed Life Scientists user survey – no further word on whether this will happen.
  7. 50 Years of Structure – link to a JMB review on the early days of structural biology.
  8. Reflections on Science Online London 2009
  9. Workflow tools that speak SOAP?
  10. Advice on cleaning up a protein sample – a nice example of useful discussion from the group.

Who knows, this could become a semi-regular feature.