Can someone please plot the BioStar users on a Google Map?
Sounds like a challenge. Let’s go.
1. Harvesting user IP addresses
BioStar user profiles (here’s mine) include a location field. It’s free text and optional, which means that location is missing or inaccurate for many users. However, if you’re logged into BioStar (and perhaps, if you’re a moderator – I’m not sure), you’ll see a field that says:
Last activity: 4 hours ago from XXX.XXX.XXX.XXX
where “XXX.XXX.XXX.XXX” is either an IP address or, for your own page, the text “this IP address” (assuming your latest activity was from your current machine).
IP addresses can be used for geolocation – we’ll see how shortly. The problem is that they are only present when logged into BioStar, which uses OpenID for authentication. So to write code which automates the collection of user IP addresses, you’d have to convince BioStar that you were logged in.
I’m sure that it’s possible to write code which stores OpenID credentials and sends them to BioStar, but it would take some time to develop. So instead, I used a very ugly and largely manual approach. First, I wrote this simple Greasemonkey script:
// ==UserScript==
// @name BioStar IP
// @namespace http://twitter.com/neilfws
// @description Get user IP
// @include http://biostar.stackexchange.com/users/*
// ==/UserScript==

var d;
d = document.evaluate("//div[@class='summaryinfo']", document, null, XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE, null);
console.log(d.snapshotItem(0).innerHTML);
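On each user page, the script logs the contents of the summary box to the console, including a line like this:
Last activity: <span title="2010-10-03 23:06:52Z UTC" class="relativetime">Oct 3 at 23:06</span> from XXX.XXX.XXX.XXX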
Again, XXX.XXX.XXX.XXX is the IP address.
So I opened Firefox, installed the Greasemonkey and Firebug extensions, installed my user script, navigated to the BioStar users page, opened the Firebug console and started clicking through users. By choosing “Persist” and increasing the console log limit, I was able to record the IP address of each user in the console. When finished, I copied the console contents to a text file.
There is no worse solution, for a bioinformatician, than one that involves manual labour, copy and paste. Currently, there are 17 pages of users (16 x 35 + 1 x 11 = 571 total). My file contains 567 of them: at least one did not display an IP address and perhaps I missed a couple. This is why we learn to script.
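A quick way to check how many addresses actually made it into the copied console text is to pull them out with a regular expression and count them. The following is only a sketch: the method name `extract_ips` and the sample log lines (which use reserved documentation addresses) are made up for illustration, and the pattern assumes every address appears after the word “from”, as in the line shown above.

```ruby
# Sketch: extract IPv4 addresses from saved console text.
# extract_ips and the sample text are illustrative, not part of the original workflow.

IP_PATTERN = /from\s+(\d{1,3}(?:\.\d{1,3}){3})/

# Returns the addresses found after "from", in order of appearance,
# dropping anything with an octet above 255.
def extract_ips(text)
  text.scan(IP_PATTERN).flatten.select do |ip|
    ip.split(".").all? { |octet| octet.to_i <= 255 }
  end
end

sample = <<~LOG
  Last activity: <span title="2010-10-03 23:06:52Z UTC" class="relativetime">Oct 3 at 23:06</span> from 192.0.2.10
  Last activity: <span class="relativetime">2 days ago</span> from this IP address
  Last activity: <span class="relativetime">Oct 1 at 09:15</span> from 198.51.100.7
  Last activity: <span class="relativetime">Oct 1 at 09:20</span> from 192.0.2.10
LOG

ips = extract_ips(sample)
puts "lines with an IP: #{ips.size}"
puts "distinct IPs:     #{ips.uniq.size}"
```

Comparing the first count with the number of user pages visited shows how many profiles were missed or displayed no address. The distinct count shows whether any users share an address.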
2. Location using GeoIP
So how do we find location using IP? The answer is GeoIP.
First, head over to the MaxMind website and download their GeoIP C API. I installed it (for Ubuntu) like so:
wget http://geolite.maxmind.com/download/geoip/api/c/GeoIP.tar.gz
tar zxvf GeoIP.tar.gz
cd GeoIP-1.4.6
./configure --prefix=/opt/GeoIP
make
sudo make install

# install the city database
wget http://geolite.maxmind.com/download/geoip/database/GeoLiteCity.dat.gz
gunzip GeoLiteCity.dat.gz
sudo mv GeoLiteCity.dat /opt/GeoIP/share/GeoIP/
GeoIP comes with a free database of countries, located in /opt/GeoIP/share/GeoIP/GeoIP.dat. I also installed their free city database, as shown above.
Next, the Ruby gem for GeoIP:
[sudo] gem install mtodd-geoip -s http://gems.github.com/ -- --with-geoip-dir=/opt/GeoIP
Now, quick and very dirty Ruby code to read the text file containing IP addresses and look them up in the GeoIP database:
require "rubygems"
require "geoip"

ip = "ip.txt" # the text file containing IPs, copied from console.log
db = GeoIP::City.new("/opt/GeoIP/share/GeoIP/GeoLiteCity.dat")

# read the file line by line, keeping only lines that contain an IP address
File.foreach(ip) do |line|
  next unless line =~ /from\s+(\d+\.\d+\.\d+\.\d+)/
  lookup = db.look_up($1)
  next if lookup.nil? # skip addresses with no database record
  locn = [lookup[:country_name], lookup[:country_code], lookup[:city], lookup[:latitude], lookup[:longitude]]
  puts locn.join("\t")
end
That prints out a tab-delimited file, which looks like this:
United States	US	East Lansing	42.7282981872559	-84.4881973266602
Italy	IT	Rome	41.9000015258789	12.4833002090454
Portugal	PT	Fafe	41.4500007629395	-8.16670036315918
China	CN	Wuhan	30.5832996368408	114.266700744629
United States	US	Oklahoma City	35.4715003967285	-97.5189971923828
...
3. Plotting maps using R
Before we go all Google-y, let’s look at plotting geographical data using R. There are many libraries and mapping solutions, but here’s a simple script to plot our users on a world map. It requires the packages ggplot2 and maps. Assuming that the output from the Ruby script is saved in a file, biostar.tab:
library(ggplot2)
library(maps)

biostar <- read.table("biostar.tab", header = F, stringsAsFactors = F, sep = "\t")
colnames(biostar) <- c("country", "code", "city", "lat", "long")
world <- map_data("world")

png(file = "biostar.png", width = 1024, height = 768)
print(ggplot(world, aes(long, lat)) +
  geom_polygon(aes(group = group), fill = "darkslategrey") +
  geom_point(data = biostar, aes(long, lat), colour = "red"))
dev.off()
And here’s the result (click for the full-size version).
4. Plotting on a Google Map
There are many options for getting data into Google Maps. I figured that there must be a site where you can upload a simple CSV file containing latitude + longitude and display a Google Map. There is – it’s called ZeeMaps. It has many features – some free, some paid – which I’m yet to investigate fully.
For CSV upload, your file requires a column headed “Name” (I chose the city in my file), plus columns of coordinates headed “Latitude” and “Longitude”. All you need to do is create a new map, upload the file and select “refresh”. Here’s the map that I created. Unfortunately, it cannot be embedded in this blog post (click image, right, for a full-size screenshot). I have no idea if that link is permanent and I suspect that anyone can make alterations to the map.
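For what it’s worth, getting from the tab-delimited GeoIP output to a file with those three headers takes only a few lines of Ruby. This is a rough sketch, not the exact steps I used: the method name `tab_to_zeemaps_csv` is hypothetical, the column order is the one printed by the Ruby script earlier, and falling back to the country name when the city is blank is an assumption added here.

```ruby
require "csv"

# Sketch: convert tab-delimited GeoIP output into a CSV with the headers
# Name, Latitude and Longitude. tab_to_zeemaps_csv is a hypothetical name.
# Input columns are assumed to be: country, code, city, lat, long.
def tab_to_zeemaps_csv(tab_text)
  CSV.generate do |csv|
    csv << %w[Name Latitude Longitude]
    tab_text.each_line do |line|
      country, _code, city, lat, long = line.chomp.split("\t")
      # skip rows without coordinates
      next if lat.nil? || long.nil? || lat.empty? || long.empty?
      # assumption: use the country when no city was found
      name = city.to_s.empty? ? country : city
      csv << [name, lat, long]
    end
  end
end

sample = "United States\tUS\tEast Lansing\t42.7282981872559\t-84.4881973266602\n" \
         "Italy\tIT\tRome\t41.9000015258789\t12.4833002090454\n"
puts tab_to_zeemaps_csv(sample)
```

To convert the real file, pass `File.read("biostar.tab")` to the method and write the result to a `.csv` file for upload.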
Of course, IPs can be spoofed, users move around and the location of a machine might not reflect the location of the user. However, I think it’s a more reliable geolocation approach than an arbitrary text description. Now, if I could just automate that IP-harvesting code…