Experiments with igraph

Networks – social and biological – are all the rage, just now. Indeed, a recent entry at Duncan’s QOTD described the “hairball” network representation as the dominant cultural icon in molecular biology.

I’ve not had occasion to explore networks “professionally”, but have always been fascinated by both networks and the tools used to analyse them. My grasp of graph theory, the mathematics behind networks, is more or less summarised by this Wikipedia page. I’ve also been exploring the igraph library and thought I’d share a few of my “experiments with igraph”. As I say, I’m learning myself as I go along, so none of this should be taken as professional advice.

Let’s start with my favourite network – FriendFeed of course – and ask a few questions about everyone’s favourite group, The Life Scientists (TLS).

My initial thought was: to what extent is TLS a community? In other words, do people who subscribe to TLS also tend to subscribe to one another?

1. Fetching the raw data from FriendFeed
To begin, I used the FriendFeed API to fetch subscribers to TLS and then subscriptions within TLS subscribers. It’s becoming confusing already – roll the Ruby:

#!/usr/bin/ruby

require "open-uri"
require "json/pure"

# get TLS feed info
subscribers = []
# auth = [USER, REMOTE_KEY]
tls  = JSON.parse(open("http://friendfeed-api.com/v2/feedinfo/the-life-scientists", :http_basic_authentication => auth).read)
# only public feeds
tls['subscribers'].each do |sub|
  if sub['private'].nil?
    subscribers << sub['id']
  end
end

# user subscriptions within TLS
puts "There are #{subscribers.length} public feeds."
network = []
subscribers.each do |id|
  puts "Fetching #{id}..."
  feed = JSON.parse(open("http://friendfeed-api.com/v2/feedinfo/#{id}").read)
  network << [id, "the-life-scientists"]
  feed['subscriptions'].each do |s|
    if subscribers.include?(s['id'])
      network << [id, s['id']]
    end
  end
  sleep 2
end

# write out pairs
File.open("network.csv", "w") do |f|
  network.each do |pair|
    f.puts(pair.join(","))
  end
end

Lines 6-15 fetch information about the TLS feed. A few words about privacy. If you uncomment line 8 and substitute your FriendFeed user name and remote key, line 9 will make an authenticated request which returns information for users with private feeds, if you are subscribed to them. However, those people may not want their data exposed to non-subscribers. So, in lines 11-15, we take user names and only store them in an array if their feed is public. The upshot of all that is that of 1 323 users that I can access, we use only 1 128 with public feeds.

Lines 17-30 process each TLS subscriber and their subscriptions, storing the results in the network array. Only subscriptions to users who are also subscribed to TLS are stored (lines 24-28), plus the subscription to TLS (line 23). There’s also a small pause between feeds, just in case we’re hammering the API too hard.

At the end, each user-subscription pair is written to a CSV file, one pair per line. Total = 17 468 lines, some of which look like this:

neilfws,the-life-scientists
neilfws,aroy
neilfws,abhishektiwari
neilfws,adamk
neilfws,adrianh
neilfws,ajc
....

2. Using igraph in R
Igraph can be used in C, Python, Ruby or R – let’s go with R. Reading in the file and converting to an igraph object is straightforward. I’m assuming that the subscriptions graph is directed, in that you subscribing to me has direction you –> me and vice versa, me –> you.

library(igraph)
tls <- read.csv("network.csv", header = F)
g <- graph.data.frame(tls, directed = T)
ecount(g)
[1] 17468
vcount(g)
[1] 1129

Voilà – a graph with 1 129 vertices (nodes) and 17 468 edges (connections). What can we do with that? Well, there are a few simple summary metrics, such as graph diameter:

# network diameter
diameter(g)
[1] 7
# show the farthest nodes
farthest.nodes(g)
[1] 133 928   7

That tells us that the average minimum distance between a pair of vertices in the TLS network is 7 people – no surprise there.

However, the real meat of igraph comes with methods that describe edges, vertices, subgraphs, communities and so on. First example: the largest “clique”. A clique is a maximally-connected subgraph in which every vertex connects to every other vertex.

lc <- largest.cliques(g)
lc
[[1]]
 [1]   87  111  117  120  139  260  293  326  381  484  500  549  663  681  705
[16]  725  738  753  770  775  794  832  834  847  885  962 1073 1120 1128

I know that you want the identity of these individuals, so here’s a quick and dirty plot.

# create a new graph of the largest clique
V(g)$label <- V(g)$name
g.lc <- subgraph(g, lc[[1]])
png(filename = "lc.png")
plot(g.lc, layout=layout.fruchterman.reingold, vertex.color="gray60", vertex.size = 0, edge.arrow.size = 0.5, edge.color = "gray80")
dev.off()

The result is the left figure, below; click on the image for a larger version.
Adding properties to vertices and edges is easy, using the V or E functions, respectively. You can add whatever you like: for example, V(g)$foo <- bar, provided that bar is a vector of length equal to number of vertices. Here’s an example in which we try to find communities in the TLS network using a model from statistical mechanics called spin-glass [1, 2], label the vertices and change vertex colour and size:

sgc <- spinglass.community(g)
V(g)$membership <- sgc$membership
# found 4 communities 0, 1, 2, 3
V(g) [ membership == 0 ]$color <- "cyan"
V(g) [ membership == 1 ]$color <- "green"
V(g) [ membership == 2 ]$color <- "blue"
V(g) [ membership == 3 ]$color <- "red"
V(g)$size <- 4
V(g) [ name == "the-life-scientists" ]$size <- 20
> png(filename = "tls.png", height = 800, width = 800)
> plot(g, layout=layout.fruchterman.reingold, vertex.color=V(g)$color, vertex.size = V(g)$size, vertex.label = NA, edge.arrow.size = 0.5)
> dev.off()

Result at far-right.

tls

Community detection in the TLS network


lcg

TLS largest clique subgraph


The large blue circle in the centre is The Life Scientists group. One community, the blue circles around the edge, is composed of people who subscribe to the group but do not subscribe to many (or any) members of the group. As for the red, cyan and green communities – more investigation is required. I can only tell you that I belong to the “red” community, where I also see many names that I would recognise as (bio)informaticians, computational biologists and active researchers in biology.

I still have plenty to learn about both igraph and network analysis; hopefully, this is a useful initial exploration.

References
1. J. Reichardt and S. Bornholdt (2006)
Statistical Mechanics of Community Detection
Phys. Rev. E (74), 016110

2. M. E. J. Newman and M. Girvan (2004)
Finding and evaluating community structure in networks
Phys. Rev. E (69), 026113

9 thoughts on “Experiments with igraph

  1. Great post, thank you :)
    Did you consider also parsing the HTML with R?
    I know there are packages to allow that.

    Best,
    Tal

    • No – the FriendFeed data comes in JSON format from the API, so no need for nasty HTML scraping. R has a couple of packages for JSON (see previous post) but they’re not so easy to work with – using Ruby is much easier.

      • Thanks,

        So do you think there is a need for some RRuby package, to make the best of the two worlds ??

        Tal

  2. Glad you asked – there is an R wrapper for Ruby called RSRuby. It (usually) works quite well, but I’m happy to get data into CSV using one language, then switch to R for the rest.

  3. Funny you show the largest clique! My immediate response to check I am on it… sure I’m not, despite the large overlap. But now I have some new people I could follow… but wait… then I realized that the power of FriendFeed is that I *do not* have to subscribe to everyone… I select the unique people (sorry if you are not one of them :), and have them filter out the best news… via likes and comments all the good comes my way…

    So, what is, in your opinion, the importance of cliques on FF?

    It would perhaps be more interesting to discover the ‘hubs’, as they do in metabolomics… identify the FF accounts that link the most input with the most output… that is, count(follows(hub)) + count(followed_by(hub)) > threshold. These hubs may match the unique accounts…

    • Yes, hubs are interesting. And potentially, many other statistics – I’m still working through the igraph methods and this post is only an overview. A clique is simply a subset of highly-connected nodes; more than that is difficult to say.

  4. This is quite possibly the most straightforward tutorial for igraph I’ve found. Maybe it’s just because I’ve read (and been confused by) so many that I finally reached a critical mass for an epiphany, but I’ve gone from this page to results in about three minutes. Thank you.

    • You’re welcome – I’m glad it helped. I’m new to igraph myself, so hopefully I haven’t passed my mistakes on to you ;-)

Comments are closed.