APIs have let me down part 1/2: ArrayExpress

The API – Application Programming Interface – is, in principle, a wonderful thing. You make a request to a server using a URL and back come lovely, structured data, ready to parse and analyse. We’ve begun to demand that all online data sources offer an API and lament the fact that so few online biological databases do so.

Better though, to have no API at all than one which is poorly implemented and leads to frustration? I’m beginning to think so, after recent experiences on both a work project and one of my “fun side projects”. Let’s start with the work project, an attempt to mine a subset of the ArrayExpress microarray database.
Read the rest…

Does your LinkedIn Map say anything useful?

LinkedIn, the “professional” career-oriented social network, is one of those places on the Web where I maintain a profile for visibility. I’m yet to gain any practical value whatsoever from it. That said, I know plenty of people who do find it useful – mostly, it seems, those living near the north-east or west coast of the USA.


My LinkedIn Network

LinkedIn have something of a reputation for innovation – see LinkedIn Labs, their small demonstration products, for example. The latest of these is named InMaps. It’s been popping up on blogs and Twitter for several days. Essentially, it creates a graph of your LinkedIn network, applies some community detection algorithm to cluster the members and displays the results as a pretty, interactive graphic that you can share.

What seems to have captured the imagination is that the graphs indicate communities that are instantly recognisable to the user. There’s mine on the right (click for full-size version). It’s not a large, complex or especially interesting network but when I “eyeballed” it, I was immediately able to classify the three sub-graphs:

  • Orange – mostly people with whom I have worked or currently work, plus a few “random” contacts: note that this group is hardly interconnected at all
  • Green – people who work in bioinformatics or computational biology, particularly genomics: two major hubs connect me with this group
  • Blue – the largest, densest network is composed largely of what I’d call the “BioGang”: people that I interact with on Twitter and FriendFeed, many of whom I haven’t met in person

This confirms what I’ve long suspected: I prefer to network with smart strangers than my immediate peers and colleagues. Or as Bill Joy said, “no matter who you are, most of the smartest people work for someone else.” I’ve seen this misquoted as “where you are”, which makes more sense to me.

Monitoring PubMed retractions: a Heroku-hosted Sinatra application

In a previous post analysing retractions from PubMed, I wrote:

It strikes me that it would be relatively easy to build a web application (Rails, Heroku), which constantly monitors retraction data at PubMed and generates a variety of statistics and charts.

“Relatively easy” it was. Let me introduce you to PMRetract, my first publicly-available web application.
Read the rest…

Dropbox tip continued: convert a file tree to HTML

A couple of posts ago, I outlined a small bash script to generate an index.html file, containing links to other files in a directory. This was for generating links to files in a Dropbox public directory.

I had completely forgotten about the very useful UNIX/Linux command named tree. If not installed, it should be in your distribution repository (e.g. sudo apt-get install tree for Ubuntu/Debian). Then simply:

cd Dropbox/Public/mydirectory
tree -H . > index.html
Next, navigate to index.html at the Dropbox website and you should see something like the tree on the right. It’s a little ugly and obviously, not as convenient as something like Github, but can be a good quick and dirty fix if you need to share a hierarchy of directories and files.

A quick Bash tip: add an index.html file to a Dropbox public folder

You know that Dropbox is terrific, of course. No? Go and check it out now.

One issue: files in your Public folder have a public URL, that you can send to other people. Unfortunately, directories do not. So how do you share a public directory full of files?

Answer: create an index.html file and share that. Let’s say that your files end in “.txt” and reside in ~/Dropbox/Public/entrez. Do this:

cd ~/Dropbox/Public/entrez
echo "<ol>" > index.html
for i in `ls *.txt`; do echo "<li><a href='$i'>$i</a></li>" >> index.html; done
echo "</ol>" >> index.html

Now you can share the link to the index.html, which when clicked will display a list of links to all the other files in the directory.

Nodalpoint: now in glorious stereophonic audio

Nodalpoint Conversations is, in Greg’s words, “Nodalpoint rebooted as a podcast”. Long-time readers will remember Nodalpoint, a bioinformatics community where many of us first came together.

In the “pilot” episode, Greg and I chat (via Skype between Leiden and Sydney) about all things bioinformatics. It was a new experience for me and one which I greatly enjoyed, despite being somewhat unsettled by the sound of my own recorded voice.

Check out Episode 1 of Nodalpoint Conversations. It’s as close to proof of my existence as you’ll get. Then follow nodalconv on Twitter.

GEO database: curation lagging behind submission?


GSE and GDS records in GEOmetadb by date

I was reading an old post that describes GEOmetadb, a downloadable database containing metadata from the GEO database. We had a brief discussion in the comments about the growth in GSE records (user-submitted) versus GDS records (curated datasets) over time. Below, some quick and dirty R code to examine the issue, using the Bioconductor GEOmetadb package and ggplot2. Left, the resulting image – click for larger version.

Is the curation effort keeping up with user submissions? A little difficult to say, since GEOmetadb curation seems to have its own issues: (1) why do GDS records stop in 2008? (2) why do GDS (curated) records begin earlier than GSE (submitted) records?


# update database if required using getSQLiteFile()
# connect to database; assumed to be in user $HOME
con <- dbConnect(SQLite(), "~/GEOmetadb.sqlite")

# fetch "last updated" dates for GDS and GSE
gds <- dbGetQuery(con, "select update_date from gds")
gse <- dbGetQuery(con, "select last_update_date from gse")

# cumulative sums by date; no factor variables
gds.count <- as.data.frame(cumsum(table(gds)), stringsAsFactors = F)
gse.count <- as.data.frame(cumsum(table(gse)), stringsAsFactors = F)

# make GDS and GSE data frames comparable
colnames(gds.count) <- "count"
colnames(gse.count) <- "count"

# row names (dates) to real dates
gds.count$date <- as.POSIXct(rownames(gds.count))
gse.count$date <- as.POSIXct(rownames(gse.count))

# add type for plotting
gds.count$type <- "gds"
gse.count$type <- "gse"

# combine GDS and GSE data frames
gds.gse <- rbind(gds.count, gse.count)

# and plot records over time by type
png(filename = "geometadb.png", width = 800, height = 600)
print(ggplot(gds.gse, aes(date,count)) + geom_line(aes(color = type)))