Category Archives: web resources

Web scraping using Mechanize: PMID to PMCID/NIHMSID

Web services are great. Pass them a URL. Structured data comes back. Parse it, analyse it, visualise it. Done.

Web scraping – interacting programmatically with a web page – is not so great. It requires more code and when the web page changes, the code breaks. However, in the absence of a web service, scraping is better than nothing. It can even be rather satisfying. Early in my bioinformatics career the realisation that code, rather than humans, can automate the process of submitting forms and reading the results was quite a revelation.

In this post: how to interact with a web page at the NCBI using the Mechanize library.

Read the rest…

The end of Google Reader: a scientist’s perspective

Since 2005, I have started almost every working day by using one Web application – an application that occupies a permanent browser tab on my work and home desktop machines. That application is Google Reader.

If you’re reading this, you’re probably aware that Google Reader will cease to exist from July 1 2013. Others have ranted, railed against the corporate machine and expressed their sadness. I thought I’d try to explain why, for this working scientist at least, RSS and feed readers are incredibly useful tools which I think should be valued highly.

Read the rest…

A Friday round-up

Just a brief selection of items that caught my eye this week. Note that this is "a" Friday round-up as opposed to "the" Friday round-up, lest you mistake this for a new, regular feature.

1. R/statistics

  • ggbio: a new Bioconductor package which builds on the excellent ggplot2 graphics library, for the visualization of biological data.

  • R development master class: Hadley Wickham recently presented this course on R package development for my organisation. I was on parental leave at the time, otherwise I would have attended for sure.

2. Bioinformatics in the media

  • DNA Sequencing Caught in Deluge of Data

I described this NYT article as a “surprisingly-good intro article”. Michael Eisen described it as “kind of silly”.

I think we’re both right. Michael’s perspective is that of an expert in high-throughput sequencing data; I’m just pleased to see an introduction to bioinformatics for non-specialists in a mainstream newspaper. And I note that they have corrected the figure caption which offended Michael.

As to the “deluge”: yes, there are other sciences that generate more data and yes, we probably don’t need to archive/analyse a lot of the raw data. However, I’d contend that the basic premise of the article is correct: we are sequencing faster than we can analyse. The solution, obviously, is more bioinformaticians.

Popular topics at the BioStar Q&A site

Which topics are the most popular at the BioStar bioinformatics Q&A site?

One source of data is the tags used for questions. Tags are somewhat arbitrary of course, but fortunately BioStar has quite an active community, so “bad” tags are usually edited to improve them. Hint: if your question is “How to find SNPs”, then tagging it with “how, to, find, snps” won’t win you any admirers.

OK: we’re going to grab the tags then use a bunch of R packages (XML, wordcloud and ggplot2) to take a quick look.
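As a rough idea of the counting step, here is a minimal sketch in base R. It assumes the tags have already been retrieved as one space-separated string per question; the `question_tags` values are made up for illustration and are not BioStar data.

```r
# Hypothetical tag strings, one per question (invented for illustration)
question_tags <- c(
  "snp genotyping",
  "blast sequence",
  "snp vcf",
  "r bioconductor",
  "blast alignment sequence",
  "snp r"
)

# Split each string on whitespace and flatten into a single vector of tags
tags <- unlist(strsplit(question_tags, "\\s+"))

# Count each tag and sort from most to least frequent
tag_counts <- sort(table(tags), decreasing = TRUE)

# Store the counts in a data frame, a convenient shape for plotting
tag_df <- data.frame(
  tag = names(tag_counts),
  count = as.integer(tag_counts),
  stringsAsFactors = FALSE
)

print(head(tag_df, 3))
```

A data frame like `tag_df` could then be handed to a plotting package for a bar chart or word cloud.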

Read the rest…

ISMB coverage on Twitter? It’s possible there was…

Peter writes:

I wonder if part of the drop off is live bloggers moving to platforms like Twitter? I can tell you it seemed like there were almost as many tweets for one SIG (#bosc2011) as for the whole of #ISMB / #ECCB2011, and I personally didn’t post anything to FriendFeed but posted lots on Twitter.

Well, there’s a problem with using Twitter for analysis of conference coverage. Let’s try searching for ISMB-related tweets using the twitteR package:

library(twitteR)
# Ask for up to 1000 tweets matching "ismb"
ismb <- searchTwitter("ismb", n = 1000)
# How many tweets actually came back?
length(ismb)
# [1] 30

30? Are we using twitteR properly? Running the same search at the Twitter website gives roughly the same results, plus the unhelpful message in the screenshot below.

[Screenshot (oldertweets) of the Twitter message. Caption: If we can't archive, how can anyone else?]

I like Twitter – as a real-time communication tool. As a data archive? Forget it.

Real bioinformaticians write code

A lot of questions at BioStar begin along these lines:

Where can I find…?
I am looking for a resource…?
Is there some database…?

I tweeted some concerns about this:

Many #biostar questions begin “I am looking for a resource..”. The answer is often that you need to code a solution using the data you have.

Chris tweeted back:

@neilfws Lit. or Google search is first step, asking around is the next logical step. (Re-)inventing wheels is last. Worth asking, IMHO.

We had a little chat and I realised that 140 characters or less was not getting my point across (not for the first time). What I was trying to say was something like this.
Read the rest…

Farewell FriendFeed. It’s been fun.

I’ve been a strong proponent of FriendFeed since its launch. Its technology, clean interface and “data first, then conversations” approach have made it a highly-successful experiment in social networking for scientists (and other groups). So you may be surprised to hear that from today, I will no longer be importing items into FriendFeed, or participating in the conversations at other feeds.

Here’s a brief explanation and some thoughts on my online activity in the coming months.
Read the rest…

APIs have let me down part 2/2: FriendFeed

In part 1, I described some frustrations arising out of a work project, using the ArrayExpress API. I find that one way to deal mentally with these situations is to spend some time on a fun project, using similar programming techniques. A potential downside of this approach is that if your fun project goes bad, you’re really frustrated. That’s when it’s time to abandon the digital world, go outside and enjoy nature.

Here then, is why I decided to build another small project around FriendFeed, how its failure has led me to question the value of FriendFeed for the first time and why my time as a FriendFeed user might be up.
Read the rest…

APIs have let me down part 1/2: ArrayExpress

The API – Application Programming Interface – is, in principle, a wonderful thing. You make a request to a server using a URL and back come lovely, structured data, ready to parse and analyse. We’ve begun to demand that all online data sources offer an API and lament the fact that so few online biological databases do so.

Better though, to have no API at all than one which is poorly implemented and leads to frustration? I’m beginning to think so, after recent experiences on both a work project and one of my “fun side projects”. Let’s start with the work project, an attempt to mine a subset of the ArrayExpress microarray database.
Read the rest…

Does your LinkedIn Map say anything useful?

LinkedIn, the “professional” career-oriented social network, is one of those places on the Web where I maintain a profile for visibility. I’m yet to gain any practical value whatsoever from it. That said, I know plenty of people who do find it useful – mostly, it seems, those living near the north-east or west coast of the USA.

[Figure: My LinkedIn Network]

LinkedIn have something of a reputation for innovation – see LinkedIn Labs, their small demonstration products, for example. The latest of these is named InMaps. It’s been popping up on blogs and Twitter for several days. Essentially, it creates a graph of your LinkedIn network, applies some community detection algorithm to cluster the members and displays the results as a pretty, interactive graphic that you can share.

What seems to have captured the imagination is that the graphs indicate communities that are instantly recognisable to the user. Mine is shown above. It’s not a large, complex or especially interesting network but when I “eyeballed” it, I was immediately able to classify the three sub-graphs:

  • Orange – mostly people with whom I have worked or currently work, plus a few “random” contacts: note that this group is hardly interconnected at all
  • Green – people who work in bioinformatics or computational biology, particularly genomics: two major hubs connect me with this group
  • Blue – the largest, densest network is composed largely of what I’d call the “BioGang”: people that I interact with on Twitter and FriendFeed, many of whom I haven’t met in person

This confirms what I’ve long suspected: I prefer to network with smart strangers rather than with my immediate peers and colleagues. Or as Bill Joy said, “no matter who you are, most of the smartest people work for someone else.” I’ve seen this misquoted as “where you are”, which makes more sense to me.