Sometimes, several strands of thought come together in one place. For me right now, it’s the Wikipedia page “Ebola virus epidemic in West Africa”, which got me thinking about the perennial topic of “data wrangling”, how best to provide public data and why I can’t shake my irritation with the term “data science”. Not to mention Ebola, of course.
I imagine that a lot of people with an interest in biological data are following this story and thinking “how can I visualise the numbers for myself?” Maybe you’d like to reproduce the plots in the Timeline section of that Wikipedia entry. Surprise: the raw numbers are not that easy to obtain.
Web services are great. Pass them a URL. Structured data comes back. Parse it, analyse it, visualise it. Done.
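As a rough illustration of that "URL in, structured data out" pattern, here is a minimal Python sketch. Everything specific in it is an assumption made for the example: the endpoint `https://example.org/api/cases`, the `country` and `format` parameters, the JSON field names and the numbers are all invented, and the response is a canned string rather than a real network call.

```python
import json
from urllib.parse import urlencode


def build_request_url(base_url, **params):
    """Return base_url with params appended as a (sorted) query string."""
    return f"{base_url}?{urlencode(sorted(params.items()))}"


def parse_cases(payload):
    """Turn a JSON array of records into a list of (date, cases) tuples."""
    records = json.loads(payload)
    return [(record["date"], int(record["cases"])) for record in records]


# Hypothetical endpoint and parameters, for illustration only.
url = build_request_url("https://example.org/api/cases",
                        country="Guinea", format="json")

# Canned response with made-up numbers, standing in for what a service
# might send back; a real script would fetch `url` instead.
sample_payload = """
[{"date": "2014-01-01", "cases": "10"},
 {"date": "2014-01-08", "cases": "25"}]
"""

cases = parse_cases(sample_payload)
for date, count in cases:
    print(date, count)
```

With a real service, the canned string would be replaced by the body of an HTTP response (for example via `urllib.request.urlopen`); the parsing step stays the same.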
Web scraping – interacting programmatically with a web page – is not so great. It requires more code and when the web page changes, the code breaks. However, in the absence of a web service, scraping is better than nothing. It can even be rather satisfying. Early in my bioinformatics career the realisation that code, rather than humans, can automate the process of submitting forms and reading the results was quite a revelation.
In this post: how to interact with a web page at the NCBI using the Mechanize library.
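To give a flavour of what scraping involves, here is a small sketch that pulls the rows out of an HTML table. Note that it is not Mechanize and does not submit any forms: it uses only `html.parser` from the Python standard library, and the class name `TableScraper`, the helper `scrape_table` and the sample table (with invented values) are assumptions made for the example.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return the table rows found in html as lists of cell strings."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up page fragment; a real page would be fetched over HTTP first.
sample_html = """
<table>
  <tr><th>Date</th><th>Cases</th></tr>
  <tr><td>day 1</td><td>10</td></tr>
  <tr><td>day 2</td><td>25</td></tr>
</table>
"""

rows = scrape_table(sample_html)
for row in rows:
    print(row)
```

The fragility mentioned above is easy to see here: the code assumes the data lives in `tr`/`td` elements, so any redesign that changes the markup breaks it.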
Since 2005, I have started almost every working day by using one Web application – an application that occupies a permanent browser tab on my work and home desktop machines. That application is Google Reader.
If you’re reading this, you’re probably aware that Google Reader will cease to exist from July 1 2013. Others have ranted, railed against the corporate machine and expressed their sadness. I thought I’d try to explain why, for this working scientist at least, RSS and feed readers are incredibly useful tools which I think should be valued highly.
I think we’re both right. Michael’s perspective is that of an expert in high-throughput sequencing data; I’m just pleased to see an introduction to bioinformatics for non-specialists in a mainstream newspaper. And I note that they have corrected the figure caption which offended Michael.
As to the “deluge”: yes, there are other sciences that generate more data and yes, we probably don’t need to archive/analyse a lot of the raw data. However, I’d contend that the basic premise of the article is correct: we are sequencing faster than we can analyse. The solution, obviously, is more bioinformaticians.
Which topics are the most popular at the BioStar bioinformatics Q&A site?
One source of data is the tags used for questions. Tags are somewhat arbitrary of course, but fortunately BioStar has quite an active community, so “bad” tags are usually edited to improve them. Hint: if your question is “How to find SNPs”, then tagging it with “how, to, find, snps” won’t win you any admirers.
OK: we’re going to grab the tags then use a bunch of R packages (XML, wordcloud and ggplot2) to take a quick look.
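The counting step behind a word cloud or bar chart is simple enough to sketch. The post itself uses R; the snippet below is a Python stand-in that covers only the tallying, not the download or the plotting. The function name `count_tags` and the sample tags are invented for illustration and are not real BioStar data.

```python
from collections import Counter


def count_tags(questions):
    """Count tag usage across questions, ignoring case and stray spaces."""
    counts = Counter()
    for tags in questions:
        counts.update(tag.strip().lower() for tag in tags if tag.strip())
    return counts


# Made-up tag lists, one list per question.
sample_questions = [
    ["snp", "vcf"],
    ["blast", "SNP"],
    ["rna-seq"],
    ["blast", "snp "],
]

tag_counts = count_tags(sample_questions)
top = tag_counts.most_common(2)
print(top)
```

Normalising case and whitespace before counting matters for exactly the reason given above: user-supplied tags are messy, and "SNP" and "snp" should land in the same bucket.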
I wonder if part of the drop off is live bloggers moving to platforms like Twitter? I can tell you it seemed like there were almost as many tweets for one SIG (#bosc2011) as for the whole of #ISMB / #ECCB2011, and I personally didn’t post anything to FriendFeed but posted lots on Twitter.
Well, there’s a problem with using Twitter for analysis of conference coverage. Let’s try searching for ISMB-related tweets using the twitteR package:
I’ve been a strong proponent of FriendFeed since its launch. Its technology, clean interface and “data first, then conversations” approach have made it a highly successful experiment in social networking for scientists (and other groups). So you may be surprised to hear that from today, I will no longer be importing items into FriendFeed, or participating in the conversations at other feeds.
Here’s a brief explanation and some thoughts on my online activity in the coming months.
In part 1, I described some frustrations arising out of a work project, using the ArrayExpress API. I find that one way to deal mentally with these situations is to spend some time on a fun project, using similar programming techniques. A potential downside of this approach is that if your fun project goes bad, you’re really frustrated. That’s when it’s time to abandon the digital world, go outside and enjoy nature.
Here, then, is why I decided to build another small project around FriendFeed, how its failure has led me to question the value of FriendFeed for the first time, and why my time as a FriendFeed user might be up.
The API – Application Programming Interface – is, in principle, a wonderful thing. You make a request to a server using a URL and back come lovely, structured data, ready to parse and analyse. We’ve begun to demand that all online data sources offer an API and lament the fact that so few online biological databases do so.
Is it better, though, to have no API at all than one which is poorly implemented and leads to frustration? I’m beginning to think so, after recent experiences on both a work project and one of my “fun side projects”. Let’s start with the work project, an attempt to mine a subset of the ArrayExpress microarray database.