Back in 2010, I wrote a web application called PMRetract to monitor retraction notices in the PubMed database. It was written primarily as a way for me to explore some technologies: the Ruby web framework Sinatra, MongoDB (hosted at MongoHQ, now Compose) and Heroku, where the app was hosted.
I automated the update process using Rake and the whole thing ran pretty smoothly, in a “set and forget” kind of way for four years or so. However, the first era of PMRetract is over. Heroku have shut down git pushes to their “Bamboo Stack” – which runs applications using Ruby version 1.8.7 – and will shut down the stack on June 16 2015. Currently, I don’t have the time either to update my code for a newer Ruby version or to figure out the (frankly, near-unintelligible) instructions for migration to the newer Cedar stack.
So I figured now was a good time to learn some new skills, deal with a few issues and relaunch PMRetract as something easier to maintain and more portable. Here it is. As all the code is “out there” for viewing, I’ll just add a few notes here regarding this latest incarnation.
Sometimes, several strands of thought come together in one place. For me right now, it’s the Wikipedia page “Ebola virus epidemic in West Africa”, which got me thinking about the perennial topic of “data wrangling”, how best to provide public data and why I can’t shake my irritation with the term “data science”. Not to mention Ebola, of course.
I imagine that a lot of people with an interest in biological data are following this story and thinking “how can I visualise the numbers for myself?” Maybe you’d like to reproduce the plots in the Timeline section of that Wikipedia entry. Surprise: the raw numbers are not that easy to obtain.
2014-09-26 note: when Wikipedia pages change, as this one has, code breaks, as this code has; updates maintained at GitHub
Web services are great. Pass them a URL. Structured data comes back. Parse it, analyse it, visualise it. Done.
Web scraping – interacting programmatically with a web page – is not so great. It requires more code and when the web page changes, the code breaks. However, in the absence of a web service, scraping is better than nothing. It can even be rather satisfying. Early in my bioinformatics career the realisation that code, rather than humans, can automate the process of submitting forms and reading the results was quite a revelation.
In this post: how to interact with a web page at the NCBI using the Mechanize library.
Read the rest…
Since 2005, I have started almost every working day by using one Web application – an application that occupies a permanent browser tab on my work and home desktop machines. That application is Google Reader.
If you’re reading this, you’re probably aware that Google Reader will cease to exist from July 1 2013. Others have ranted, railed against the corporate machine and expressed their sadness. I thought I’d try to explain why, for this working scientist at least, RSS and feed readers are incredibly useful tools which I think should be valued highly.
Read the rest…
Just a brief selection of items that caught my eye this week. Note that this is “a Friday” as opposed to “Friday”, lest you mistake this for a new, regular feature.
A new Bioconductor package which builds on the excellent ggplot graphics library, for the visualization of biological data.
- R development master class
Hadley Wickham recently presented this course on R package development for my organisation. I was on parental leave at the time, otherwise I would have attended for sure.
2. Bioinformatics in the media
DNA Sequencing Caught in Deluge of Data
I described this NYT article as a “surprisingly-good intro article”. Michael Eisen described it as “kind of silly”.
I think we’re both right. Michael’s perspective is that of an expert in high-throughput sequencing data; I’m just pleased to see an introduction to bioinformatics for non-specialists in a mainstream newspaper. And I note that they have corrected the figure caption which offended Michael.
As to the “deluge”: yes, there are other sciences that generate more data and yes, we probably don’t need to archive/analyse a lot of the raw data. However, I’d contend that the basic premise of the article is correct: we are sequencing faster than we can analyse. The solution, obviously, is more bioinformaticians.
Which topics are the most popular at the BioStar bioinformatics Q&A site?
One source of data is the tags used for questions. Tags are somewhat arbitrary of course, but fortunately BioStar has quite an active community, so “bad” tags are usually edited to improve them. Hint: if your question is “How to find SNPs”, then tagging it with “how, to, find, snps” won’t win you any admirers.
OK: we’re going to grab the tags then use a bunch of R packages (XML, wordcloud and ggplot2) to take a quick look.
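As a rough idea of the counting step, here is a minimal sketch in base R. The tag strings are made up for illustration (they are not real BioStar data), and the names `question_tags`, `tags` and `tag_counts` are hypothetical; fetching the real tags with the XML package is not shown.

```r
# Hypothetical tag strings, one per question (NOT real BioStar data)
question_tags <- c("snp genome",
                   "blast sequence",
                   "snp vcf",
                   "r bioconductor",
                   "sequence snp")

# Split each string on whitespace to get one tag per element
tags <- unlist(strsplit(question_tags, "\\s+"))

# Count occurrences of each tag and sort, most frequent first
tag_counts <- sort(table(tags), decreasing = TRUE)

# Show the most frequent tags
print(head(tag_counts, 3))
```

A table like `tag_counts` is the sort of input that could then be handed to a plotting function, for example as the words (`names(tag_counts)`) and frequencies (`as.vector(tag_counts)`) of a word cloud.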
Read the rest…
I wonder if part of the drop off is live bloggers moving to platforms like Twitter? I can tell you it seemed like there were almost as many tweets for one SIG (#bosc2011) as for the whole of #ISMB / #ECCB2011, and I personally didn’t post anything to FriendFeed but posted lots on Twitter.
Well, there’s a problem with using Twitter for analysis of conference coverage. Let’s try searching for ISMB-related tweets using the twitteR package:
library(twitteR)
ismb <- searchTwitter("ismb", n = 1000)
length(ismb)
# [1] 30
If we can't archive, how can anyone else?
30? Are we using twitteR properly? Running the same search at the Twitter website gives roughly the same results, plus this unhelpful message.
I like Twitter – as a real-time communication tool. As a data archive? Forget it.
A lot of questions at BioStar begin along these lines:
Where can I find…?
I am looking for a resource…?
Is there some database…?
I tweeted some concerns about this:
Many #biostar questions begin “I am looking for a resource..”. The answer is often that you need to code a solution using the data you have.
Chris tweeted back:
@neilfws Lit. or Google search is first step, asking around is the next logical step. (Re-)inventing wheels is last. Worth asking, IMHO.
We had a little chat and I realised that 140 characters or less was not getting my point across (not for the first time). What I was trying to say was something like this.
Read the rest…
I’ve been a strong proponent of FriendFeed since its launch. Its technology, clean interface and “data first, then conversations” approach have made it a highly-successful experiment in social networking for scientists (and other groups). So you may be surprised to hear that from today, I will no longer be importing items into FriendFeed, or participating in the conversations at other feeds.
Here’s a brief explanation and some thoughts on my online activity in the coming months.
Read the rest…
In part 1, I described some frustrations arising out of a work project, using the ArrayExpress API. I find that one way to deal mentally with these situations is to spend some time on a fun project, using similar programming techniques. A potential downside of this approach is that if your fun project goes bad, you’re really frustrated. That’s when it’s time to abandon the digital world, go outside and enjoy nature.
Here then, is why I decided to build another small project around FriendFeed, how its failure has led me to question the value of FriendFeed for the first time and why my time as a FriendFeed user might be up.
Read the rest…