Last week, Mick Watson posted a terrific article on using R to recreate the visualizations in this WSJ article on the impact of vaccination. Someone beat me to the obvious joke.
Someone also beat me to the standard response whenever base R graphics are used.
And despite devoting much of Friday morning to it, I was beaten to publication of a version using ggplot2.
Why then would I even bother to write this post? Well, because I did things a little differently; diversity of opinion and illustration of alternative approaches are good. And because on the internet, it’s quite acceptable to appropriate great ideas from other people when you lack any inspiration yourself. And because I devoted much of Friday morning to it.
Here then is my “exploration of what Mick did already, only using ggplot2 like Ben did already.”
I’ve been meaning to write about Entrez Direct, henceforth called edirect, for some time. This tweet provided me with an excuse:
This post is not strictly the answer to that question. Instead we’ll ask: which parent IDs of records for insects in the NCBI Taxonomy database have the most species IDs?
I was asked recently to look at some R code which performs “embarrassingly parallel” computations (the same function, multiple times, different parameters) and see whether I could modify it to run on one of our high-performance computing clusters. The machine has 63 virtual compute nodes and uses the TORQUE batch queue system to allocate nodes to compute jobs.
First stop: the CRAN Task View High-Performance and Parallel Computing with R. Two promising packages there: BatchJobs and BatchExperiments. Their documentation is quite extensive with useful examples, but I found it a little disjointed and confusing. What I wanted was a simple, step-by-step guide to setting up for a first-time user. So here is my attempt. As always, it’s for “Linux-like” systems.
I guess I’ve been around bioinformatics for the best part of 15 years. In that time, I’ve seen almost no improvement in the way biologists handle and use data. If anything I’ve seen a decline, perhaps because the data have become larger and more complex with no improvement in the skills base.
It strikes me when I read questions at Biostars that the problem faced by many students and researchers is deeper than “not knowing what to do.” It’s having no idea how to figure out what they need to know in order to do what they want to do. In essence, this is about how to get people into a problem-solving mindset so that they’re aware, for example, that:
- it’s extremely unlikely that you are the first person to encounter this problem
- it’s likely that the solution is documented somewhere
- effective search will lead you to a solution even if you don’t fully understand it at first
- the tool(s) that you know are not necessarily the right ones for the job (and Excel is never the right tool for the job)
- implementing the solution may require that you (shudder) learn new skills
- time spent on those skills now is almost certainly time saved later because…
- …with a very little self-education in programming, tasks that took hours or days can be automated and take seconds or minutes
It’s good (and bad) to know that these issues are not confined to Australian researchers: here is It’s time to reboot bioinformatics education by Todd Harris. It is excellent and you should go and read it as soon as possible.
Just a quick update to the previous post. At the helpful suggestion of Steve Royle, I’ve added a new section to the report which attempts to normalise retractions by journal. So for example, J. Biol. Chem. has (as of now) 94 retracted articles and in total 170 842 publications indexed in PubMed. That becomes (100 000 / 170 842) * 94 = 55.022 retractions per 100 000 articles.
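For anyone who wants to check the arithmetic, here is a minimal sketch of that normalisation in Python. It is not the code behind the report; the function name `retractions_per_100k` is my own invention for illustration, and the only inputs are the two J. Biol. Chem. counts quoted above.

```python
def retractions_per_100k(retracted: int, total: int) -> float:
    """Scale a journal's retraction count to a rate per 100 000 publications.

    Hypothetical helper for illustration only, not taken from the report.
    """
    if total <= 0:
        raise ValueError("total publications must be positive")
    return 100_000 / total * retracted


# J. Biol. Chem. figures quoted in the text: 94 retracted, 170 842 indexed
rate = retractions_per_100k(94, 170_842)
print(f"{rate:.3f} retractions per 100 000 articles")
```

The point of scaling is simply that a journal with a very large output can top a raw-count list while having a modest rate, and vice versa.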
Top 20 journals, retracted articles per 100 000 publications
This leads to some startling changes to the journals “top 20” list. If you’re wondering what’s going on in the world of anaesthesiology, look no further (thanks again to Steve for the reminder).
Back in 2010, I wrote a web application called PMRetract to monitor retraction notices in the PubMed database. It was written primarily as a way for me to explore some technologies: the Ruby web framework Sinatra, MongoDB (hosted at MongoHQ, now Compose) and Heroku, where the app was hosted.
I automated the update process using Rake and the whole thing ran pretty smoothly, in a “set and forget” kind of way for four years or so. However, the first era of PMRetract is over. Heroku have shut down git pushes to their “Bamboo Stack” – which runs applications using Ruby version 1.8.7 – and will shut down the stack on June 16 2015. Currently, I don’t have the time either to update my code for a newer Ruby version or to figure out the (frankly, near-unintelligible) instructions for migration to the newer Cedar stack.
So I figured now was a good time to learn some new skills, deal with a few issues and relaunch PMRetract as something easier to maintain and more portable. Here it is. As all the code is “out there” for viewing, I’ll just add a few notes here regarding this latest incarnation.
I am forever returning to PubMed data, downloaded as XML, trying to extract information from it and becoming deeply confused in the process.
Take the seemingly-simple question “how many retracted articles are there in PubMed?”
The blog post in question concerns conversion of PubMed PMIDs to BibTeX citations. However, a few things have changed since 2010.
Here’s what currently works.
PeerJ, like PLoS ONE, aims to publish work on the basis of “soundness” (scientific and methodological) as opposed to subjective notions of impact, interest or significance. I’d argue that effective, appropriate data visualisation is a good measure of methodology. I’d also argue that on that basis, Evolution of a research field – a micro (RNA) example fails the soundness test.
There was a time, around 2009 or so, when almost every post at this blog was tagged “friendfeed”. So with the announcement (which frankly I expected 5 years ago) that it is to be shut down, I guess a few words are in order.
I’m thankful to FriendFeed for facilitating many of my current online friendships. It was uniquely successful in creating communities composed of people with an interest in how to do science online, not just talk about (i.e. communicate) science online. It was justly famous for bringing together research scientists with other communities: librarians in particular, people from the “tech world”, patient advocates, educators – all under the umbrella of a common interest in “open science”. We even got a publication or two out of it.
To this day I am not sure why it worked so well. One key feature was that it allowed people to coalesce around pieces of information. In contrast to other networks it was the information, presented via a sparse, functional interface, that initially brought people together, as opposed to the user profile. There was probably also a strong element of “right people in the right place at the right time.”
It’s touching that people are name-checking me on Twitter regarding the news of the shutdown, given that no trace of my FriendFeed activity remains online. Realising that my activity was getting more and more difficult to retrieve for archiving and that bugs were never going to be fixed, I opted several years ago to delete my account. The loss of my content pains me to this day, but inaccurate public representation of my activities due to poor technical implementation pains me more.
I’ve seen a few reactions along the lines of “what is all the fuss about?” How short our collective memory is. To those people: look at Facebook, Yammer or even Twitter and ask yourself where the idea of a stream of items with associated discussion came from.
Farewell then FriendFeed, pioneer tool of the online open science community. We never did find a tool quite as good as you.