Before we start: yes, we’ve been here before. There was the Biostars question “Calculating Time From Submission To Publication / Degree Of Burden In Submitting A Paper.” That gave rise to Pierre’s excellent blog post and code + data on Figshare.
So why are we here again?
1. It’s been a couple of years.
2. This is the R (+ Ruby) version.
3. It’s always worth highlighting how the poor state of publicly-available data prevents us from doing what we’d like to do. In this case the interesting question “which bioinformatics journal should I submit to for rapid publication?” becomes “here’s an incomplete analysis using questionable data regarding publication dates.”
Let’s get it out of the way then.
File this one under “has troubled me (and others) for some years now, let’s try to resolve it.”
Let’s use the excellent R/rentrez package to search PubMed for articles that were retracted in 2013.
library(rentrez)
es <- entrez_search("pubmed", "\"Retracted Publication\"[PTYP] 2013[PDAT]", usehistory = "y")
es$count
#  117
117 articles. Now let’s fetch the records in XML format.
xml <- entrez_fetch("pubmed", WebEnv = es$WebEnv, query_key = es$QueryKey,
                    rettype = "xml", retmax = es$count)
Next question: which XML element specifies the “Date of publication” (PDAT)?
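One way to explore that question is to list every date-like element in a record and compare them side by side. What follows is a minimal sketch, not the analysis from this post: it assumes the xml2 package is installed, and the record is a hand-written fragment shaped like PubMed XML, not data returned by the search. The helper `collapse_date` is a made-up name for illustration.

```r
library(xml2)

# Illustrative record only: hand-written, shaped like PubMed XML.
record <- '<PubmedArticle>
  <MedlineCitation>
    <PMID>0</PMID>
    <DateCompleted><Year>2013</Year><Month>06</Month><Day>01</Day></DateCompleted>
    <Article>
      <Journal>
        <JournalIssue>
          <PubDate><Year>2013</Year><Month>May</Month></PubDate>
        </JournalIssue>
      </Journal>
      <ArticleDate DateType="Electronic"><Year>2013</Year><Month>04</Month><Day>15</Day></ArticleDate>
    </Article>
  </MedlineCitation>
  <PubmedData>
    <History>
      <PubMedPubDate PubStatus="received"><Year>2012</Year><Month>11</Month><Day>02</Day></PubMedPubDate>
      <PubMedPubDate PubStatus="pubmed"><Year>2013</Year><Month>4</Month><Day>16</Day></PubMedPubDate>
    </History>
  </PubmedData>
</PubmedArticle>'

doc <- read_xml(record)

# Treat any element with a <Year> child as a candidate date element.
date_nodes <- xml_find_all(doc, "//*[Year]")

# Join the child values (Year, Month, Day) of one date element with "-".
collapse_date <- function(node) {
  paste(xml_text(xml_children(node)), collapse = "-")
}

dates <- data.frame(
  element = xml_name(date_nodes),
  status  = xml_attr(date_nodes, "PubStatus"),
  value   = vapply(seq_along(date_nodes),
                   function(i) collapse_date(date_nodes[[i]]),
                   character(1)),
  stringsAsFactors = FALSE
)
print(dates)
```

The same XPath could be pointed at the fetched records by passing the string returned by entrez_fetch() to read_xml(). Which of these elements PubMed actually uses for PDAT is exactly the open question, so the table is a starting point for comparison, not an answer.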
Sometimes, several strands of thought come together in one place. For me right now, it’s the Wikipedia page “Ebola virus epidemic in West Africa”, which got me thinking about the perennial topic of “data wrangling”, how best to provide public data and why I can’t shake my irritation with the term “data science”. Not to mention Ebola, of course.
I imagine that a lot of people with an interest in biological data are following this story and thinking “how can I visualise the numbers for myself?” Maybe you’d like to reproduce the plots in the Timeline section of that Wikipedia entry. Surprise: the raw numbers are not that easy to obtain.
2014-09-26 note: when Wikipedia pages change, as this one has, code breaks, as this code has; updates are maintained at GitHub.
6-way Venn banana
I thought nothing could top the classic “6-way Venn banana”, featured in The banana (Musa acuminata) genome and the evolution of monocotyledonous plants.
That is until I saw Figure 3 from Compact genome of the Antarctic midge is likely an adaptation to an extreme environment.
5-way Venn roadkill
What’s odd is that Figure 2 in the latter paper is a nice, clear R/ggplot2 creation, using facet_grid(), so someone knew what they were doing.
That aside, the Antarctic midge paper is an interesting read; go check it out.
This led to some amusing Twitter discussion which pointed me to *A New Rose : The First Simple Symmetric 11-Venn Diagram.
[*] +1 for referencing The Damned, if indeed that was the intention.
Let’s start by making one thing clear. Using coloured cells in Excel to encode different categories of data is wrong. Next time colleagues explain excitedly how “green equals normal and red = tumour”, you must explain that (1) they have sinned and (2) what they meant to do was add a column containing the words “normal” and “tumour”.
I almost hesitate to write this post but…we have to deal with the world as it is, not as we would like it to be. So in the interests of just getting the job done: here’s one way to deal with coloured cells in Excel, should someone send them your way.
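Whatever tool pulls the fill colours out of the spreadsheet, the repair amounts to the same thing: turn each colour into the word it was standing in for. Here is a minimal base R sketch of that last step only. The data frame, the hex codes and the legend are all invented for illustration, and it assumes the colours have already been extracted as hex RGB strings by some Excel-reading package.

```r
# Hypothetical input: one fill colour per row, as hex RGB strings.
# NA means the cell had no fill.
samples <- data.frame(
  id   = c("S1", "S2", "S3", "S4"),
  fill = c("00FF00", "FF0000", NA, "FF0000"),
  stringsAsFactors = FALSE
)

# The legend that normally lives only in a colleague's head.
legend <- c("00FF00" = "normal", "FF0000" = "tumour")

# Look each colour up in the legend; missing or unknown fills become NA.
samples$status <- unname(legend[samples$fill])

print(samples)
```

Keeping the legend as a named vector makes the encoding explicit, and any colour that is not in it shows up as NA instead of being silently guessed.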
I’ve long admired the work of the Open Source Malaria Project. Unfortunately time and “day job” constraints prevent me from being as involved as I’d like.
So: I was happy to make a small contribution recently in response to this request for help:
There’s a lot of discussion around why code written by self-taught “scientist programmers” rarely follows what a trained computer scientist would consider “best practice”. Here’s a recent post on the topic.
One answer: we begin with exploratory data analysis and never get around to cleaning it up.
An example. For some reason, a researcher (let’s call him “Bob”) becomes interested in a particular dataset in the GEO database. So Bob opens the R console and uses the GEOquery package to grab the data:
Update: those of you commenting “should have used Python instead” have completely missed the point. Your comments are off-topic and will not be published. Doubly-so when you get snarky about it.
One of my more popular posts is A brief introduction to “apply” in R. Come August, it will be four years old. Technology moves on, old blog posts do not.
So: thanks to BioStar user zx8754 for pointing me to this Stack Overflow post, in which someone complains that the code in the post does not work as described. The by example is now fixed.
Side note: I often find “contact the author” is the most direct approach to solving this kind of problem ;) always happy to be contacted.
On a rare, brief holiday (here and here, if you’re interested; both highly recommended), I make the mistake of checking my Twitter feed:
This points me to BoxPlotR. It draws box plots. Using Shiny Server. That’s the “innovation”, presumably.
With “quilt plots” and now this, I’m starting to think that I’ve been doing science wrong all these years. If I’d been told to submit the trivial computational work I do every single day to journals, I could have thousands of publications by now.
I’m still pretty relaxed post-holiday, so let’s just leave it there.
I enjoyed this story from the OpenHelix blog today, describing a Microsoft Research project to mine DNA sequences from web pages and map them to UCSC genome builds.
Laura DeMare asks: what was the most-hit gene?