Category Archives: R

PubMed retraction reporting update

Just a quick update to the previous post. At the helpful suggestion of Steve Royle, I’ve added a new section to the report which attempts to normalise retractions by journal. So for example, J. Biol. Chem. has (as of now) 94 retracted articles and in total 170 842 publications indexed in PubMed. That becomes (100 000 / 170 842) * 94 = 55.022 retractions per 100 000 articles.

Top 20 journals, retracted articles per 100 000 publications

Top 20 journals, retracted articles per 100 000 publications

This leads to some startling changes to the journals “top 20″ list. If you’re wondering what’s going on in the world of anaesthesiology, look no further (thanks again to Steve for the reminder).

PMRetract: PubMed retraction reporting rewritten as an interactive RMarkdown document

Back in 2010, I wrote a web application called PMRetract to monitor retraction notices in the PubMed database. It was written primarily as a way for me to explore some technologies: the Ruby web framework Sinatra, MongoDB (hosted at MongoHQ, now Compose) and Heroku, where the app was hosted.

I automated the update process using Rake and the whole thing ran pretty smoothly, in a “set and forget” kind of way for four years or so. However, the first era of PMRetract is over. Heroku have shut down git pushes to their “Bamboo Stack” – which runs applications using Ruby version 1.8.7 – and will shut down the stack on June 16 2015. Currently, I don’t have the time either to update my code for a newer Ruby version or to figure out the (frankly, near-unintelligible) instructions for migration to the newer Cedar stack.

So I figured now was a good time to learn some new skills, deal with a few issues and relaunch PMRetract as something easier to maintain and more portable. Here it is. As all the code is “out there” for viewing, I’ll just add few notes here regarding this latest incarnation.
Continue reading

Make prettier documents by reusing chunks in RMarkdown

No revelations here, just a little R tip for generating more readable documents.

Screenshot-RStudio.png

Original with lots of code at the top

There are times when I want to show code in a document, but I don’t want it to be the first thing that people see. What I want to see first is the output from that code. In this silly example, I want the reader to focus their attention on the result of myFunction(), which is 49.
Continue reading

Bioinformatics journals: time from submission to acceptance, revisited

Before we start: yes, we’ve been here before. There was the Biostars question “Calculating Time From Submission To Publication / Degree Of Burden In Submitting A Paper.” That gave rise to Pierre’s excellent blog post and code + data on Figshare.

So why are we here again? 1. It’s been a couple of years. 2. This is the R (+ Ruby) version. 3. It’s always worth highlighting how the poor state of publicly-available data prevents us from doing what we’d like to do. In this case the interesting question “which bioinformatics journal should I submit to for rapid publication?” becomes “here’s an incomplete analysis using questionable data regarding publication dates.”

Let’s get it out of the way then.
Continue reading

PubMed Publication Date: what is it, exactly?

File this one under “has troubled me (and others) for some years now, let’s try to resolve it.”

Let’s use the excellent R/rentrez package to search PubMed for articles that were retracted in 2013.

library(rentrez)

es <- entrez_search("pubmed", "\"Retracted Publication\"[PTYP] 2013[PDAT]", usehistory = "y")
es$count
# [1] 117

117 articles. Now let’s fetch the records in XML format.

xml <- entrez_fetch("pubmed", WebEnv = es$WebEnv, query_key = es$QueryKey, 
                    rettype = "xml", retmax = es$count)

Next question: which XML element specifies the “Date of publication” (PDAT)?
Continue reading

Ebola, Wikipedia and data janitors

Sometimes, several strands of thought come together in one place. For me right now, it’s the Wikipedia page “Ebola virus epidemic in West Africa”, which got me thinking about the perennial topic of “data wrangling”, how best to provide public data and why I can’t shake my irritation with the term “data science”. Not to mention Ebola, of course.

I imagine that a lot of people with an interest in biological data are following this story and thinking “how can I visualise the numbers for myself?” Maybe you’d like to reproduce the plots in the Timeline section of that Wikipedia entry. Surprise: the raw numbers are not that easy to obtain.

2014-09-26 note: when Wikipedia pages change, as this one has, code breaks, as this code has; updates maintained at Github
Continue reading

Venn figures go wrong

6-way Venn banana

6-way Venn banana

I thought nothing could top the classic “6-way Venn banana“, featured in The banana (Musa acuminata) genome and the evolution of monocotyledonous plants.

That is until I saw Figure 3 from Compact genome of the Antarctic midge is likely an adaptation to an extreme environment.

5-way Venn roadkill

5-way Venn roadkill

What’s odd is that Figure 2 in the latter paper is a nice, clear R/ggplot2 creation, using facet_grid(), so someone knew what they were doing.

That aside, the Antarctic midge paper is an interesting read; go check it out.

This led to some amusing Twitter discussion which pointed me to *A New Rose : The First Simple Symmetric 11-Venn Diagram.


[*] +1 for referencing The Damned, if indeed that was the intention.

When life gives you coloured cells, make categories

Let’s start by making one thing clear. Using coloured cells in Excel to encode different categories of data is wrong. Next time colleagues explain excitedly how “green equals normal and red = tumour”, you must explain that (1) they have sinned and (2) what they meant to do was add a column containing the words “normal” and “tumour”.

I almost hesitate to write this post but…we have to deal with the world as it is, not as we would like it to be. So in the interests of just getting the job done: here’s one way to deal with coloured cells in Excel, should someone send them your way.
Continue reading