I missed it first time around but apparently, back in October, Nature published a somewhat-controversial article: Evidence for a limit to human lifespan. It came to my attention in a recent tweet:
The source: a fact-check article from Dutch news organisation NRC titled “Nature article is wrong about 115 year limit on human lifespan”. NRC seem rather interested in this research article. They have published another, more recent critique of the work, titled “Statistical problems, but not enough to warrant a rejection”, and a discussion of that critique, “Peer review post-mortem: how a flawed aging study was published in Nature”.
Unfortunately, the first NRC article does itself no favours by using non-comparable x-axis scales for its charts and by not explaining very well how the different datasets (IDL and GRG) were used. Data nerds everywhere, then, are wondering whether to repeat the analysis themselves and perhaps fire off a letter to Nature.
I recently saw a tweet floating by which included a link to some recent statistics from PubMed Commons, the NCBI service for commenting on scientific articles in PubMed. Perhaps it was this post at their blog. So I thought now would be a good time to write some code to analyse PubMed Commons data.
The tl;dr version: here’s the Github repository and the RPubs report.
For further details and some charts, read on.
New Zealand earthquake density 2010 – November 2016
Using R to add data to maps has been pretty straightforward for a few years now. That said, it seems easier than ever to do things like use map APIs (e.g. Google, Open Street Map), overlay quite complex data visualisations (e.g. “heatmap-style” densities) and even generate animations.
A couple of key R packages in this space: ggmap and gganimate. To illustrate, I’ve used data from the recent New Zealand earthquake to generate some static maps and an animation. Here’s the Github repository and a report published at RPubs. Thanks to Florian Teschner for a great ggmap tutorial which got me started.
My own work in bioinformatics to date has not (sadly!) required much analysis of geospatial data but I can see use cases in many areas – environmental microbiology, for example.
I don’t “do politics” at this blog, but I’m always happy to do charts. Here’s one that’s been doing the rounds on Twitter recently:
What’s the first thing that comes into your mind on seeing that chart?
It seems that there are two main responses to the chart:
- Wow, what happened to all those Democrat voters between 2008 and 2016?
- Wow, that’s misleading, it makes it look like Democrat support almost halved between 2008 and 2016
The question then is: when (if ever) is it acceptable to start a y-axis at a non-zero value?
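To put a rough number on the second response, here is a minimal Python sketch of how much a truncated y-axis exaggerates a difference between two bars. The values are hypothetical (they are not the numbers from the chart) and the function name `apparent_ratio` is my own.

```python
def apparent_ratio(a, b, baseline=0.0):
    """Ratio of the drawn heights of bars a and b when the y-axis starts at `baseline`."""
    if baseline >= min(a, b):
        raise ValueError("baseline must be below both values")
    return (a - baseline) / (b - baseline)


# Hypothetical values for illustration only, not taken from the chart.
earlier, later = 70.0, 60.0

# With a zero baseline the bars show the true ratio (60 / 70).
print(round(apparent_ratio(later, earlier), 3))

# Starting the axis at 50, the later bar is drawn at half the height of the earlier one.
print(apparent_ratio(later, earlier, baseline=50.0))
```

In this made-up case a drop of about 14% is drawn as a drop of 50%, which is the kind of effect the second response is complaining about.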
It’s always nice when 12-month-old code runs without a hitch. Not sure why this did not become a Github repo first time around, but now it is: my RMarkdown code to generate a report using data from the Nobel Prize API.
Now you too can generate a “gee, it’s all old white men” chart as seen in The Economist – Greying of the Nobel laureates, BBC News – Why are Nobel Prize winners getting older? and no doubt, many other outlets every year including me at RPubs, updated from 2015. As for myself, perhaps I should be offering my services to news outlets instead of publishing on blogs and obscure web platforms :)
Why, it seems like only 12 years since we read Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics.
And can it really be 4 years since we reviewed the topic of gene name corruption in Gene name errors and Excel: lessons not learned?
Well, here we are again in 2016 with Gene name errors are widespread in the scientific literature. This study examined 35 175 supplementary Excel data files from 3 597 published articles. Simple yet clever, isn’t it? I bet you wish you’d thought of doing that. I do. The conclusion: about 20% of the articles have associated data files in which gene names have been corrupted by Excel.
What if there is no tomorrow? There wasn’t one today.
We tell you not to use Excel. You counter with a host of reasons why you have to use Excel. None of them are good reasons. I don’t know what else to say. Except to reiterate that probably 80% or more of the data analyst’s time is spent on data cleaning and a good proportion of the dirt arises from avoidable errors.
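For anyone who has not seen the problem first-hand: Excel silently converts gene symbols such as SEPT2 or MARCH1 into dates like 2-Sep or 1-Mar. Here is a minimal Python sketch of how you might flag both the damage and the symbols at risk. The regular expressions and function names are my own illustrative assumptions, not the method used in the paper, and they will not catch every case.

```python
import re

MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Values that look like a gene symbol after Excel's date conversion, e.g. "2-Sep".
DATE_LIKE = re.compile(rf"^\d{{1,2}}-({MONTHS})$", re.IGNORECASE)

# Symbols that Excel is likely to read as a date, e.g. "SEPT2", "MARCH1", "DEC1".
# This prefix list is an assumption for illustration, not an exhaustive catalogue.
AT_RISK = re.compile(r"^(SEPT|SEP|MARCH|MAR|DEC|OCT|NOV|FEB|APR)\d{1,2}$", re.IGNORECASE)


def looks_corrupted(value):
    """True if a cell value resembles a date-converted gene symbol."""
    return bool(DATE_LIKE.match(value.strip()))


def at_risk(symbol):
    """True if a gene symbol is likely to be converted to a date by Excel."""
    return bool(AT_RISK.match(symbol.strip()))


column = ["TP53", "2-Sep", "BRCA1", "1-Mar"]
print([v for v in column if looks_corrupted(v)])
```

Running `at_risk` over a gene list before it goes anywhere near a spreadsheet is cheaper than running `looks_corrupted` over the wreckage afterwards.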
May. No blog posts yet in 2016. “What’s going on, Neil?” asked no-one at all. For anyone who may be wondering…
Last November, I resigned from my position with my previous employer after almost 7 years. Just before Christmas, I was offered a position as a data scientist with a Sydney-based healthcare technology start-up. I started working there in early January and so far, it has been a terrific experience. Had I known how enjoyable it could be, I would have made a move like this 10 years ago. Career advice: there are many more jobs that can engage scientists and utilise their skills than academic research.
So what does that mean for this blog? It means that I’m no longer a researcher, at least in the narrow sense that science would use that word. It means that the things I learn during a working day are unlikely to translate into blog posts of broader interest (confidentiality issues notwithstanding). And quite frankly, given where I’m at in my life (balancing working for a startup with raising my family), it means that I no longer have time to write regular blog posts.
Like a band that never officially breaks up, I’m not ready to declare the end just yet. So I’m placing the blog “on hiatus”, indefinitely. I’ll still be active online, which right now mostly means Twitter.
It must be time for the annual report, kindly generated by the people from WordPress at the end of each year.
I’m pleased to see that I still averaged almost 2 posts a month, given that it was a difficult year in many ways (more on that later). Visitors from 202 countries! And if I never blogged again, it seems that people will want to learn about R’s apply functions for a long time to come.
2016 is going to be a bit “different”. Look out for the blog post which explains how and why, coming soon…
Just a short note to alert you to a publication with my name on it. Great work by lead author and former colleague Aidan; I just did “the Gephi stuff”. If you’re interested in bioinformatics applications of Apache Spark, take a look at:
VariantSpark: population scale clustering of genotype information
Happy to report it is open access.
A recent tweet:
PubMed articles containing “novel” in title or abstract 1845 – 2014
made me think (1) has it really been 5 years, (2) gee, my ggplot skills were dreadful back then and (3) did I really not know how to correct for the increase in total publications?
So here is the update, at Github and a document at RPubs.
“Novel” findings, as judged by the usage of that word in titles and abstracts, really have undergone a startling increase since about 1975. Indeed, almost 7.2% of findings were “novel” in 2014, compared with 3.2% for the period 1845 – 2014. That said, if we plot using a log scale as suggested by Tal on the original post, the rate of usage appears to be slowing down. See image, right (click for larger version).
As before, none of this is novel.
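The correction for the increase in total publications is nothing more than dividing the “novel” count by the total count for each year. My own code for this post is in R at the Github repository; what follows is only a minimal Python sketch of the idea. The counts are made up for illustration and the function name `percent_novel` is hypothetical.

```python
import math


def percent_novel(novel_counts, total_counts):
    """Percentage of articles per year that use the word, skipping years with no articles."""
    return {
        year: 100.0 * novel_counts.get(year, 0) / total
        for year, total in total_counts.items()
        if total > 0
    }


# Made-up counts for illustration only, not real PubMed numbers.
totals = {1: 1000, 2: 2000, 3: 4000}
novel = {1: 10, 2: 40, 3: 120}

pct = percent_novel(novel, totals)
print(pct)

# On a log scale, equal steps mean equal fold-changes, so a slowing rate of
# increase shows up as a flattening curve.
log_pct = {year: math.log10(p) for year, p in pct.items() if p > 0}
print(log_pct)
```

Raw counts would rise even if nobody's vocabulary changed, simply because more articles are published every year; the percentage is what tells you whether usage itself has gone up.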