I recently saw a tweet floating by which included a link to some recent statistics from PubMed Commons, the NCBI service for commenting on scientific articles in PubMed. Perhaps it was this post at their blog. So I thought now would be a good time to write some code to analyse PubMed Commons data.
The tl;dr version: here’s the Github repository and the RPubs report.
For further details and some charts, read on.
New Zealand earthquake density 2010 – November 2016
Using R to add data to maps has been pretty straightforward for a few years now. That said, it seems easier than ever to do things like use map APIs (e.g. Google, Open Street Map), overlay quite complex data visualisations (e.g. “heatmap-style” densities) and even generate animations.
A couple of key R packages in this space: ggmap and gganimate. To illustrate, I’ve used data from the recent New Zealand earthquake to generate some static maps and an animation. Here’s the Github repository and a report published at RPubs. Thanks to Florian Teschner for a great ggmap tutorial which got me started.
My own work in bioinformatics to date has not (sadly!) required much analysis of geospatial data but I can see use cases in many areas – environmental microbiology, for example.
I don’t “do politics” at this blog, but I’m always happy to do charts. Here’s one that’s been doing the rounds on Twitter recently:
What’s the first thing that comes into your mind on seeing that chart?
It seems that there are two main responses to the chart:
- Wow, what happened to all those Democrat voters between 2008 and 2016?
- Wow, that’s misleading, it makes it look like Democrat support almost halved between 2008 and 2016
The question then is: when (if ever) is it acceptable to start a y-axis at a non-zero value?
It’s always nice when 12-month-old code runs without a hitch. Not sure why this did not become a Github repo first time around, but now it is: my RMarkdown code to generate a report using data from the Nobel Prize API.
Now you too can generate a “gee, it’s all old white men” chart as seen in The Economist – Greying of the Nobel laureates, BBC News – Why are Nobel Prize winners getting older? and no doubt, many other outlets every year including me at RPubs, updated from 2015. As for myself, perhaps I should be offering my services to news outlets instead of publishing on blogs and obscure web platforms :)
It must be time for the annual report, kindly generated by the people from WordPress at the end of each year.
I’m pleased to see that I still averaged almost 2 posts a month, given that it was a difficult year in many ways (more on that later). Visitors from 202 countries! And if I never blogged again, it seems that people will want to learn about R’s apply functions for a long time to come.
2016 is going to be a bit “different”. Look out for the blog post which explains how and why, coming soon…
A recent tweet:
PubMed articles containing “novel” in title or abstract 1845 – 2014
made me think (1) has it really been 5 years, (2) gee, my ggplot skills were dreadful back then and (3) did I really not know how to correct for the increase in total publications?
So here is the update, at Github and a document at RPubs.
“Novel” findings, as judged by the usage of that word in titles and abstracts really have undergone a startling increase since about 1975. Indeed, almost 7.2% of findings were “novel” in 2014, compared with 3.2% for the period 1845 – 2014. That said, if we plot using a log scale as suggested by Tal on the original post, the rate of usage appears to be slowing down. See image, right (click for larger version).
As before, none of this is novel.
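Correcting for the increase in total publications amounts to dividing the yearly count of “novel” articles by the yearly total. Here is a minimal sketch of that idea, not the code from the Github repository: the counts are made up purely for illustration, and the column names are my own assumptions.

```r
# Hypothetical yearly counts for illustration only; these are NOT real PubMed numbers
counts <- data.frame(
  year  = c(1975, 1995, 2014),
  novel = c(5, 100, 600),        # articles with "novel" in title or abstract
  total = c(1000, 5000, 10000)   # all articles published that year
)

# Percentage of each year's articles that use "novel"
counts$pct_novel <- 100 * counts$novel / counts$total

# Base R plot with a log-scaled y-axis
plot(counts$year, counts$pct_novel, log = "y", type = "b",
     xlab = "Year", ylab = "% of articles containing \"novel\"")
```

On a log scale, a constant rate of growth appears as a straight line, so a curve that flattens suggests the growth in usage is slowing.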
The Nobel Prizes. Love them? Hate them? Are they still relevant, meaningful? Go on admit it, you always imagined you would win one day.
Whatever you think of them, the 2015 results are in. What’s more, the good people of the Nobel Foundation offer us free access to data via an API. I’ve published a document over at RPubs, showing some of the ways to access and analyse their data using R. Just to get you started:
# fromJSON() is assumed here to come from the jsonlite package
library(jsonlite)
u <- "http://api.nobelprize.org/v1/laureate.json"
nobel <- fromJSON(u)
In this post, just the highlights. Click the images for larger versions.
I enjoyed this article by Keith Bradnam, and the associated tweets, on the problem of duplicated names for bioinformatics software.
I figured that to some degree at least, we should be able to search for such instances, since the titles of published articles that describe software often follow a particular pattern. There may even be a grammatical term for it, but I’ll call it the announcement colon:
eDuS: Segmental Duplication Simulator
Reveel: large-scale population genotyping using low-coverage sequencing data
RNF: a general framework to evaluate NGS read mappers
Hammock: A Hidden Markov model-based peptide clustering algorithm to identify protein-interaction consensus motifs in large datasets
You get the idea. “XXX COLON a [METHOD] to [DO SOMETHING] using [SOME DATA].”
Let’s go in search of announcement colons, using titles from the PubMed Central dataset. You can find this mini-project at Github.
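To give a rough idea of how the search might work, here is a minimal sketch in base R. It is not necessarily the approach used in the Github project. The regular expression is my own assumption about what counts as an announcement colon, and the last title in the vector is a made-up control.

```r
# The first four titles are the examples listed above; the last is a made-up control
titles <- c(
  "eDuS: Segmental Duplication Simulator",
  "Reveel: large-scale population genotyping using low-coverage sequencing data",
  "RNF: a general framework to evaluate NGS read mappers",
  "Hammock: A Hidden Markov model-based peptide clustering algorithm to identify protein-interaction consensus motifs in large datasets",
  "A survey of read mapping tools"
)

# Assumed pattern: one token without spaces or colons at the start of the title,
# followed by a colon and a space
announcement_colon <- "^[^[:space:]:]+: "

has_colon <- grepl(announcement_colon, titles)

# Candidate software name: everything before the first colon
tool_names <- sub(":.*$", "", titles[has_colon])

# Names seen more than once (ignoring case) are candidate duplicates
name_counts <- table(tolower(tool_names))
duplicates <- names(name_counts[name_counts > 1])
```

With only these five titles, `duplicates` is empty. The idea is that the same steps, applied to a large set of titles, would surface names that have been used more than once.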
ANXA11 expression in human smooth muscle aortic cells post-ILb1 exposure
About a year ago, I did a little work on a very interesting project which was trying to identify blood-based biomarkers for the early detection of stroke. The data included gene expression measurements using microarrays at various time points after the onset of ischemia (reduced blood supply). I had not worked with timecourse data before, so I went looking for methods and found a Bioconductor package, maSigPro, which did exactly what I was looking for. In combination with ggplot2, it generated some very attractive and informative plots of gene expression over time.
I was very impressed by maSigPro and meant to get around to writing a short guide showing how to use it. So I did finally, using RMarkdown to create the document and here it is. The document also illustrates how to retrieve datasets from GEO using GEOquery and annotate microarray probesets using biomaRt. Hopefully it’s useful to some of you.
I’ll probably do more of this in the future, since publishing RMarkdown to RPubs is far easier than copying, pasting and formatting at WordPress.
Location of BLAST (tblastn) hits Mya arenaria GagPol (AIE48224.1) vs GOS contigs
Last week, I was listening to episode 337 of the podcast This Week in Virology. It concerned a retrovirus-like sequence element named Steamer, which is associated with a transmissible leukaemia in soft shell clams.
At one point the host and guests discussed the idea of searching for Steamer-like sequences in the data from ocean metagenomics projects, such as the Global Ocean Sampling expedition. Sounds like fun. So I made an initial attempt, using R/ggplot2 to visualise the results.
To make a long story short: the initial BLAST results are not super-convincing, the visualisation could use some work (click image, right, for larger version) and the code/data are all public at Github, summarised in this report. It made for a fun, relatively quick side project.