Mapping the Vikings using R

The commute to my workplace is 90 minutes each way. Podcasts are my friend. I’m a long-time listener of In Our Time and enjoyed the recent episode about The Danelaw.

Melvyn and I hail from the same part of the world, and I learned as a child that many of the local place names there were derived from Old Norse or Danish. Notably: places ending in -by denote a farmstead, settlement or village; those ending in -thwaite mean a clearing or meadow.

So how local are those names? Time for some quick and dirty maps using R.
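As a rough illustration of the suffix rule, here is a minimal base R sketch that tags place names ending in -by or -thwaite. The handful of names is a hand-picked sample for illustration only, not the data behind the maps, and `classify_suffix` is a hypothetical helper.

```r
# Hand-picked sample of English place names (illustrative only)
places <- c("Whitby", "Grimsby", "Bassenthwaite", "Kendal", "Selby")

# Hypothetical helper: label a name by the suffix it ends with
classify_suffix <- function(x) {
  ifelse(grepl("thwaite$", x, ignore.case = TRUE), "-thwaite",
         ifelse(grepl("by$", x, ignore.case = TRUE), "-by", "other"))
}

# One row per place, with its suffix label
data.frame(place = places, suffix = classify_suffix(places))
```

A plain suffix match is crude: it cannot tell whether a given name really has a Norse origin, so treat the labels as candidates rather than confirmed etymology.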

How long since your team scored 100+ points? This blog’s first foray into the fitzRoy R package

When this blog moved from bioinformatics to data science I ran a Twitter poll to ask whether I should start afresh at a new site or continue here. “Continue here”, you said.

So let’s test the tolerance of the long-time audience and celebrate the start of the 2019 season as we venture into the world of – Australian football (AFL) statistics!

This is not normal(ised)

“Sydney stations where commuters fall through gaps, get stuck in lifts” blares the headline. The story tells us that:

Central Station, the city’s busiest, topped the list last year with about 54 people falling through gaps

Wow! Wait a minute…

Central Station, the city’s busiest

Some poking around in the NSW Transport Open Data portal reveals how many people enter every Sydney train station on a “typical” day in 2016, 2017 and 2018. We could manipulate those numbers in various ways to estimate total, unique passengers for FY 2017-18 but I’m going to argue that the value as-is serves as a proxy variable for “station busyness”.

Grabbing the numbers for 2017:


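library(tidyverse)

# Falls and "typical day" station entries, 2017
tibble(station = c("Central", "Circular Quay", "Redfern"),
       falls   = c(54, 34, 18),
       entries = c(118960, 27870, 30570)) %>%
  # normalise falls by station busyness
  mutate(falls_per_entry = falls/entries) %>%
  select(-entries) %>%
  # long format: one row per station per variable
  gather(Variable, Value, -station) %>%
  ggplot(aes(station, Value)) +
    geom_col() +
    facet_wrap(~Variable,
               scales = "free_y")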


Looks like Circular Quay has the bigger problem. Now we have a data story. More tourists? Maybe improve the signage.
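The same comparison can be checked numerically without a chart. This is a minimal base R sketch using the figures quoted above; expressing the rate per 100,000 entries is my own arbitrary scaling, chosen only to make the numbers easier to read.

```r
# Figures quoted in the post above
falls   <- c(Central = 54, "Circular Quay" = 34, Redfern = 18)
entries <- c(Central = 118960, "Circular Quay" = 27870, Redfern = 30570)

# Falls per 100,000 entries (the scaling factor is an arbitrary choice)
falls_per_100k <- falls / entries * 1e5

# Rank stations from highest to lowest normalised rate
sort(round(falls_per_100k, 1), decreasing = TRUE)
```

Because falls are annual and entries are for a typical day, the rate is only useful for comparing stations with each other. On these numbers Circular Quay's rate is well over twice that of Central, even though Central has the most falls in absolute terms.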

Deep in the comment thread, amidst the “only themselves to blame” crowd, one person gets it:

Sydney stations where commuters fall through gaps get stuck in lifts

Using parameters in Rmarkdown

Nothing new or original here, just something that I learned about quite recently that may be useful for others.

One of my more “popular” code repositories, judging by Twitter, is – well, Twitter. It mostly contains Rmarkdown reports which summarise meetings and conferences by analysing usage of their associated Twitter hashtags.

The reports follow a common template where the major difference is simply the hashtag. So one way to create these reports is to use the previous one, edit to find/replace the old hashtag with the new one, and save a new file.

That works…but what if we could define the hashtag once, then reuse it programmatically anywhere in the document? Enter Rmarkdown parameters.

Here’s an example .Rmd file. It’s fairly straightforward: just include a params: section in the YAML header at the top and include variables as key-value pairs:

---
params:
  hashtag: "#amca19"
  max_n: 18000
  timezone: "US/Eastern"
title: "Twitter Coverage of `r params$hashtag`"
author: "Neil Saunders"
date: "`r Sys.time()`"
---

Then, wherever you want to include the value for the variable named hashtag, simply use params$hashtag, as in the title shown here or in later code chunks.

```{r search-twitter}
library(rtweet)  # provides search_tweets()
tweets <- search_tweets(params$hashtag, params$max_n)
saveRDS(tweets, "tweets.rds")
```

That's it! There may still be some customisation and editing specific to each report, but parameters go a long way to minimising that work.
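Parameters can also be overridden at render time, so one template can produce a report per hashtag without editing the file. Here is a minimal sketch, assuming the template is saved as `report.Rmd` and renders to HTML (both assumptions); `report_file` and `render_report` are hypothetical helpers, while `rmarkdown::render()` and its `params` argument come from the rmarkdown package.

```r
# Hypothetical helper: derive an output file name from a hashtag
report_file <- function(hashtag) {
  paste0(gsub("[^A-Za-z0-9]", "", hashtag), ".html")
}

# Hypothetical helper: render the template with a different hashtag.
# Values in `params` override the defaults in the YAML header;
# parameters not listed here keep their defaults.
render_report <- function(hashtag, template = "report.Rmd") {
  rmarkdown::render(template,
                    output_file = report_file(hashtag),
                    params = list(hashtag = hashtag))
}

# Only attempt a render if the (assumed) template exists
if (file.exists("report.Rmd")) {
  render_report("#amca19")
}
```

Looping `render_report()` over a vector of hashtags would then generate one report per meeting from the same template.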

An absolute beginner’s guide to creating data frames for a Stack Overflow [r] question

For better or worse I spend some time each day at Stack Overflow [r], reading and answering questions. If you do the same, you probably notice certain features in questions that recur frequently. It’s as though everyone is copying from one source – perhaps the one at the top of the search results. And it seems highest-ranked is not always best.

Nowhere is this more apparent to me than in the way many users create data frames. So here is my introductory guide “how not to create data frames”, aimed at beginners writing their first questions.
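For context, one widely suggested way to share a small example data set in a question is to build it with `data.frame()` and paste the output of `dput()`. This is a general sketch with made-up values, not necessarily the approach the guide itself recommends.

```r
# A small made-up data frame for a question
df1 <- data.frame(id    = 1:3,
                  group = c("a", "b", "a"),
                  value = c(2.5, 3.1, 4.7),
                  stringsAsFactors = FALSE)

# dput() prints R code that recreates the object;
# paste that output into the question
dput(df1)

# Check that the printed code rebuilds an equivalent data frame
df2 <- eval(parse(text = paste(capture.output(dput(df1)), collapse = "\n")))
all.equal(df1, df2)
```

Anyone answering can then copy the `dput()` output straight into their own session and work with the same object.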


Price’s Protein Puzzle: 2019 update

Chains of amino acids strung together make up proteins and since each amino acid has a 1-letter abbreviation, we can find words (English and otherwise) in protein sequences. I imagine this pursuit began as soon as proteins were first sequenced, but the first reference to protein word-finding as a sport is, to my knowledge, “Price’s Protein Puzzle”, a letter to Trends in Biochemical Sciences in September 1987 [1].

Price wrote:

It occurred to me that TIBS could organise a competition to find the longest word […] contained within any known protein sequence.

The journal took up the challenge and published the winning entries in February 1988 [2]. The 7-letter winner was RERATED, with two 6-letter runners-up: LEADER and LIVELY. The sub-genre “biological words in protein sequences” was introduced almost one year later [3] with the discovery of ALLELE, then no more was heard until 1993 with Gonnet and Benner’s Nature correspondence “A Word in Your Protein” [4].

Noting that “none of the extensive literature devoted to this problem has taken a truly systematic approach” (it’s in Nature so one must declare superiority), this work is notable for two reasons. First, it discovered two 9-letter words: HIDALGISM and ENSILISTS. Second, it mentions the technique: a Patricia tree data structure, and that the search took 23 minutes.

Comments on this letter noted one protein sequence that ends with END [5] and the discovery of three 10-letter, though non-English, words: ANNIDAVATE, WALLAWALLA and TARIEFKLAS [6].

I last visited this topic at my blog in 2008 and at someone else’s blog in 2015. So why am I here again? Because the Aho-Corasick algorithm in R, that’s why!
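To make the game concrete, here is a minimal base R sketch of the idea. It uses naive fixed-substring matching, one word at a time, as a stand-in; it is not the Aho-Corasick algorithm, which searches for all keywords in a single pass over the text. The sequence `toy_seq` is made up for illustration and is not a real protein.

```r
# One-letter codes for the 20 standard amino acids
aa <- "ACDEFGHIKLMNPQRSTVWY"

# Candidate words: three winners quoted above plus one that cannot occur
words <- c("RERATED", "LEADER", "LIVELY", "JOB")

# Keep only words spelled entirely with amino acid letters
spellable <- words[grepl(paste0("^[", aa, "]+$"), words)]

# Made-up toy sequence (not a real protein)
toy_seq <- "MKTLEADERGSWLIVELYQ"

# Naive search: test each word as a fixed substring of the sequence
found <- spellable[vapply(spellable, grepl, logical(1),
                          x = toy_seq, fixed = TRUE)]
found
```

This approach rescans the sequence once per word, which is fine for a toy example but scales poorly to a full dictionary and a full protein database.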


Using OSX? Compiling an R package from source? Issues with ‘-fopenmp’? Try this.

You can file this one under “I may have the very specific solution if you’re having exactly the same problem.”

So: if you’re running some R code and you see a warning like this:

Warning message:
In checkMatrixPackageVersion() : Package version inconsistency detected.
TMB was built with Matrix version 1.2.14
Current Matrix version is 1.2.15
Please re-install 'TMB' from source using 
install.packages('TMB', type = 'source') or ask CRAN for a binary 
version of 'TMB' matching CRAN's 'Matrix' package


Just use a scatterplot. Also, Sydney sprawls.

Dual-axes at tipping-point

Sydney’s congestion at ‘tipping point’ blares the headline and to illustrate, an interactive chart with bars for city population densities, points for commute times and of course, dual-axes.

Yuck. OK, I guess it does show that Sydney is one of three cities that are low density, but have comparable average commute times to higher-density cities. But if you’re plotting commute time versus population density…doesn’t a different kind of chart come to mind first? y versus x. C’mon.

Let’s explore.

Using leaflet, just because

I love it when researchers take the time to share their knowledge of the computational tools that they use. So first, let me point you at Environmental Computing, a site run by environmental scientists at the University of New South Wales, which has a good selection of R programming tutorials.

One of these is Making maps of your study sites. It was written with the specific purpose of generating simple, clean figures for publications and presentations, which it achieves very nicely.

I’ll be honest: the sole motivator for this post is that I thought it would be fun to generate the map using Leaflet for R as an alternative. You might use Leaflet if you want:

  • An interactive map that you can drag, zoom, click for popup information
  • A “fancier” static map with geographical features of interest
  • Concise, clean code that uses pipes and doesn’t require you to process shapefiles
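To give a flavour of that piped style, here is a minimal sketch of a Leaflet map with a single clickable marker. The site and its coordinates are an illustrative assumption (an approximate location for the UNSW Sydney campus), not the study sites from the tutorial.

```r
library(leaflet)

# Approximate coordinates for the UNSW Sydney campus (illustrative)
site <- data.frame(name = "UNSW Sydney", lng = 151.231, lat = -33.917)

# Base map with default OpenStreetMap tiles and one clickable marker
m <- leaflet(site) %>%
  addTiles() %>%
  addMarkers(lng = ~lng, lat = ~lat, popup = ~name)

# Show the interactive map when running in an interactive session
if (interactive()) m
```

The formula syntax (`~lng`, `~lat`, `~name`) refers to columns of the data frame passed to `leaflet()`, so adding more sites is just a matter of adding rows.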
