Debuting in a VFL/AFL Grand Final is rare

When Marlion Pickett runs onto the M.C.G for Richmond in the AFL Grand Final this Saturday, he’ll be only the sixth player in 124 finals to debut on the big day.

The sole purpose of this blog post is to illustrate how incredibly easy it is to figure this out, thanks to the dplyr and fitzRoy packages.

library(dplyr)
library(fitzRoy)

afldata <- get_afltables_stats()

afldata %>% 
  select(Season, Round, Date, ID, First.name, Surname, Playing.for, 
         Home.team, Home.score, Away.team, Away.score) %>% 
  group_by(ID) %>% 
  arrange(Date) %>%
  # a player's first game 
  slice(1) %>% 
  ungroup() %>% 
  # grand finals only
  filter(Round == "GF") %>%
  # get the winning/losing margin 
  mutate(Margin = case_when(Playing.for == Home.team ~ Home.score - Away.score,
                            TRUE ~ Away.score - Home.score)) %>% 
  select(-Home.team, -Away.team, -Home.score, -Away.score)
Season Round Date ID First.name Surname Playing.for Margin
1908 GF 1908-09-26 5573 Harry Prout Essendon -9
1920 GF 1920-10-02 6677 Billy James Richmond 17
1923 GF 1923-10-20 6915 George Rawle Essendon 17
1926 GF 1926-10-09 3824 Francis Vine Melbourne 57
1952 GF 1952-09-27 9361 Keith Batchelor Collingwood -46

Extracting Sydney transport data from Twitter

The @sydstats Twitter account uses this code base, and data from the Transport for NSW Open Data API to publish insights into delays on the Sydney Trains network.

Each tweet takes one of two forms and is consistently formatted, making it easy to parse and extract information. Here are a couple of examples with the interesting parts highlighted in bold:

Between 16:00 and 18:30 today, 26% of trips experienced delays. #sydneytrains

The worst delay was 16 minutes, on the 18:16 City to Berowra via Gordon service. #sydneytrains


I’ve created a Github repository with code and a report showing some ways in which this data can be explored.

The take-home message: expect delays somewhere most days but in particular on Monday mornings, when students return to school after the holidays, and if you’re travelling in the far south-west or north-west of the network.

Mapping the Vikings using R

The commute to my workplace is 90 minutes each way. Podcasts are my friend. I’m a long-time listener of In Our Time and enjoyed the recent episode about The Danelaw.

Melvyn and I hail from the same part of the world, and I learned as a child that many of the local place names there were derived from Old Norse or Danish. Notably: places ending in -by denote a farmstead, settlement or village; those ending in -thwaite mean a clearing or meadow.

So how local are those names? Time for some quick and dirty maps using R.
Continue reading

An absolute beginner’s guide to creating data frames for a Stack Overflow [r] question

For better or worse I spend some time each day at Stack Overflow [r], reading and answering questions. If you do the same, you probably notice certain features in questions that recur frequently. It’s as though everyone is copying from one source – perhaps the one at the top of the search results. And it seems highest-ranked is not always best.

Nowhere is this more apparent to me than in the way many users create data frames. So here is my introductory guide “how not to create data frames”, aimed at beginners writing their first questions.

Continue reading

Just use a scatterplot. Also, Sydney sprawls.

Dual-axes at tipping-point

Sydney’s congestion at ‘tipping point’ blares the headline and to illustrate, an interactive chart with bars for city population densities, points for commute times and of course, dual-axes.

Yuck. OK, I guess it does show that Sydney is one of three cities that are low density, but have comparable average commute times to higher-density cities. But if you’re plotting commute time versus population density…doesn’t a different kind of chart come to mind first? y versus x. C’mon.

Let’s explore.
Continue reading

Using leaflet, just because

I love it when researchers take the time to share their knowledge of the computational tools that they use. So first, let me point you at Environmental Computing, a site run by environmental scientists at the University of New South Wales, which has a good selection of R programming tutorials.

One of these is Making maps of your study sites. It was written with the specific purpose of generating simple, clean figures for publications and presentations, which it achieves very nicely.

I’ll be honest: the sole motivator for this post is that I thought it would be fun to generate the map using Leaflet for R as an alternative. You might use Leaflet if you want:

  • An interactive map that you can drag, zoom, click for popup information
  • A “fancier” static map with geographical features of interest
  • concise and clean code which uses pipes and doesn’t require that you process shapefiles

Continue reading

Twitter coverage of the useR! 2018 conference

In summary:

The code that generated the report (which I’ve used heavily and written about before) is at Github too. A few changes required compared with previous reports, due to changes in the rtweet package, and a weird issue with kable tables breaking markdown headers.

I love that the most popular media attachment is a screenshot of a Github repo.