- Last week was useR! conference time again, coming to you this time from Toulouse, France
- I’ve retrieved 8 318 tweets that mention #user2019 and run them through my report generator
- And here are the results
Take-home message this year: the R Ladies rock!
I’m not saying this is a good idea, but bear with me.
This week we return to Australian Rules Football, the R package fitzRoy and some statistics to ask – why can’t Geelong win after a bye?
(with apologies to long-time readers who used to come for the science)
Why would you even ask that? Well, because this.
I sense problems immediately. First, the story is tagged “evolution”. The horns are not arising through inheritance of advantageous mutations, so that isn’t evolution.
Yes, last time I checked, horns were external and pointed upwards. The X-ray seems to show an internal, downward-pointing bone growth.
But wait, there’s more.
The commute to my workplace is 90 minutes each way. Podcasts are my friend. I’m a long-time listener of In Our Time and enjoyed the recent episode about The Danelaw.
Melvyn and I hail from the same part of the world, and I learned as a child that many of the local place names there were derived from Old Norse or Danish. Notably: places ending in -by denote a farmstead, settlement or village; those ending in -thwaite mean a clearing or meadow.
So how local are those names? Time for some quick and dirty maps using R.
When this blog moved from bioinformatics to data science I ran a Twitter poll to ask whether I should start afresh at a new site or continue here. “Continue here”, you said.
So let’s test the tolerance of the long-time audience and celebrate the start of the 2019 season as we venture into the world of – Australian football (AFL) statistics!
“Sydney stations where commuters fall through gaps, get stuck in lifts” blares the headline. The story tells us that:
Central Station, the city’s busiest, topped the list last year with about 54 people falling through gaps
Wow! Wait a minute…
Central Station, the city’s busiest
Some poking around in the NSW Transport Open Data portal reveals how many people enter every Sydney train station on a “typical” day in 2016, 2017 and 2018. We could manipulate those numbers in various ways to estimate total, unique passengers for FY 2017-18 but I’m going to argue that the value as-is serves as a proxy variable for “station busyness”.
Grabbing the numbers for 2017:
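library(tidyverse)

# Falls (from the story) and typical daily entries for three stations
tibble(station = c("Central", "Circular Quay", "Redfern"),
       falls = c(54, 34, 18),
       entries = c(118960, 27870, 30570)) %>%
  mutate(falls_per_entry = falls/entries) %>%
  gather(Variable, Value, -station) %>%
  ggplot(aes(station, Value)) +
  # one panel per variable, each with its own y-axis
  geom_col() +
  facet_wrap(~Variable, scales = "free_y")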
Looks like Circular Quay has the bigger problem. Now we have a data story. More tourists? Maybe improve the signage.
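The same comparison can be made without a plot. Here is a minimal sketch in base R that reuses the three stations' figures from above; the per-100,000 scaling and the variable names are assumptions for illustration, not part of the original analysis:

```r
# Falls and "typical day" entries, copied from the figures above
stations <- data.frame(
  station = c("Central", "Circular Quay", "Redfern"),
  falls   = c(54, 34, 18),
  entries = c(118960, 27870, 30570)
)

# Scale to falls per 100,000 entries so the rates are easier to read
# (the scaling factor is an arbitrary choice for this sketch)
stations$falls_per_100k <- stations$falls / stations$entries * 1e5

# Order stations from highest to lowest rate
ranked <- stations[order(-stations$falls_per_100k), ]
print(ranked)

# The station with the highest rate of falls relative to busyness
worst <- ranked$station[1]
```

Ranking by rate rather than raw count is what moves Central off the top of the list.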
Deep in the comment thread, amidst the “only themselves to blame” crowd, one person gets it:
Nothing new or original here, just something that I learned about quite recently that may be useful for others.
One of my more “popular” code repositories, judging by Twitter, is – well, Twitter. It mostly contains Rmarkdown reports which summarise meetings and conferences by analysing usage of their associated Twitter hashtags.
The reports follow a common template where the major difference is simply the hashtag. So one way to create these reports is to use the previous one, edit to find/replace the old hashtag with the new one, and save a new file.
That works…but what if we could define the hashtag once, then reuse it programmatically anywhere in the document? Enter Rmarkdown parameters.
Here’s an example .Rmd file. It’s fairly straightforward: just include a params: section in the YAML header at the top and include variables as key-value pairs:
title: "Twitter Coverage of `r params$hashtag`"
author: "Neil Saunders"
date: "`r Sys.time()`"
Then, wherever you want to include the value for the variable named hashtag, simply use params$hashtag, as in the title shown here or in later code chunks.
tweets <- search_tweets(params$hashtag, params$max_n)
That's it! There may still be some customisation and editing specific to each report, but parameters go a long way to minimising that work.
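To make the idea concrete, here is a minimal sketch that writes a tiny parameterised .Rmd to a temporary file and reads the declared defaults back. The parameter names hashtag and max_n come from the code above; the default values, the file contents and the override values in the commented render() call are hypothetical, chosen only for illustration:

```r
library(rmarkdown)

# A toy .Rmd with a params: section in its YAML header.
# The default values here are assumptions for this sketch.
rmd_lines <- c(
  "---",
  "title: \"Twitter Coverage of `r params$hashtag`\"",
  "params:",
  "  hashtag: \"#user2019\"",
  "  max_n: 18000",
  "---",
  "",
  "This report covers `r params$hashtag`."
)
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_file)

# Read the declared parameter defaults back from the YAML header
defaults <- yaml_front_matter(rmd_file)$params
str(defaults)

# To build a report for a different hashtag, override the defaults
# at render time instead of editing the file (requires pandoc):
# render(rmd_file, params = list(hashtag = "#rstats", max_n = 5000))
```

Passing params to render() means one template can serve every conference, with the hashtag supplied from a script or the console.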
Various people have suggested that taking a break from social networks – Twitter in particular – can be A Good Thing™.
So I tried it, for a couple of weeks. Here’s what I learned.
For better or worse I spend some time each day at Stack Overflow [r], reading and answering questions. If you do the same, you probably notice certain features in questions that recur frequently. It’s as though everyone is copying from one source – perhaps the one at the top of the search results. And it seems highest-ranked is not always best.
Nowhere is this more apparent to me than in the way many users create data frames. So here is my introductory guide “how not to create data frames”, aimed at beginners writing their first questions.