It’s been 3 years since we last visited that old favourite recurring topic, data corruption by Excel. Specifically, the unwanted auto-conversion of identifiers that look like dates, e.g. SEPT1, to – well, dates.
Here’s a new twist – well, a two-year-old twist in fact, as I don’t keep up to date with this field any longer:
Yes, in 2017 the HGNC decided that the solution to this long-standing issue is to rename the offending genes to prevent the auto-conversion. I’m yet to determine whether anything more came of the proposal.
It is, I suppose, a practical suggestion that will work. The newsletter states that:
Our initial consultation with the research community publishing on these genes had very mixed results
I bet it did. However, given that ongoing consultation with the research community about the inappropriate use of software has had essentially no results in 15+ years, perhaps it is the most effective solution to the problem.
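For anyone who wants to check a gene list before it goes near a spreadsheet, here is a minimal sketch in R. The helper `looks_like_date()` and its month list are hypothetical, written for illustration only; they approximate the kind of identifier that gets converted and do not reproduce Excel's actual rules.

```r
# Hypothetical helper: flag identifiers that resemble "month + day" strings.
# The month abbreviations and the pattern are assumptions for illustration,
# not a description of how Excel decides what is a date.
looks_like_date <- function(ids) {
  months <- c("JAN", "FEB", "MAR", "MARCH", "APR", "MAY", "JUN", "JUL",
              "AUG", "SEP", "SEPT", "OCT", "NOV", "DEC")
  pattern <- paste0("^(", paste(months, collapse = "|"), ")[-/ ]?[0-9]{1,2}$")
  grepl(pattern, toupper(ids))
}

# Two date-like symbols and two that should be left alone
genes <- c("SEPT1", "MARCH1", "TP53", "BRCA2")
at_risk <- genes[looks_like_date(genes)]
print(at_risk)
```

Running a check like this before export at least tells you which identifiers to inspect after a round trip through a spreadsheet.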
When Marlion Pickett runs onto the MCG for Richmond in the AFL Grand Final this Saturday, he’ll be only the sixth player in 124 finals to debut on the big day.
The sole purpose of this blog post is to illustrate how incredibly easy it is to figure this out, thanks to the dplyr and fitzRoy packages.
library(dplyr)
library(fitzRoy)

afldata <- get_afltables_stats()

afldata %>%
  select(Season, Round, Date, ID, First.name, Surname, Playing.for,
         Home.team, Home.score, Away.team, Away.score) %>%
  # a player's first game (assumed reconstruction: earliest Date per player ID)
  group_by(ID) %>%
  filter(Date == min(Date)) %>%
  ungroup() %>%
  # grand finals only
  filter(Round == "GF") %>%
  # get the winning/losing margin
  mutate(Margin = case_when(Playing.for == Home.team ~ Home.score - Away.score,
                            TRUE ~ Away.score - Home.score)) %>%
  select(-Home.team, -Away.team, -Home.score, -Away.score)
The @sydstats Twitter account uses this code base and data from the Transport for NSW Open Data API to publish insights into delays on the Sydney Trains network.
Each tweet takes one of two forms and is consistently formatted, making it easy to parse and extract information. Here are a couple of examples with the interesting parts highlighted in bold:
Between 16:00 and 18:30 today, 26% of trips experienced delays. #sydneytrains
The worst delay was 16 minutes, on the 18:16 City to Berowra via Gordon service. #sydneytrains
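Because the wording is so regular, a couple of regular expressions are enough to pull out the numbers. What follows is a rough sketch using base R on the two example tweets above. The patterns and the variable names (`pct_pattern`, `worst_pattern`, `delayed`, `worst`) are my own assumptions for illustration and are not taken from the repository.

```r
tweets <- c(
  "Between 16:00 and 18:30 today, 26% of trips experienced delays. #sydneytrains",
  "The worst delay was 16 minutes, on the 18:16 City to Berowra via Gordon service. #sydneytrains"
)

# Form 1: time window and percentage of trips delayed
pct_pattern <- "^Between ([0-9:]+) and ([0-9:]+) today, ([0-9]+)% of trips"
m1 <- regmatches(tweets[1], regexec(pct_pattern, tweets[1]))[[1]]
delayed <- data.frame(from = m1[2], to = m1[3], percent = as.numeric(m1[4]))

# Form 2: length of the worst delay, departure time and route
worst_pattern <- "^The worst delay was ([0-9]+) minutes?, on the ([0-9:]+) (.+) service"
m2 <- regmatches(tweets[2], regexec(worst_pattern, tweets[2]))[[1]]
worst <- data.frame(minutes = as.numeric(m2[2]), departs = m2[3], route = m2[4])

print(delayed)
print(worst)
```

In each case `regexec()` returns the full match followed by the captured groups, which is why the indexing starts at 2.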
I’ve created a GitHub repository with code and a report showing some ways in which this data can be explored.
The take-home message: expect delays somewhere most days but in particular on Monday mornings, when students return to school after the holidays, and if you’re travelling in the far south-west or north-west of the network.
- Last week was useR! conference time again, coming to you this time from Toulouse, France
- I’ve retrieved 8 318 tweets that mention #user2019 and run them through my report generator
- And here are the results
Take-home message this year: the R-Ladies rock!
I’m not saying this is a good idea, but bear with me.
This week we return to Australian Rules Football, the R package fitzRoy and some statistics to ask – why can’t Geelong win after a bye?
(with apologies to long-time readers who used to come for the science)
Why would you even ask that? Well, because this.
I sense problems immediately. First, the story is tagged “evolution”. The horns are not arising through inheritance of advantageous mutations, so that isn’t evolution.
Yes, last time I checked, horns were external and pointed upwards. The X-ray seems to show an internal, downward-pointing bone growth.
But wait, there’s more.
The commute to my workplace is 90 minutes each way. Podcasts are my friend. I’m a long-time listener of In Our Time and enjoyed the recent episode about The Danelaw.
Melvyn and I hail from the same part of the world, and I learned as a child that many of the local place names there were derived from Old Norse or Danish. Notably: places ending in -by denote a farmstead, settlement or village; those ending in -thwaite mean a clearing or meadow.
So how local are those names? Time for some quick and dirty maps using R.
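Before any mapping, the first step is simply to tag each place name by its ending. Here is a minimal sketch of that step, assuming a plain character vector of names. The `places` vector is a small hand-picked illustration, not the dataset behind the maps, and a real analysis would load a full gazetteer instead.

```r
# Illustrative place names only; a real analysis would read these from a gazetteer
places <- c("Whitby", "Selby", "Braithwaite", "Bassenthwaite", "Lancaster", "Kendal")

# Classify by suffix: -thwaite (clearing or meadow), -by (farmstead, settlement or village)
suffix_type <- ifelse(grepl("thwaite$", places), "-thwaite",
               ifelse(grepl("by$", places), "-by", "other"))

print(data.frame(place = places, suffix = suffix_type))
print(table(suffix_type))
```

With coordinates attached to each name, the `suffix` column is what you would map to colour or facet.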
When this blog moved from bioinformatics to data science I ran a Twitter poll to ask whether I should start afresh at a new site or continue here. “Continue here”, you said.
So let’s test the tolerance of the long-time audience and celebrate the start of the 2019 season as we venture into the world of – Australian football (AFL) statistics!
“Sydney stations where commuters fall through gaps, get stuck in lifts” blares the headline. The story tells us that:
Central Station, the city’s busiest, topped the list last year with about 54 people falling through gaps
Wow! Wait a minute…
Central Station, the city’s busiest
Some poking around in the NSW Transport Open Data portal reveals how many people enter every Sydney train station on a “typical” day in 2016, 2017 and 2018. We could manipulate those numbers in various ways to estimate total, unique passengers for FY 2017-18 but I’m going to argue that the value as-is serves as a proxy variable for “station busyness”.
Grabbing the numbers for 2017:
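library(tidyverse)

tibble(station = c("Central", "Circular Quay", "Redfern"),
       falls = c(54, 34, 18),
       entries = c(118960, 27870, 30570)) %>%
  mutate(falls_per_entry = falls/entries) %>%
  gather(Variable, Value, -station) %>%
  ggplot(aes(station, Value)) +
  # geom_col() and facet_wrap() are an assumed reconstruction of the missing plot layers
  geom_col() +
  facet_wrap(~Variable, scales = "free_y")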
Looks like Circular Quay has the bigger problem. Now we have a data story. More tourists? Maybe improve the signage.
Deep in the comment thread, amidst the “only themselves to blame” crowd, one person gets it: