# Extracting Sydney transport data from Twitter

The @sydstats Twitter account uses this code base and data from the Transport for NSW Open Data API to publish insights into delays on the Sydney Trains network.

Each tweet takes one of two forms and is consistently formatted, making it easy to parse and extract information. Here are a couple of examples; the interesting parts are the time window, the percentage of delayed trips, the length of the worst delay and the affected service:

Between 16:00 and 18:30 today, 26% of trips experienced delays. #sydneytrains

The worst delay was 16 minutes, on the 18:16 City to Berowra via Gordon service. #sydneytrains
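Because the two forms are so regular, a pair of regular expressions is enough to pull out the fields. What follows is a minimal sketch in base R, not the code from the repository: the patterns, the helper `extract_fields()` and the column names are my own assumptions, based only on the two example tweets above.

```r
# The two example tweets from above
tweets <- c(
  "Between 16:00 and 18:30 today, 26% of trips experienced delays. #sydneytrains",
  "The worst delay was 16 minutes, on the 18:16 City to Berowra via Gordon service. #sydneytrains"
)

# Assumed patterns, one per tweet form; each capture group holds one field
delay_pattern <- "^Between (\\d{2}:\\d{2}) and (\\d{2}:\\d{2}) today, (\\d+)% of trips experienced delays"
worst_pattern <- "^The worst delay was (\\d+) minutes?, on the (\\d{2}:\\d{2}) (.+) service"

# Hypothetical helper: return the capture groups of the first match
# as a character vector (empty if the pattern does not match)
extract_fields <- function(text, pattern) {
  regmatches(text, regexec(pattern, text, perl = TRUE))[[1]][-1]
}

delay_fields <- extract_fields(tweets[1], delay_pattern)
worst_fields <- extract_fields(tweets[2], worst_pattern)

# One row per tweet, with numeric fields converted from text
delays <- data.frame(
  from        = delay_fields[1],
  to          = delay_fields[2],
  pct_delayed = as.numeric(delay_fields[3])
)

worst <- data.frame(
  delay_minutes = as.numeric(worst_fields[1]),
  departure     = worst_fields[2],
  service       = worst_fields[3]
)
```

Anchoring each pattern at the start of the tweet keeps the two forms from being confused with one another. A real timeline would also need a check for tweets that match neither pattern.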

I’ve created a GitHub repository with code and a report showing some ways in which this data can be explored.

The take-home message: expect delays somewhere on most days, but particularly on Monday mornings, when students return to school after the holidays, and if you’re travelling in the far south-west or north-west of the network.

# This is not normal(ised)

“Sydney stations where commuters fall through gaps, get stuck in lifts” blares the headline. The story tells us that:

Central Station, the city’s busiest, topped the list last year with about 54 people falling through gaps

Wow! Wait a minute…

Central Station, the city’s busiest

Some poking around in the Transport for NSW Open Data portal reveals how many people enter each Sydney train station on a “typical” day in 2016, 2017 and 2018. We could manipulate those numbers in various ways to estimate total unique passengers for FY 2017-18, but I’m going to argue that the value as-is serves as a proxy for “station busyness”.

Grabbing the numbers for 2017:

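```r
library(tidyverse)

# falls: reported falls through gaps at each station
# entries: station entries on a "typical" day in 2017
tibble(station = c("Central", "Circular Quay", "Redfern"),
       falls   = c(54, 34, 18),
       entries = c(118960, 27870, 30570)) %>%
  # normalise falls by station busyness
  mutate(falls_per_entry = falls / entries) %>%
  select(-entries) %>%
  # long format: one row per station and variable
  gather(Variable, Value, -station) %>%
  ggplot(aes(station, Value)) +
  geom_col() +
  # free y-axes, because the two variables differ by orders of magnitude
  facet_wrap(~Variable, scales = "free_y")
```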

Looks like Circular Quay has the bigger problem. Now we have a data story. More tourists? Maybe improve the signage.
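To put numbers on what the chart shows, here is a quick sketch of the same arithmetic using the figures from the code above. The rate is a proxy, not a true incidence: it divides falls over the year by entries on one typical day, and the "per 100,000" scaling is my own choice for readability.

```r
# Same figures as in the plot above
falls   <- c(Central = 54, `Circular Quay` = 34, Redfern = 18)
entries <- c(Central = 118960, `Circular Quay` = 27870, Redfern = 30570)

# Falls per 100,000 typical-day entries (a proxy rate)
rate <- falls / entries * 1e5

# Rank stations from highest to lowest rate
ranked <- sort(rate, decreasing = TRUE)
round(ranked, 1)
```

That works out to roughly 122 for Circular Quay, 59 for Redfern and 45 for Central, which reverses the ordering given by the raw counts.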

Deep in the comment thread, amidst the “only themselves to blame” crowd, one person gets it: