Extracting Sydney transport data from Twitter

The @sydstats Twitter account uses this code base, and data from the Transport for NSW Open Data API to publish insights into delays on the Sydney Trains network.

Each tweet takes one of two forms and is consistently formatted, making it easy to parse and extract information. Here are a couple of examples with the interesting parts highlighted in bold:

Between 16:00 and 18:30 today, 26% of trips experienced delays. #sydneytrains

The worst delay was 16 minutes, on the 18:16 City to Berowra via Gordon service. #sydneytrains


I’ve created a Github repository with code and a report showing some ways in which this data can be explored.

The take-home message: expect delays somewhere most days but in particular on Monday mornings, when students return to school after the holidays, and if you’re travelling in the far south-west or north-west of the network.

Just use a scatterplot. Also, Sydney sprawls.

Dual-axes at tipping-point

Sydney’s congestion at ‘tipping point’ blares the headline and to illustrate, an interactive chart with bars for city population densities, points for commute times and of course, dual-axes.

Yuck. OK, I guess it does show that Sydney is one of three cities that are low density, but have comparable average commute times to higher-density cities. But if you’re plotting commute time versus population density…doesn’t a different kind of chart come to mind first? y versus x. C’mon.

Let’s explore.
Continue reading

Feels like a dry winter – but what does the data say?

Update Feb 9 2020: Weather Underground retired their free API in 2018 so the code in this post no longer works

A reminder that when idle queries pop into your head, the answer can often be found using R + online data. And a brief excursion into accessing the Weather Underground.

One interesting aspect of Australian life, even in coastal urban areas like Sydney, is that sometimes it just stops raining. For weeks or months at a time. The realisation hits slowly: at some point you look around at the yellow-brown lawns, ovals and “nature strips” and say “gee, I don’t remember the last time it rained.”

Thankfully in our data-rich world, it’s relatively easy to find out whether the dry spell is really as long as it feels. In Australia, meteorological data is readily available via the Bureau of Meteorology (known as BoM). Another source is the Weather Underground (WU), which has the benefit that there may be data from a personal weather station much closer to you than the BoM stations.

Here’s how you can access WU data using R and see whether your fuzzy recollection is matched by reality.
Continue reading

Chart golf: the “demographic tsunami”

“‘Demographic tsunami’ will keep Sydney, Melbourne property prices high” screams the headline.

While the census showed Australia overall is aging, there’s been a noticeable lift in the number of people aged between 25 to 32.
As the accompanying graph shows…

Whoa, that is one ugly chart. First thought: let’s not be too hard on Fairfax Media, they’ve sacked most of their real journalists and they took the chart from someone else. Second thought: if you want to visualise change over time, time as an axis rather than a coloured bar is generally a good idea.

Can we do better?
Continue reading

“Health Hack”: crossing the line between hackfest and unpaid labour

I’ve never attended a hackathon (hack day, hackfest or codefest). My impression of them is that there is, generally, a strong element of “working for the public good”: seeking to use code and data in new ways that maximise benefit and build communities.

Which is why I’m somewhat mystified by the projects on offer at the Sydney HealthHack. They read like tenders for consultants. Unpaid consultants.

The projects – a pedigree drawing tool, a workflow to process microscopy images, a statistical calculator and a mutation discovery pipeline – all describe problems that competent bioinformaticians could solve using existing tools in a relatively short time. For example, off the top of my head, ImageJ or CSIRO’s Workspace might be worth looking at for problem (2). The steps described in problem (4) – copy and paste between spreadsheets, manual inspection and manipulation of sequence data – should be depressingly familiar examples to many bioinformaticians. This project can be summarised simply as “you’re doing it wrong because you don’t know any better.”

The overall tone is “my research group requires this tool, but we’re unable to employ anyone to do it.” There is no sense of anything wider than the immediate needs of individual researchers. This does not seem, to me, what hackfest philosophy is all about.

This raises an issue that I think about a lot: how do we (the science community) best get the people with the expertise (in this case, bioinformaticians) to the people with the problems? In an ideal world the answer would be “everyone should employ at least one.” I wonder about the market (Australian or more generally) for paid consulting “biological data scientists”? We complain that we’re under-valued; well, perhaps it is we who are doing the valuation when we offer our skills for free.

My day out at #osddmalaria

Finally, I get around to telling you that…
…on Friday 24th February, I took a day out from my regular job to attend a meeting on Open Source Drug Discovery for Malaria. I should state straight away that whilst drug discovery and chem(o)informatics are topics that I find very interesting, I have no professional experience or connections in either area. However, it was an opportunity to learn more, listen to some great speakers, think about what bioinformaticians might be able to bring to the table and of course, finally meet Mat Todd in person. Mat, if you don’t know, is one of the few people on the planet who really does science online, as opposed to talking about science online.

Here’s what I learned – with just a little analysis using R later in the post, hence the statistics/R category.
Read the rest…