Taking steps (in XML)

So the votes are in:

I thank you, kind readers. So here’s the plan: (1) keep blogging here as frequently as possible (perhaps monthly), (2) focus on more general “how to do cool stuff with data and R” topics, (3) which may still include biology from time to time. Sounds OK? Good.

So: let’s use R to analyse data from the iOS Health app.


“Health Hack”: crossing the line between hackfest and unpaid labour

I’ve never attended a hackathon (hack day, hackfest or codefest). My impression of them is that there is, generally, a strong element of “working for the public good”: seeking to use code and data in new ways that maximise benefit and build communities.

Which is why I’m somewhat mystified by the projects on offer at the Sydney HealthHack. They read like tenders for consultants. Unpaid consultants.

The projects – a pedigree drawing tool, a workflow to process microscopy images, a statistical calculator and a mutation discovery pipeline – all describe problems that competent bioinformaticians could solve with existing tools in relatively short order. For example, off the top of my head, ImageJ or CSIRO’s Workspace might be worth a look for the microscopy image workflow. The steps described for the mutation discovery pipeline – copying and pasting between spreadsheets, manual inspection and manipulation of sequence data – will be depressingly familiar to many bioinformaticians. That project can be summarised simply as “you’re doing it wrong because you don’t know any better.”

The overall tone is “my research group requires this tool, but we’re unable to employ anyone to do it.” There is no sense of anything wider than the immediate needs of individual researchers. This does not seem, to me, what hackfest philosophy is all about.

This raises an issue that I think about a lot: how do we (the science community) best get the people with the expertise (in this case, bioinformaticians) to the people with the problems? In an ideal world the answer would be “everyone should employ at least one.” I wonder what the market (Australian or otherwise) is for paid consulting “biological data scientists”. We complain that we’re under-valued; well, perhaps it is we who are doing the valuation when we offer our skills for free.

Friday fun with: Google Trends

Some years ago, Google discovered that when people are concerned about influenza, they search for flu-related information and that to some extent, search traffic is an indicator of flu activity. Google Flu Trends was born.


Google Trends: bronchitis

Illness is sweeping through our department this week and I have succumbed. It’s not flu but at one point, I did wonder if my symptoms were those of bronchitis. Remembering Google Flu Trends, I thought I’d try my query for “bronchitis” at Google Trends, where I saw the chart shown at right.

Interesting. Clearly seasonal, peaking around the end and start of each year – winter, for those of you in the northern hemisphere.


To compare hemispheres, head to Google Trends and:

  • select USA and Australia as regions
  • download the data in CSV format (I chose fixed scaling) and rename the files “us.csv” and “aus.csv”
  • edit the files a little to retain only the “Week, bronchitis, bronchitis (std error)” section
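After trimming, each file should contain just that header row followed by one line per week. The counts below are made up purely for illustration; the week format is assumed to match the “%b %d %Y” pattern that the R code parses:

```
Week,bronchitis,bronchitis (std error)
Jan 4 2004,45,2
Jan 11 2004,47,2
Jan 18 2004,44,2
```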

Fire up your R console and try this:

# ggplot2 is required for the plot below
library(ggplot2)

us <- read.table("us.csv", header = TRUE, sep = ",")
aus <- read.table("aus.csv", header = TRUE, sep = ",")
# add a region column
us$region <- "usa"
aus$region <- "aus"
# combine data
alldata <- rbind(us, aus)
# add a date column; strptime() parses dates like "Jan 4 2004"
alldata$week <- strptime(alldata$Week, format = "%b %d %Y")
# and plot the non-zero values
ggplot(alldata[alldata$bronchitis > 0, ], aes(as.Date(week), bronchitis)) +
  geom_line(aes(color = region)) +
  xlab("Date")


Google Trends: bronchitis, USA + Australia

Result shown at right.

That’s not unexpected, but it’s rather nice. In the USA, peak searches for “bronchitis” coincide with troughs in Australia and vice versa. The reason, of course, is that searches peak during winter in both regions, and winter in the USA (northern hemisphere) coincides with summer in Australia (southern hemisphere).

There must be all sorts of interesting and potentially useful information buried away in web usage data. I guess that’s why so many companies are investing in it. However, for those of us more interested in analysing data than marketing – what else is “out there”? Can we “do science” with it? How many papers are published using data gathered only from the Web?