Some years ago, Google discovered that when people are concerned about influenza, they search for flu-related information, and that search traffic is, to some extent, an indicator of flu activity. Google Flu Trends was born.
Google Trends: bronchitis
Illness is sweeping through our department this week and I have succumbed. It’s not flu but at one point, I did wonder if my symptoms were those of bronchitis. Remembering Google Flu Trends, I thought I’d try my query for “bronchitis” at Google Trends, where I saw the chart shown at right.
Interesting. Clearly seasonal, peaking around the last and first months of each year: winter, for those of you in the northern hemisphere.
To compare the two hemispheres, go back to Google Trends and:
- select USA and Australia as regions
- download the data in CSV format (I chose fixed scaling) and rename the files “us.csv” and “aus.csv”
- edit the files a little to retain only the “Week, bronchitis, bronchitis (std error)” section
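If you'd rather not edit the files by hand, that last step could probably be scripted in R instead. What follows is only a minimal sketch: `extract_weekly` is a hypothetical helper, and it assumes the weekly section starts at a header line beginning with "Week," and ends at the first blank line after it. The sample lines are made up to stand in for a downloaded file, so check the layout of your own download first.

```r
# Hypothetical helper: keep only the weekly section of a raw export.
# Assumes the section starts at a header line beginning with "Week,"
# and ends at the first blank line after it (or at the end of the file).
extract_weekly <- function(lines) {
  start <- grep("^Week,", lines)[1]
  if (is.na(start)) stop("no line starting with 'Week,' found")
  blank <- which(trimws(lines) == "")
  blank <- blank[blank > start]
  end <- if (length(blank) > 0) blank[1] - 1 else length(lines)
  read.csv(text = paste(lines[start:end], collapse = "\n"),
           header = TRUE, stringsAsFactors = FALSE)
}

# Made-up lines standing in for a downloaded file (not real Google Trends values)
raw <- c("Some preamble",
         "",
         "Week,bronchitis,bronchitis (std error)",
         "Jan 4 2004,1.10,5%",
         "Jan 11 2004,1.05,5%",
         "",
         "Region,bronchitis")
us_demo <- extract_weekly(raw)
```

With a real download, something like `extract_weekly(readLines("us.csv"))` would take the place of the hand edit, provided the file really is laid out as assumed.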
Fire up your R console and try this:
library(ggplot2)
# read the edited CSV files
us <- read.table("us.csv", header = TRUE, sep = ",")
aus <- read.table("aus.csv", header = TRUE, sep = ",")
# add a region column
us$region <- "usa"
aus$region <- "aus"
# combine data
alldata <- rbind(us, aus)
# add a date column; as.Date avoids storing a POSIXlt object in the data frame
# (%b matches abbreviated month names in the current locale, e.g. "Jan")
alldata$week <- as.Date(alldata$Week, format = "%b %d %Y")
# and plot the non-zero values
ggplot(alldata[alldata$bronchitis > 0, ], aes(week, bronchitis)) +
  geom_line(aes(color = region)) +
  xlab("Date")
Google Trends: bronchitis, USA + Australia
Result shown at right.
That’s not unexpected, but it’s rather nice. In the USA, peak searches for “bronchitis” coincide with troughs in Australia and vice-versa. The reason, of course, is that searches peak in both regions during winter, but winter in the USA (northern hemisphere) occurs during the southern summer (and again, vice-versa).
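One way to put a number on that mirror-image pattern would be to correlate the two series week by week. Here's a minimal sketch of the idea. Note that `toy` is synthetic data invented for illustration (two cosine curves half a year out of phase), not the real downloads, so the result says nothing about actual search volumes.

```r
# Toy stand-in for the real data: two seasonal series half a year out of phase.
# These numbers are synthetic, not Google Trends values.
weeks <- seq(as.Date("2004-01-04"), by = "week", length.out = 104)
phase <- 2 * pi * as.numeric(format(weeks, "%j")) / 365
toy <- rbind(
  data.frame(week = weeks, bronchitis = 1 + 0.5 * cos(phase), region = "usa"),
  data.frame(week = weeks, bronchitis = 1 - 0.5 * cos(phase), region = "aus")
)

# Put the two regions side by side, one row per week
wide <- merge(toy[toy$region == "usa", c("week", "bronchitis")],
              toy[toy$region == "aus", c("week", "bronchitis")],
              by = "week", suffixes = c(".usa", ".aus"))

# A correlation near -1 means peaks in one region line up with troughs in the other
r <- cor(wide$bronchitis.usa, wide$bronchitis.aus)
```

To try this on the real thing, swap `toy` for the combined data frame from earlier, making sure its date column is of class Date before merging. Real search data is noisier than a cosine, so expect a correlation that is negative but some way short of -1.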
There must be all sorts of interesting and potentially useful information buried away in web usage data. I guess that’s why so many companies are investing in it. However, for those of us more interested in analysing data than marketing – what else is “out there”? Can we “do science” with it? How many papers are published using data gathered only from the Web?