Some years ago, Google discovered that when people are concerned about influenza, they search for flu-related information, and that search traffic is, to some extent, an indicator of flu activity. Google Flu Trends was born.
Illness is sweeping through our department this week and I have succumbed. It’s not flu but at one point, I did wonder if my symptoms were those of bronchitis. Remembering Google Flu Trends, I thought I’d try a query for “bronchitis” at Google Trends, where I saw the chart shown at right. Interesting. Clearly seasonal, peaking around the last and first months of each year. Winter, for those of you in the northern hemisphere.
Next:
- select USA and Australia as regions
- download the data in CSV format (I chose fixed scaling), rename files “us.csv” and “aus.csv”
- edit the files a little to retain only the “Week, bronchitis, bronchitis (std error)” section
Fire up your R console and try this:
    library(ggplot2)

    # read the edited CSV files
    us <- read.table("us.csv", header = TRUE, sep = ",")
    aus <- read.table("aus.csv", header = TRUE, sep = ",")
    # add a region column
    us$region <- "usa"
    aus$region <- "aus"
    # combine data
    alldata <- rbind(us, aus)
    # add a date column (as.Date avoids storing a POSIXlt list in the data frame)
    alldata$week <- as.Date(alldata$Week, format = "%b %d %Y")
    # and plot the non-zero values
    ggplot(alldata[alldata$bronchitis > 0, ], aes(week, bronchitis)) +
      geom_line(aes(color = region)) +
      xlab("Date")

The result is shown at right.
That’s not unexpected, but it’s rather nice. In the USA, peak searches for “bronchitis” coincide with troughs in Australia and vice versa. The reason, of course, is that searches peak in both regions during winter, but winter in the USA (northern hemisphere) falls during the southern summer (and again, vice versa).
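If you want a number to go with the picture, one rough option is to line the two series up by week and correlate them. The sketch below assumes the `us` and `aus` data frames read in earlier, each with `Week` and `bronchitis` columns; the helper name `region_correlation` is my own invention, not part of any package. If the two regions really are out of phase, the result should be negative.

```r
# Hypothetical helper: correlate two weekly search series.
# Assumes both data frames have the columns "Week" and "bronchitis".
region_correlation <- function(a, b) {
  # align the two series on the weeks they have in common
  both <- merge(a[, c("Week", "bronchitis")],
                b[, c("Week", "bronchitis")],
                by = "Week", suffixes = c(".a", ".b"))
  # keep only weeks where both series are non-zero, mirroring the plot filter
  both <- both[both$bronchitis.a > 0 & both$bronchitis.b > 0, ]
  # Pearson correlation between the two aligned series
  cor(both$bronchitis.a, both$bronchitis.b)
}

# usage, once us and aus have been read in:
# region_correlation(us, aus)
```

A plain correlation ignores the long-term trend in each series and any lag between the regions, so treat it as a quick sanity check rather than an analysis.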
There must be all sorts of interesting and potentially useful information buried away in web usage data. I guess that’s why so many companies are investing in it. However, for those of us more interested in analysing data than marketing – what else is “out there”? Can we “do science” with it? How many papers are published using data gathered only from the Web?
fun post, neil.
we did some research on the value of search for prediction tasks that you might find interesting. see sharad’s blog post or our paper for more info.
A paper in PLoS One last year studied seasonal trends in depression using Google searches (http://www.ncbi.nlm.nih.gov/pubmed/21060851). Like your US/Australia comparison, they also used geographical information to show that the trends are similar, but offset, in the northern and southern hemispheres. Pretty interesting story, unless of course you live at extreme latitudes.
ngrams is a potential tool for this sort of thing, basically Google Trends but on a longer scale.
You can, for instance, show that computers cause cancer:
http://ngrams.googlelabs.com/graph?content=cancer%2Ccomputer&year_start=1800&year_end=2000&corpus=0&smoothing=0
Funny.
I have no faith in the ngrams data whatsoever. In addition to the well-known problems with OCR, a lot of the records seem simply to have the wrong date. A lot of hype and fuss around “big data”, with no discussion of data quality.
Good thing you made those checks.
Doing the trend for breast cancer indicates it’s very contagious in the United States around Breast Cancer Awareness Month!