Sometimes, several strands of thought come together in one place. For me right now, it’s the Wikipedia page “Ebola virus epidemic in West Africa”, which got me thinking about the perennial topic of “data wrangling”, how best to provide public data and why I can’t shake my irritation with the term “data science”. Not to mention Ebola, of course.
I imagine that a lot of people with an interest in biological data are following this story and thinking “how can I visualise the numbers for myself?” Maybe you’d like to reproduce the plots in the Timeline section of that Wikipedia entry. Surprise: the raw numbers are not that easy to obtain.
2014-09-26 note: when Wikipedia pages change, as this one has, code breaks, as this code has; updates are maintained at GitHub
The Wikipedia page includes a data table, which is a starting point. It’s not especially well-designed and the notes underneath suggest that a large amount of manual intervention was required to obtain the numbers.
The last column contains hyperlinked references. Now we see why so much manual work was required. The citations link out to two main types of information:
- Paragraphs of free text with numbers, somewhere in amongst it, like this example
- Infographic-style reports in PDF format, like this example
That’s more wrangling than I have time for just now; OK, so the Wikipedia table it is. Still a little more “wrangling” to get the data out of that HTML table.
edited 2014-09-24 based on comment from Rainer
```r
library(XML)
library(ggplot2)
library(reshape2)

# get all tables on the page
ebola <- readHTMLTable("http://en.wikipedia.org/wiki/Ebola_virus_epidemic_in_West_Africa",
                       stringsAsFactors = FALSE)

# thankfully our table has a name; it is table #5
# this is not something you can really automate
head(names(ebola))
# "Ebola virus epidemic in West Africa"
# "Nigeria Ebola areas-2014"
# "Treatment facilities in West Africa"
# "Democratic Republic of Congo-2014"
# "Ebola cases and deaths by country and by date"
# "NULL"
ebola <- ebola$`Ebola cases and deaths by country and by date`

# again, manual examination reveals that we want rows 2-71 and columns 1-3
ebola.new <- ebola[2:nrow(ebola), 1:3]
colnames(ebola.new) <- c("date", "cases", "deaths")

# need to fix up a couple of cases that contain text other than the numbers
ebola.new$cases[nrow(ebola.new)-43] <- "759"
ebola.new$deaths[nrow(ebola.new)-43] <- "467"

# get rid of the commas; convert to numeric
ebola.new$cases <- gsub(",", "", ebola.new$cases)
ebola.new$cases <- as.numeric(ebola.new$cases)
ebola.new$deaths <- gsub(",", "", ebola.new$deaths)
ebola.new$deaths <- as.numeric(ebola.new$deaths)

# the days in the dates are encoded 1-31
# are we there yet? quick and dirty attempt to reproduce Wikipedia plot
ebola.m <- melt(ebola.new)
ggplot(ebola.m, aes(as.Date(date, "%e %b %Y"), value)) +
  geom_point(aes(color = variable)) +
  coord_trans(y = "log10") +
  xlab("Date") +
  labs(title = "Cumulative totals log scale") +
  theme_bw()
```
Result: a plot of cumulative cases and deaths by date, on a log scale.
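The hard-coded fixes and the comma stripping in the code above are fragile, since they depend on row positions that shift whenever the page is edited. One possible alternative is a small helper that pulls the first number out of each cell. This is only a sketch: `clean_count` is a hypothetical name of mine, and it assumes the figure you want is always the first run of digits (commas allowed) in the cell, which may not hold for every row of the Wikipedia table.

```r
# Hypothetical helper, not part of the original script.
# Assumption: the value of interest is the FIRST run of digits in the cell,
# optionally containing thousands separators, e.g. "1,234" or "759 (note)".
clean_count <- function(x) {
  # position of the first digit run in each element (-1 if none, NA if x is NA)
  pos <- regexpr("[0-9][0-9,]*", x)
  hit <- !is.na(pos) & pos > 0

  # cells without any digits stay NA
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, pos)

  # strip the commas and convert to numeric
  as.numeric(gsub(",", "", out))
}

# Small made-up vector to show the behaviour (not real table data)
clean_count(c("1,234", "759 (revised)", "n/a", "12"))
```

If that assumption holds for the table, `ebola.new$cases <- clean_count(ebola.new$cases)` (and the same for `deaths`) could stand in for the manual fixes. It is still worth eyeballing the result, because a cell that starts with some other number, such as a footnote marker, would be parsed wrongly.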
We can complain: if only the WHO, CDC and other organisations provided data as a web service. Or even as files in CSV format. Anything but PDF. But right now at least, they do not. So hats off to the heroic efforts of the Wikipedian so-called “data janitors”. From that article:
“Data wrangling is a huge — and surprisingly so — part of the job,” said Monica Rogati, vice president for data science at Jawbone
Surprising? Not to scientists (who don’t qualify the profession with the redundant word “data”). “Key hurdle to insights”, says the article title? Not really – just part and parcel of the job. I’d even argue that effective wrangling is where most of the skills are required. So perhaps, think twice before belittling people’s extensive skill sets with terms like “janitor”. You might need them to wrangle your data some day.