Ebola, Wikipedia and data janitors

Sometimes, several strands of thought come together in one place. For me right now, it’s the Wikipedia page “Ebola virus epidemic in West Africa”, which got me thinking about the perennial topic of “data wrangling”, how best to provide public data and why I can’t shake my irritation with the term “data science”. Not to mention Ebola, of course.

I imagine that a lot of people with an interest in biological data are following this story and thinking “how can I visualise the numbers for myself?” Maybe you’d like to reproduce the plots in the Timeline section of that Wikipedia entry. Surprise: the raw numbers are not that easy to obtain.

2014-09-26 note: when Wikipedia pages change, as this one has, code breaks, as this code has; updates are maintained at GitHub

Ebola cases and deaths by country and by date, from Wikipedia

The Wikipedia page includes a data table, which is a starting point. It’s not especially well-designed (the screenshot captioned above shows the headers and a few rows) and the notes underneath suggest that a large amount of manual intervention was required to obtain the numbers.

The last column contains hyperlinked references. Now we see why so much manual work was required. The citations link out to two main types of information:

  1. Paragraphs of free text with the numbers buried somewhere in them, like this example
  2. Infographic-style reports in PDF format, like this example

That’s more wrangling than I have time for just now, so the Wikipedia table it is. It still takes a little “wrangling” to get the data out of that HTML table.

edited 2014-09-24 based on comment from Rainer

library(XML)
library(ggplot2)
library(reshape2)

# get all tables on the page
ebola <- readHTMLTable("http://en.wikipedia.org/wiki/Ebola_virus_epidemic_in_West_Africa", 
                  stringsAsFactors = FALSE)
# thankfully our table has a name; it is table #5
# this is not something you can really automate
head(names(ebola))
# [1] "Ebola virus epidemic in West Africa"          
# [2] "Nigeria Ebola areas-2014"                     
# [3] "Treatment facilities in West Africa"          
# [4] "Democratic Republic of Congo-2014"            
# [5] "Ebola cases and deaths by country and by date"
# [6] "NULL"

ebola <- ebola$`Ebola cases and deaths by country and by date`

# again, manual examination reveals that we want every row from 2 onwards and columns 1-3
ebola.new <- ebola[2:nrow(ebola), 1:3]
colnames(ebola.new) <- c("date", "cases", "deaths")

# need to fix up a couple of cases that contain text other than the numbers
ebola.new$cases[nrow(ebola.new)-43]  <- "759"
ebola.new$deaths[nrow(ebola.new)-43] <- "467"

# get rid of the commas; convert to numeric
ebola.new$cases  <- gsub(",", "", ebola.new$cases)
ebola.new$cases  <- as.numeric(ebola.new$cases)
ebola.new$deaths <- gsub(",", "", ebola.new$deaths)
ebola.new$deaths <- as.numeric(ebola.new$deaths)

# days of the month appear as 1-31 without zero padding, so the format below uses %e
# are we there yet? quick and dirty attempt to reproduce Wikipedia plot
ebola.m <- melt(ebola.new, id.vars = "date")
ggplot(ebola.m, aes(as.Date(date, "%e %b %Y"), value)) + 
       geom_point(aes(color = variable)) + 
       coord_trans(y = "log10") + xlab("Date") + 
       labs(title = "Cumulative totals log scale") + 
       theme_bw() 
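The script above repairs the two cells that contain extra text by overwriting them at a fixed offset from the end of the table, which only works while the table keeps its current layout. One possible alternative, sketched here under assumptions, is to pull the first run of digits and commas out of every cell. The helper clean_count() and the example strings are made up for illustration; the actual contents of the problem cells are not shown in this post, so the output should be checked against the table before it is trusted.

```r
# Hypothetical helper, not part of the original script.
# Keeps the first run of digits/commas in each string and converts it to a number.
clean_count <- function(x) {
  # capture the first run of digits and commas; drop everything around it
  first_run <- sub("^[^0-9]*([0-9][0-9,]*).*$", "\\1", x)
  # strings with no digits are left unchanged by sub(), so mark them missing
  first_run[!grepl("[0-9]", x)] <- NA_character_
  # remove thousands separators and convert
  as.numeric(gsub(",", "", first_run))
}

# invented examples of the kind of cell content that might appear
raw <- c("6,263", "759[note 1]", "467 (approx.)", "n/a")
clean_count(raw)
```

Note that this takes only the first number in a cell, so a cell holding two figures would silently lose the second.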

First attempt to reproduce a plot from the Wikipedia page

Result: shown in the figure captioned above.
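One caveat about the as.Date() call in the plotting code: the %b format matches abbreviated month names in the current locale, so on a system that is not set to English the same call can return NA for every date. A minimal sketch of a parser that does not depend on the locale follows, assuming every date has the form "day Mon year" with English abbreviations. The name parse_wiki_date() is hypothetical; month.abb is the constant vector of English month abbreviations built into R.

```r
# Hypothetical helper, not part of the original script.
# Parses strings such as "7 Sep 2014" without consulting the session locale.
parse_wiki_date <- function(x) {
  # split on runs of spaces after removing leading/trailing whitespace
  parts <- strsplit(trimws(x), " +")
  day   <- as.integer(vapply(parts, `[`, character(1), 1))
  # look the abbreviation up in R's built-in English month names
  month <- match(vapply(parts, `[`, character(1), 2), month.abb)
  year  <- as.integer(vapply(parts, `[`, character(1), 3))
  # a purely numeric format is parsed the same way in every locale;
  # anything that failed to match above ends up as NA
  as.Date(sprintf("%04d-%02d-%02d", year, month, day), format = "%Y-%m-%d")
}

parse_wiki_date(c("21 Sep 2014", "7 Sep 2014", " 1 Jun 2014"))
```

Switching the session to an English locale before calling as.Date() is the other common fix, but the locale names differ between operating systems.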

We can complain: if only the WHO, CDC and other organisations provided data as a web service. Or even as files in CSV format. Anything but PDF. But right now at least, they do not. So hats off to the heroic efforts of the Wikipedians, the so-called “data janitors”. From that article:

“Data wrangling is a huge — and surprisingly so — part of the job,” said Monica Rogati, vice president for data science at Jawbone

Surprising? Not to scientists (who don’t qualify the profession with the redundant word “data”). “Key hurdle to insights”, says the article title? Not really – just part and parcel of the job. I’d even argue that effective wrangling is where most of the skills are required. So perhaps, think twice before belittling people’s extensive skill sets with terms like “janitor”. You might need them to wrangle your data some day.

13 thoughts on “Ebola, Wikipedia and data janitors”

  1. Pingback: Ebola, Wikipedia and data janitors | Tools and ...

  2. Chris Stubben

    You can also get the Ebola data at https://github.com/cmrivers/ebola. And the Data link on the Ebola Portal at WHO says “Data will be made available for open access in the coming days. All data will be made available via open format downloads as well as through an open access API” – sounds nice, but who knows how long that will take?

    1. nsaunders Post author

      Yay for Github, again! Good to know that someone at WHO is looking at their portal but as you say, when? I hope for the day when data provision is the first thing organisations think about, not the afterthought that it often is right now.

  3. Pingback: Ebola, Wikipedia and data janitors | R for Jour...

  4. Kai

    Mr. Saunders, thank you very much for the crystal clear illustration of how you processed the data.

    I have one question regarding the way you handled the date data. Shouldn’t we use “%d %b %Y” in line 37? By the way, when I execute the following code:

    as.Date(ebola.m$date, '%d %b %Y')

    I can only have a bunch of NA values. (I have tried other formats of dates using the as.Date command and all worked well. Strangely, it did not work in your case.)

    1. nsaunders Post author

      I’m happy that the post was useful!

      The format string for the day has to be “%e” not “%d”, because single digit days are preceded by a space, not by a zero – for example “ 1 Jun 2014”, not “01 Jun 2014”. I always refer to this reference to remind myself of the formats.

      1. Kai

        Thank you for the reference, Mr. Saunders! That helps!

        However, if I print out “ebola.m$date”, all the date values, stored as character strings, have lost the leading space:

        > ebola.m$date
          [1] "21 Sep 2014" "17 Sep 2014" "14 Sep 2014" "10 Sep 2014" "7 Sep 2014" 
          [6] "3 Sep 2014"  "31 Aug 2014" "25 Aug 2014" "20 Aug 2014" "18 Aug 2014"
         [11] "16 Aug 2014" "13 Aug 2014" "11 Aug 2014" "9 Aug 2014"  "6 Aug 2014" 
         [16] "4 Aug 2014"  "1 Aug 2014"  "30 Jul 2014" "27 Jul 2014" "23 Jul 2014"
         [21] "20 Jul 2014" "17 Jul 2014" "14 Jul 2014" "12 Jul 2014" "8 Jul 2014" 
         [26] "6 Jul 2014"  "2 Jul 2014"  "30 Jun 2014" "22 Jun 2014" "20 Jun 2014"
         [31] "17 Jun 2014" "16 Jun 2014" "15 Jun 2014" "10 Jun 2014" "6 Jun 2014" 
        .......
        

        This is really annoying. (I’m sorry for troubling you with such error messages.)

      2. Kai

        And strangely:

        > test = '11 Sep 2014'
        > as.Date(test, "%e %b %Y")
        [1] NA
        

        This is making me sleepless :-(

    2. Kai

      After extensive searches online, I finally solved it by changing the system locale from Chinese to English. Now ‘%b’ works just fine.
      … talking about the cons of working on a non-English system

      Thank you again for your patient explanation!

  5. Rainer Hurling

    Many thanks for your script. You showed us an easy way to get epidemic Ebola data from Wikipedia. Because the data source (the table at Wikipedia) grows from time to time, it would be interesting to read the data automatically, even when the table is longer. As far as I can tell, only small changes in two places are needed for this:

    line 21:
    ebola.new <- ebola[-1, 1:3] # The '-1' uses all but the first entry.

    lines 25+26:
    ebola.new$cases[dim(ebola.new)[1]-43] <- "759"
    ebola.new$deaths[dim(ebola.new)[1]-43] <- "467"

    New data is filled into the table from top, so your entry 'no. 27' is not valid any more. Looking from the end of table backwards should solve this …

    Thanks again,
    Rainer Hurling

    1. nsaunders Post author

      Great, thanks! Yes, I noticed the updates but have not had time to improve the code, so thanks for your contribution. I’ve edited the code to use nrow() in both places.

      1. Rainer Hurling

        I just noticed your approach using nrow() instead of [-1,] and dim()[1] to read in a table of unknown length. It is much easier to understand now, than with my approach :)
