Taking steps (in XML)

So the votes are in:

I thank you, kind readers. So here’s the plan: (1) keep blogging here as frequently as possible (perhaps monthly), (2) on more general “how to do cool stuff with data and R” topics, (3) which may still include biology from time to time. Sounds OK? Good.

So: let’s use R to analyse data from the iOS Health app.

I own an iPhone. It comes with a Health app installed by default. I’m not a big user of mobile apps, so it was several months before I opened it and realised that it had been collecting data. Which makes me wonder what else the phone does without my knowledge… but back to the topic. It turns out that health data can be exported by tapping at top-right on the overview page, then choosing export.


This generates a compressed file, ios_health_export.zip. Upload it from your phone to your destination of choice; I went with Google Drive.

This being Apple, I’d assumed that the contents might be some hideous proprietary binary format, but in fact unzipping the file reveals a directory, apple_health_export, containing two XML files. The larger one, export.xml, contains your health data.

Each record in the XML file is a single line that specifies the record type (measurement), the source, the unit, three timestamps (creation, start and end) and the value of the measurement. Most of my records are step counts, which look like this:

<Record type="HKQuantityTypeIdentifierStepCount" sourceName="Health" unit="count" creationDate="2014-09-24 09:25:06 +1100" startDate="2014-09-23 18:01:22 +1100" endDate="2014-09-23 18:01:24 +1100" value="9"/>
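To see how those attributes look from R, here is a minimal sketch that parses the single record above from a string with xml2. The names record_xml, record and attrs are just for illustration; they are not used elsewhere in this post.

```r
library(xml2)

# The example record from above, held as a string (illustration only)
record_xml <- '<Record type="HKQuantityTypeIdentifierStepCount" sourceName="Health" unit="count" creationDate="2014-09-24 09:25:06 +1100" startDate="2014-09-23 18:01:22 +1100" endDate="2014-09-23 18:01:24 +1100" value="9"/>'

# read_xml() accepts literal XML as well as a file path
record <- read_xml(record_xml)

# xml_attrs() returns a named character vector, one element per attribute
attrs <- xml_attrs(record)
names(attrs)

# Every attribute comes back as text, so the count needs converting
step_value <- as.integer(attrs[["value"]])
step_value
```

The point to take away is that everything arrives as character data, which is why type conversion turns up later on.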

And so to R. In the past I would have used the XML package but in my ongoing effort to convert to the “tidyverse”, I’ll try xml2 instead. We’ll use purrr too for reasons that will become apparent, ggplot2 for plotting and dplyr because it is awesome.

Reading in the file could not be easier:

library(xml2)
library(purrr)
library(ggplot2)
library(dplyr)

health_data <- read_xml("export.xml")

Nor could extracting the records that contain step counts. We use an XPath expression, then pipe the result to purrr’s mapping functions to go straight from XML attributes to a data frame, as described here.

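steps <- xml_find_all(health_data, ".//Record[@type='HKQuantityTypeIdentifierStepCount']") %>%
  # one named character vector of attributes per record
  map(xml_attrs) %>%
  # each vector becomes one row of the data frame
  map_df(as.list)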

glimpse(steps)
Observations: 188,677
Variables: 9
$ type          <chr> "HKQuantityTypeIdentifierStepCount", "HKQuantityTypeIdentifierStepCount", "HKQuantityTypeIdentifierStepCount", "HKQuantityTypeIdentifierStepCount", ...
$ sourceName    <chr> "Health", "Health", "Health", "Health", "Health", "Health", "Health", "Health", "Health", "Health", "Health", "Health", "Health", "Health", "Health"...
$ unit          <chr> "count", "count", "count", "count", "count", "count", "count", "count", "count", "count", "count", "count", "count", "count", "count", "count", "cou...
$ creationDate  <chr> "2014-09-24 09:25:06 +1100", "2014-09-24 09:25:06 +1100", "2014-09-24 09:25:06 +1100", "2014-09-24 09:25:06 +1100", "2014-09-24 09:25:06 +1100", "20...
$ startDate     <chr> "2014-09-23 17:58:58 +1100", "2014-09-23 17:59:08 +1100", "2014-09-23 17:59:18 +1100", "2014-09-23 17:59:28 +1100", "2014-09-23 17:59:58 +1100", "20...
$ endDate       <chr> "2014-09-23 17:59:03 +1100", "2014-09-23 17:59:13 +1100", "2014-09-23 17:59:23 +1100", "2014-09-23 17:59:33 +1100", "2014-09-23 18:00:03 +1100", "20...
$ value         <chr> "12", "5", "17", "1", "14", "4", "10", "2", "4", "2", "9", "7", "4", "9", "7", "6", "11", "13", "6", "8", "5", "8", "6", "9", "1", "7", "13", "6", "...
$ sourceVersion <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
$ device        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
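Note that every column is character. As a rough sketch of how the types might be tidied up, here is a two-row stand-in for the data (values copied from the output above) with the count converted to an integer and the start time parsed as a date-time. The names steps_example and steps_typed are made up for this example, and the choice of UTC for display is an assumption, not something the export dictates.

```r
library(dplyr)

# Tiny stand-in for the steps data frame (first two rows of the output above)
steps_example <- data.frame(
  startDate = c("2014-09-23 17:58:58 +1100", "2014-09-23 17:59:08 +1100"),
  value     = c("12", "5"),
  stringsAsFactors = FALSE
)

steps_typed <- steps_example %>%
  mutate(
    # %z reads the "+1100" offset; the result is stored as an instant
    # and shown here in UTC (an arbitrary choice for this sketch)
    startDate = as.POSIXct(startDate, format = "%Y-%m-%d %H:%M:%S %z", tz = "UTC"),
    # counts are whole numbers, so integer is a natural fit
    value     = as.integer(value)
  )

str(steps_typed)
```

Be aware that shifting to UTC can move a timestamp onto a different calendar day from the one written in the string, which matters if you go on to aggregate by day.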

To illustrate an example analysis, let’s aggregate steps to a monthly count and plot counts by month. We’ll assume that startDate is a proxy for day (i.e. I’m not walking at midnight so steps don’t straddle day boundaries). We’ll also assign the monthly count to the first day of the month, to avoid having to figure out what number day ends the month :)
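The “first day of the month” trick is just string handling, so here it is in isolation as a small sketch. The first two start dates come from the output above; the third is invented purely to show a second month. The names start_dates, year_month and month_start are illustrative.

```r
# Two start dates from the data, plus one made-up October date for contrast
start_dates <- c(
  "2014-09-23 17:58:58 +1100",
  "2014-09-23 17:59:08 +1100",
  "2014-10-02 08:15:00 +1100"  # hypothetical, not from the export
)

# Characters 1 to 7 are the "YYYY-MM" part of the timestamp
year_month <- substr(start_dates, 1, 7)

# Appending "-01" gives a string that as.Date() reads as the first of the month
month_start <- as.Date(paste(year_month, "01", sep = "-"))
month_start
```

Because the month is taken straight from the text, it reflects the local date as written in the export, with no time zone conversion involved.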

So, to recode step count as a number, convert the start date to a date object, summarise by month and plot, let’s see dplyr in action:

steps %>%
  select(startDate, value) %>%
  # assign each record to the first day of its month
  group_by(Date = as.Date(paste(substr(startDate, 1, 7), "01", sep = "-"))) %>%
  # value is stored as text, so convert it before summing
  summarise(count = sum(as.numeric(value))) %>%
  ggplot(aes(Date, count)) +
  geom_col(fill = "skyblue3") +
  theme_bw() +
  labs(y = "monthly step count", title = "Steps by month September 2014 - January 2017 as measured by iOS")

Result:

[Figure: ios_steps, bar chart of monthly step count by date]

As to how accurate the counts are: that’s for another day.