Data discovery: seasonal speed

Just writing this one quickly as it’s been hanging around my browser tabs for weeks…

I wrote Taking steps (in XML) almost 7 years ago and, once in a while, I still grab Apple Health data from my phone and play around with it in R for a few minutes. Sometimes, fitting a curve to a cloud of points generates a surprise.

library(tidyverse)
library(lubridate)  # for ymd_hms(); only attached by tidyverse from version 2.0.0
library(xml2)
theme_set(theme_bw())

# read the Apple Health export
health_data <- read_xml("~/Documents/apple_health_export/export.xml")

# one row per walking speed record, one column per XML attribute
ws <- xml_find_all(health_data, ".//Record[@type='HKQuantityTypeIdentifierWalkingSpeed']") %>%
    map(xml_attrs) %>%
    map_df(as.list)

# attributes are read as character, so convert before plotting
ws %>%
    mutate(Date = ymd_hms(creationDate),
           value = as.numeric(value)) %>%
    ggplot(aes(Date, value)) +
    geom_point(size = 1, alpha = 0.2, color = "grey70", fill = "grey70") +
    geom_smooth() +
    labs(y = "Walking speed (km/h)",
         title = "Walking speed data",
         subtitle = "Apple Health 2020 - 2023")

Result:

Huh. Looks seasonal. Looks faster in the (southern) winter. Has that been reported before? Sure has.

It didn’t impress everyone but I thought it was interesting.
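One way to look at the seasonal pattern more directly is to summarise speed by calendar month instead of relying on the smoothed curve. What follows is only a rough sketch, not something from the original analysis: the function name monthly_speed is made up for illustration, and the toy rows are invented values standing in for the real export.

```r
library(dplyr)
library(lubridate)

# Median walking speed per calendar month (1 = January, 7 = July).
# Expects character columns `creationDate` and `value`, like `ws` above.
# `monthly_speed` is a hypothetical helper name.
monthly_speed <- function(ws) {
  ws %>%
    mutate(Date = ymd_hms(creationDate),
           value = as.numeric(value),
           Month = month(Date)) %>%
    group_by(Month) %>%
    summarise(median_speed = median(value), n = n(), .groups = "drop")
}

# Toy rows for illustration only; these are NOT real Apple Health data
toy <- data.frame(
  creationDate = c("2022-01-10 08:00:00", "2022-01-20 08:00:00",
                   "2022-07-10 08:00:00", "2022-07-20 08:00:00"),
  value = c("4.8", "5.0", "5.4", "5.6"),
  stringsAsFactors = FALSE
)

res <- monthly_speed(toy)
print(res)
```

With the real data you would call monthly_speed(ws) and see whether the June to August medians sit above the December to February ones.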

Price’s Protein Puzzle: 2023 update

One of the joys (?) of having been online for…quite some time now…is watching topics reappear every few years or so.

Yes, it’s Price’s Protein Puzzle which I last wrote about back in 2019. The good news is that my code still runs, so I’ve updated the results of an English word search versus the UniProt Reviewed (Swiss-Prot) protein database. Just for fun I threw in a few other languages too.

So what’s new?

Continue reading

The “curse of the bye” revisited

A while ago we looked at Geelong and the curse of the bye. And since the AFL media have outdone themselves this year with “curse of the bye” articles (see for example here, here, here and here), I decided to revisit the topic in more depth.

If you like that kind of thing, head over to the report at GitHub. It has lots of charts like this one.

Executive summary: once you take into account scheduling and expected results, there’s little if any evidence for significantly more losses coming off a bye round. I doubt that will prevent the same spate of articles next season.

Has your knowledge stopped updating?

Some years ago I read an article – I forget where – describing how our general knowledge often becomes frozen in time. Asked to name the tallest building in the world you confidently proclaim “the Sears Tower!”, because for most of your childhood that was the case – never mind that the record was surpassed long ago and it isn’t even called the Sears Tower anymore. From memory the example in the article was of a middle-aged speaker who constantly referred to a figure of 4 billion for the human population – again, because that’s what he learned in school and had never mentally updated.

Is this the case with programming too? Oh yes – as I learned today when performing the simplest of tasks: reading CSV files using R.

Continue reading

Using R to detect the pressure wave from the 2022 Hunga Tonga eruption in personal weather station data

It seems like an age ago, but in fact it was only mid-January 2022 when this happened:

Wow. Now, pause for a moment and try to recall the last time you read any news about Tonga since the event.
The eruption sent an atmospheric pressure wave, clearly visible in this imagery, around the world. Friends online reported that this was detected by their personal weather stations (PWS), which made me wonder: was the wave apparent in online weather station data and can it be visualized using R?

The answers are yes and yes again.

Continue reading

Using R/fitzRoy to ask: how many times a V/AFL team with the same lineup has played together?

If you sit in the intersection of “likes Australian Rules football / finds sport statistics interesting / is on Twitter”, you’ve probably come across Swamp. One of his recent tweets tells us that:

You may go on to ask: has any team lineup from one of the almost 16 000 recorded games played together again in another game? And if so, how often?

The answer to that question is at once surprising, less surprising when you think about it, and quite easy to figure out using the ever-helpful fitzRoy package.
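The general idea can be sketched without the real data: collapse each team's players in a game into a single sorted key, then count how often each key occurs. This is an assumption about one workable approach, not necessarily the method used in the full post. It does not call fitzRoy at all, and the column names game_id, team and player, the function name count_lineups and the tiny three-player lineups are all hypothetical.

```r
library(dplyr)

# Count how many games each exact lineup (same team, same set of players) played.
# `game_id`, `team` and `player` are hypothetical column names; real player
# data will use different ones.
count_lineups <- function(players) {
  players %>%
    group_by(game_id, team) %>%
    # sort so that the key does not depend on the order players are listed in
    summarise(lineup = paste(sort(unique(player)), collapse = "|"),
              .groups = "drop") %>%
    count(team, lineup, name = "games") %>%
    arrange(desc(games))
}

# Toy data for illustration only: three games, three players per side
toy <- data.frame(
  game_id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  team = "Cats",
  player = c("A", "B", "C", "C", "A", "B", "A", "B", "D"),
  stringsAsFactors = FALSE
)

res <- count_lineups(toy)
print(res)
```

Any lineup with games greater than 1 has played together more than once; in the toy data that is the A, B, C combination from games 1 and 2.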

Continue reading

Gene names, data corruption and Excel: a 2021 update

It’s an old favourite of this blog, isn’t it. We had Gene name errors and Excel: lessons not learned (2012). Followed by Data corruption using Excel: 12+ years and counting (2016). Perhaps most depressingly of all, the conclusion of the trilogy, When your tools are broken, just change the data (2019-20).

Well, I’m happy (?) to see the publication of the latest instalment, inspired in part by the title of my first post: Gene name errors: Lessons not learned, from Mark Ziemann’s group. Here’s the accompanying Twitter thread. Summary: it’s even worse than we thought.

Tagging this one with the R tag, because the group are publishing monthly RMarkdown reports here. Congratulations Nature Communications!

As a footnote: you don’t escape this kind of thing when you leave bioinformatics. I listened to a colleague in a data science meeting yesterday declare that “we won’t be putting anything into production that relies on data supplied to us as spreadsheets”.

How I resurrected my ancient PhD thesis using R/bookdown (and some other tools)

An ancient thesis

I’ve long admired the look of publications generated using the R bookdown package, and thought it would be fun and educational to publish one myself. The problem is that I am not writing a book and have no plans to do so any time soon.

Then I remembered that I’ve already written a book. There it is on the right. It’s called “Cloning, sequence analysis and studies on the expression of the nirS gene, encoding cytochrome cd1 nitrite reductase, from Thiosphaera pantotropha”. Catchy title, hey. It’s from my former life, as a biochemistry graduate turned reluctant molecular microbiologist. I believe there are 3 printed copies in existence: mine, one for the lab and one deposited in the university library.

That’s simple enough then Neil, you say, you just grab your digital files, copy/paste into RMarkdown files, do a bit of editing and you’re set. Here’s the thing.

There are no digital files.

There were, once. A collection of documents: Word, PowerPoint and JPEGs. I think they lived on a 100 MB Zip drive for a while. At some point they were burned onto a CD. And at some other point, that CD became corrupted. And that was that. Like many (most?) people, I’d barely looked at the thesis since depositing a copy in the library anyway. It didn’t seem to matter much.

And then I grew older, and started looking at some of the documents in our family, and realising that in the event of accident or disaster, they’d be lost forever. So I started working on ways to digitally archive some of them. At some point my thoughts turned to that thesis, which took 4 years of my life. I wondered whether the university library had digitised it and if so, whether it might be available online. So far as I can tell, the answer is no. That seemed a shame.

So here, briefly, is the story of how I used R/bookdown and some other tools to resurrect that thesis.

Read the rest