Data discovery: seasonal speed

January 23, 2024 / nsaunders

Just writing this one quickly as it’s been hanging around my browser tabs for weeks…

I wrote Taking steps (in XML) almost 7 years ago and once in a while, I still grab Apple Health data from my phone and play around with it in R for a few minutes. Sometimes, curve fitting to a cloud of points generates a surprise.

library(tidyverse)  # tidyverse >= 2.0.0 also attaches lubridate, which provides ymd_hms()
library(xml2)
theme_set(theme_bw())

# Read the Apple Health export
health_data <- read_xml("~/Documents/apple_health_export/export.xml")

# Extract the walking speed records: one row per record, one column per XML attribute
ws <- xml_find_all(health_data, ".//Record[@type='HKQuantityTypeIdentifierWalkingSpeed']") %>% 
    map(xml_attrs) %>% 
    map_df(as.list)

# Plot speed against time and overlay a smoothed fit
ws %>% 
    mutate(Date = ymd_hms(creationDate), 
           value = as.numeric(value)) %>% 
    ggplot(aes(Date, value)) + 
    geom_point(size = 1, alpha = 0.2, color = "grey70", fill = "grey70") + 
    geom_smooth() + 
    labs(y = "Walking speed (km/h)", 
         title = "Walking speed data", 
         subtitle = "Apple Health 2020 - 2023")

Result:

Huh. Looks seasonal. Looks faster in the (southern) winter. Has that been reported before? Sure has.

It didn’t impress everyone but I thought it was interesting.
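A rough way to put numbers on the seasonal impression is to pool the records by calendar month and compare the means. The sketch below uses base R only. The helper name monthly_speed is made up for illustration, and it assumes a data frame shaped like ws above, with character columns creationDate (beginning "YYYY-MM-DD") and value. The toy rows are invented purely to show the call and are not real measurements.

```r
# Sketch: mean walking speed per calendar month, pooled across years.
# Assumption: `df` looks like `ws` in the post, i.e. it has character columns
# `creationDate` (starting with "YYYY-MM-DD") and `value`.
monthly_speed <- function(df) {
  # Use the calendar date exactly as written in the export
  dates <- as.Date(substr(df$creationDate, 1, 10))
  speed <- as.numeric(df$value)
  out <- aggregate(speed,
                   by = list(month = as.integer(format(dates, "%m"))),
                   FUN = mean)
  names(out)[names(out) == "x"] <- "mean_speed"
  out
}

# Invented rows, only to demonstrate the call
toy <- data.frame(
  creationDate = c("2021-01-10 08:00:00 +1100", "2021-01-20 08:00:00 +1100",
                   "2021-07-05 08:00:00 +1000", "2022-07-15 08:00:00 +1000"),
  value = c("4.0", "4.4", "5.0", "5.2"),
  stringsAsFactors = FALSE
)
res <- monthly_speed(toy)
print(res)
```

With real data, a higher mean in June to August than in December to February would match the "faster in the southern winter" pattern in the plot.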

Gene names, data corruption and Excel: the final chapter?

October 27, 2023 / nsaunders

I suppose that after:

Gene name errors and Excel: lessons not learned (2012)
Data corruption using Excel: 12+ years and counting (2016)
When your tools are broken, just change the data (2019-20)
and Gene names, data corruption and Excel: a 2021 update (2021)

it would be remiss of me not to mention: Microsoft fixes the Excel feature that was wrecking scientific data.

Is it really fixed though? Users have to know that the feature exists, find it and toggle a checkbox. Given that the users most “at risk” probably open CSV files in Excel by default simply by clicking on them…I’m not optimistic.

Still, as Mark said:

19 years late, but better than never 😁
— Mark Ziemann🌈🌻 (@mdziemann) October 24, 2023

Price’s Protein Puzzle: 2023 update

July 26, 2023 / nsaunders

One of the joys (?) of having been online for…quite some time now…is watching topics reappear every few years or so.

What is the longest coherent word or phrase present in the amino acid sequence of a real protein?
— Dr. Caroline Bartman (@Caroline_Bartma) July 21, 2023

Yes, it’s Price’s Protein Puzzle which I last wrote about back in 2019. The good news is that my code still runs, so I’ve updated the results of an English word search versus the UniProt Reviewed (Swiss-Prot) protein database. Just for fun I threw in a few other languages too.
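For anyone who wants the gist of such a search, here is a minimal base R sketch. It is not the code from the original post: the function name find_words_in_proteins, the min_length argument and the toy inputs are all hypothetical. It keeps only words spelled with the 20 standard one-letter amino acid codes, then looks for each word as a literal substring of each sequence.

```r
# Sketch: find dictionary words that occur inside protein sequences.
# `words` and `sequences` are plain character vectors supplied by the caller;
# they stand in for a real word list and a real protein database.
find_words_in_proteins <- function(words, sequences, min_length = 3) {
  words <- unique(toupper(words))
  # Keep words made only of the 20 standard amino acid letters
  ok <- grepl("^[ACDEFGHIKLMNPQRSTVWY]+$", words) & nchar(words) >= min_length
  words <- words[ok]
  sequences <- toupper(sequences)

  hits <- lapply(words, function(w) {
    idx <- which(grepl(w, sequences, fixed = TRUE))
    if (length(idx) == 0) return(NULL)
    data.frame(word = w, sequence_index = idx, stringsAsFactors = FALSE)
  })
  hits <- do.call(rbind, hits)
  if (is.null(hits)) {
    return(data.frame(word = character(0), sequence_index = integer(0)))
  }
  # Longest words first, ties broken alphabetically
  hits <- hits[order(-nchar(hits$word), hits$word), ]
  rownames(hits) <- NULL
  hits
}

# Toy inputs, invented for illustration only
toy_words <- c("tender", "end", "book", "cat", "learn")
toy_seqs  <- c("MKTENDERLAG", "GGLEARNPP")
found <- find_words_in_proteins(toy_words, toy_seqs)
print(found)
```

Looping over words with grepl is fine for a toy example. A full dictionary against a full protein database would probably call for something faster.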

So what’s new?

Continue reading →

The “curse of the bye” revisited

July 10, 2023 / nsaunders

A while ago we looked at Geelong and the curse of the bye. Since the AFL media have outdone themselves this year with “curse of the bye” articles (see for example here, here, here and here), I decided to revisit the topic in more depth.

If you like that kind of thing head over to the report at GitHub. It has lots of charts like this one.

Executive summary: once you take into account scheduling and expected results, there’s little if any evidence for significantly more losses coming off a bye round. I doubt that will prevent the same spate of articles next season.

Has your knowledge stopped updating?

January 27, 2023 / nsaunders / 6 Comments

Some years ago I read an article – I forget where – describing how our general knowledge often becomes frozen in time. Asked to name the tallest building in the world you confidently proclaim “the Sears Tower!”, because for most of your childhood that was the case – never mind that the record was surpassed long ago and it isn’t even called the Sears Tower anymore. From memory the example in the article was of a middle-aged speaker who constantly referred to a figure of 4 billion for the human population – again, because that’s what he learned in school and had never mentally updated.

Is this the case with programming too? Oh yes – as I learned today when performing the simplest of tasks: reading CSV files using R.

Continue reading →

Editing metadata in trail camera images using R, magick and exiftool

October 25, 2022 (updated October 26, 2022) / nsaunders / 2 Comments

I have a new hobby: camera traps, also known as trail cameras. Strapped to trees in my local bushland they sit in wait, firing automatically when triggered by a passing animal. Once in a while, something quite magical happens.

The camera model I chose is the Campark T85 which, for me, had the right combination of features and price point. One useful feature is the ability to transfer images and video to a phone wirelessly (albeit through a rather clunky phone app). Unfortunately, images retrieved in this way have one major flaw: an almost-complete absence of metadata. There is no GPS in the camera of course, but the EXIF data does not include the date/time of the image, nor the camera make.

With a little research, I found a way to add this information to the images later using R and some additional software named exiftool. Here’s how I did it.

Continue reading →

Using R to detect the pressure wave from the 2022 Hunga Tonga eruption in personal weather station data

March 29, 2022 / nsaunders / 1 Comment

It seems like an age ago, but in fact it was only mid-January 2022 when this happened:

The satellite imagery from the Hunga Tonga eruption is unreal. Direct your attention to the lower right. The eruption then shock wave is simply incredible. pic.twitter.com/OTLCgyEozQ
— Taylor Trogdon (@TTrogdon) January 15, 2022

Wow. Now, pause for a moment and try to recall the last time you read any news about Tonga since the event.

The eruption sent an atmospheric pressure wave, clearly visible in this imagery, around the world. Friends online reported that this was detected by their personal weather stations (PWS) which made me wonder: was the wave apparent in online weather station data and can it be visualized using R?

The answers are yes and yes again.

Continue reading →

Using R/fitzRoy to ask: how many times a V/AFL team with the same lineup has played together?

March 28, 2022 / nsaunders

If you sit in the intersection of “likes Australian Rules football / finds sport statistics interesting / is on Twitter”, you’ve probably come across Swamp. One of his recent tweets tells us that:

No V/@AFL premiership winning lineup have all played together in another V/@AFL match, there has always been at least one person missing

All MELB 2021 premiership players are still at the club in 2022

@melbournefc
— Swamp (@sirswampthing) March 16, 2022

You may go on to ask: has any team lineup from one of the almost 16 000 recorded games played together again in another game? And if so, how often?

The answer to that question is at once surprising, less surprising when you think about it, and quite easy to figure out using the ever-helpful fitzRoy package.
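The general idea can be sketched without any football data at all: reduce each team's lineup in each game to a single sorted key, then count how many games share a key. The base R code below is only an illustration. The column names game_id, team and player are assumptions, not the names fitzRoy returns, and the toy table is invented.

```r
# Sketch: count how many games each exact lineup has played together.
# Assumption: `df` has one row per player per game, with columns
# `game_id`, `team` and `player` (hypothetical names).
count_lineups <- function(df) {
  # One key per team per game: sorted player names joined into one string,
  # so the same set of players always produces the same key
  keys <- aggregate(player ~ game_id + team, data = df,
                    FUN = function(p) paste(sort(p), collapse = "|"))
  names(keys)[names(keys) == "player"] <- "lineup"

  # Number of games in which each team fielded each lineup
  counts <- aggregate(game_id ~ team + lineup, data = keys, FUN = length)
  names(counts)[names(counts) == "game_id"] <- "games"
  counts <- counts[order(-counts$games), ]
  rownames(counts) <- NULL
  counts
}

# Invented two-player "lineups", only to demonstrate the call
toy_games <- data.frame(
  game_id = c(1, 1, 2, 2, 3, 3),
  team    = "A",
  player  = c("x", "y", "y", "x", "x", "z"),
  stringsAsFactors = FALSE
)
lineups <- count_lineups(toy_games)
print(lineups)
```

Sorting the names before pasting is what makes the key independent of the order in which players are listed for a game.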

Continue reading →

Enhancement of old colour photographs using Generative Adversarial Networks

December 23, 2021 (updated March 15, 2022) / nsaunders / 1 Comment

It’s almost Christmas, I haven’t posted anything in a while and I see that WordPress has an Image Compare feature, so let’s have some colourful fun.

When I’m not at the computer writing R code, I can often be found at the computer processing photographs. Or at the computer browsing Twitter, which is how I came across Stuart Humphryes, a digital artist who enhances autochromes. Autochromes are early colour photographs, generated using a process patented by the Lumière brothers in 1903. You can find and download many examples of them online. Stuart uses a variety of software tools to clean, enhance and balance the colours, resulting in bright vivid images that often have a contemporary feel, whilst at the same time retaining the somewhat “dreamy” quality of the original.

Having read that one of his tools uses neural networks, I was keen to discover how easy it is to achieve something similar using freely-available software found online. The answer is “quite easy” – although achieving results as good as Stuart’s is somewhat more difficult. Here’s how I went about it.

Continue reading →

Gene names, data corruption and Excel: a 2021 update

August 3, 2021 / nsaunders / 1 Comment

It’s an old favourite of this blog, isn’t it. We had Gene name errors and Excel: lessons not learned (2012). Followed by Data corruption using Excel: 12+ years and counting (2016). Perhaps most depressingly of all, the conclusion of the trilogy, When your tools are broken, just change the data (2019-20).

Well, I’m happy (?) to see the publication of the latest instalment, inspired in part by the title of my first post: Gene name errors: Lessons not learned, from Mark Ziemann’s group. Here’s the accompanying Twitter thread. Summary: it’s even worse than we thought.

Tagging this one with the R tag, because the group are publishing monthly RMarkdown reports here. Congratulations Nature Communications!

As a footnote: you don’t escape this kind of thing when you leave bioinformatics. I listened to a colleague in a data science meeting yesterday declare that “we won’t be putting anything into production that relies on data supplied to us as spreadsheets”.

What You're Doing Is Rather Desperate

Notes from the life of a [data] scientist
