When your tools are broken, just change the data

October 10, 2019 (updated August 7, 2020) / nsaunders

Update August 7 2020
The gene symbol renaming is now official. Here’s the publication (not open access, though it should be), coverage at The Verge and more coverage at The Register, the latter with quotes from me.

It’s been 3 years since we last visited that old favourite recurring topic, data corruption by Excel. Specifically, the unwanted auto-conversion of identifiers that look like dates, e.g. SEPT1, to – well, dates.

Here’s a new twist – well, a two-year-old twist in fact, as I don’t keep up to date with this field any longer:

TIL that SEPT genes were renamed in 2017 to SEPTIN genes by the HGNC https://t.co/2UadZUMLCS pic.twitter.com/jCo0Hcf6sf

— mdziemann (@mdziemann) October 8, 2019

Yes, in 2017 the HGNC decided that the solution to this long-standing issue is to rename the offending genes to prevent the auto-conversion. I’m yet to determine whether anything more came of the proposal.

It is, I suppose, a practical suggestion that will work. The newsletter states that:

Our initial consultation with the research community publishing on these genes had very mixed results

I bet it did. However, given that ongoing consultation with the research community about the inappropriate use of software has had essentially no results in 15+ years, perhaps it is the most effective solution to the problem.
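To make the problem concrete, here is a minimal Python sketch that flags gene symbols a spreadsheet is likely to read as dates and applies the SEPT to SEPTIN pattern mentioned above. The list of month-like prefixes and the function names are my own assumptions for illustration; Excel's real parsing rules are broader, and the authoritative renaming is whatever HGNC publishes, not this regular expression.

```python
import re

# Month-like prefixes followed by a day-like number, e.g. SEPT1 or MARCH1.
# ASSUMPTION: this prefix list is illustrative, not Excel's documented behaviour.
DATE_LIKE = re.compile(
    r"^(JAN|FEB|MARCH|MAR|APR|MAY|JUN|JUL|AUG|SEPT|SEP|OCT|NOV|DEC)-?(\d{1,2})$",
    re.IGNORECASE,
)


def looks_like_date(symbol):
    """Return True if the symbol resembles a month plus a day number."""
    match = DATE_LIKE.match(symbol.strip())
    if match is None:
        return False
    return 1 <= int(match.group(2)) <= 31


def rename_sept(symbol):
    """Naive pattern-based rename: SEPT<n> becomes SEPTIN<n>.

    ASSUMPTION: a hypothetical helper; use the HGNC mapping for real work.
    """
    return re.sub(r"^SEPT(\d+)$", r"SEPTIN\1", symbol)


if __name__ == "__main__":
    for gene in ["SEPT1", "MARCH1", "TP53", "SEPTIN1"]:
        print(gene, looks_like_date(gene), rename_sept(gene))
```

Running a check like this over an identifier column before the file goes anywhere near a spreadsheet at least tells you which rows are at risk.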

Can random forest provide insights into how yeast grows?

June 26, 2019 / nsaunders / 1 Comment

I’m not saying this is a good idea, but bear with me.

Continue reading →

50% bananas

May 9, 2018 (updated June 22, 2018) / nsaunders

Today in “blog posts that have spent two years in the draft folder” – “Humans are 50% banana.”

“Humans are 50% banana.”

Perhaps you have heard this statement, or one like it. It seems to be widely quoted. As an example it’s hard to go past this article from UK tabloid The Mirror which, in addition to the banana, also informs us that “the entire internet weighs about the same as one large strawberry”. I don’t even know where to begin with that one.

A couple of years ago, between jobs and with time on my hands, I thought I’d go in search of the source for this factoid.

Continue reading →

Twitter Coverage of the Lorne Genome Conference 2017

February 16, 2017 (updated March 16, 2018) / nsaunders / 2 Comments

Things to know about Lorne in the state of Victoria, Australia.

It’s situated on the Great Ocean Road, a major visitor attraction and a great way to see the scenic coastline of the region
It’s home to a number of life science conferences including Lorne Genome 2017

This week’s project then: use R to analyse coverage of the 2017 meeting on Twitter. I last did something similar for the ISMB meeting in 2012. How things have changed. Back then I prepared PDF reports using Sweave, retrieved tweets using the twitteR package and struggled with dates and times when plotting timelines. This time around I wrote RMarkdown in RStudio, tried out the newer rtweet package and, thanks to packages such as dplyr and lubridate, the data munging is all so much cleaner and simpler.

So, without further ado, here is the GitHub repository.

The report examines several aspects of the conference coverage under the broad headings of timeline, users, networks, retweets, favourites, quotes, media and text.

Better living through informatics: in search of koalas

January 20, 2015 / nsaunders / 2 Comments

In 2015, I’d like to write, think and do more about things that I care about. One of those things happens to be the koala. Now, this being a blog about bioinformatics and computational biology, I can’t just start writing about any old thing that takes my fancy…I guess. So in this post I’m going to stretch the definition to include ecological informatics and tell you the story of how I achieved a long-held ambition using one of my favourite online resources, The Atlas of Living Australia. And then we’ll wrap up with a quick survey of the (sorry) state of marsupial genomics.
Continue reading →

Finally, NCBI Genomes recognises Archaea*

August 27, 2014 / nsaunders

I’ve been complaining about this for years. They fixed it. The NCBI have reorganised their genomes FTP site and finally, Archaea are not lumped in with Bacteria.

GenBank: ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/archaea/
RefSeq:  ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/archaea/

Archaea are still included in the ASSEMBLY_BACTERIA directory; hopefully that’s next on the list.

[*] to be fair, they’ve always recognised Archaea – just not in a form that makes downloads convenient
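For anyone wanting to script against the new layout, a minimal Python sketch follows. The host and the two directory paths come from the URLs above; the function names are hypothetical, and I have not checked what the server currently offers, so treat the listing call as something to try rather than a guarantee.

```python
from ftplib import FTP

NCBI_HOST = "ftp.ncbi.nlm.nih.gov"  # host taken from the URLs in the post


def archaea_path(collection):
    """Build the archaea directory path for 'genbank' or 'refseq'."""
    if collection not in ("genbank", "refseq"):
        raise ValueError("collection must be 'genbank' or 'refseq'")
    return f"/genomes/{collection}/archaea/"


def list_archaea(collection, limit=10):
    """Return up to `limit` entries from the archaea directory.

    ASSUMPTION: anonymous FTP login is accepted by the server.
    """
    with FTP(NCBI_HOST) as ftp:
        ftp.login()  # anonymous login
        return ftp.nlst(archaea_path(collection))[:limit]


if __name__ == "__main__":
    # Needs network access; prints the first few directory entries.
    for entry in list_archaea("refseq"):
        print(entry)
```

Keeping the path construction separate from the network call means the same helper works whether you fetch with ftplib, wget or rsync.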

Venn figures go wrong

August 13, 2014 / nsaunders / 3 Comments

[Figure: 6-way Venn banana]

I thought nothing could top the classic “6-way Venn banana”, featured in The banana (Musa acuminata) genome and the evolution of monocotyledonous plants.

That is until I saw Figure 3 from Compact genome of the Antarctic midge is likely an adaptation to an extreme environment.

[Figure: 5-way Venn roadkill]

What’s odd is that Figure 2 in the latter paper is a nice, clear R/ggplot2 creation, using facet_grid(), so someone knew what they were doing.

That aside, the Antarctic midge paper is an interesting read; go check it out.

This led to some amusing Twitter discussion which pointed me to A New Rose: The First Simple Symmetric 11-Venn Diagram [*].

[*] +1 for referencing The Damned, if indeed that was the intention.

BLATting the internet: the most frequent gene?

January 24, 2014 / nsaunders / 1 Comment

I enjoyed this story from the OpenHelix blog today, describing a Microsoft Research project to mine DNA sequences from web pages and map them to UCSC genome builds.

Laura DeMare asks: what was the most-hit gene?

Most hit gene? APOE? MT @GenomeBrowser We BLATed the Internet! DNA sequences from 40 billion webpages mapped to hg19 http://t.co/D5xLMBtpYb

— Laura DeMare (@ldemare) January 23, 2014

Continue reading →

Why bioinformaticians hate the “traditional journal article”

November 6, 2013 / nsaunders / 10 Comments

This bioinformatician, at least. Hate is a strong word. Perhaps “dislike” is better.

Short answer: because you can’t get data out of them easily, if at all. Longer answer:
Read the rest…

Using the Ensembl Variant Effect Predictor with your 23andme data

June 4, 2013 (updated September 25, 2013) / nsaunders / 1 Comment

I subscribe to the Ensembl blog and found, in my feed reader this morning, a post which linked to the Variant Effect Predictor (VEP). The original blog post, strangely, has disappeared.

Not to worry. The VEP takes genotyping data in one of several formats, compares it with the Ensembl variation + core databases and returns a summary of how the variants affect transcripts and regulatory regions. My first thought: can I apply this to my own 23andme data?
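As a rough illustration of a first step, here is a Python sketch that pulls rs identifiers out of a 23andme raw data file, on the assumption that a list of variant identifiers is one of the input formats the VEP accepts. It also assumes the usual raw-data layout: tab-separated columns for rsid, chromosome, position and genotype, with comment lines starting with "#" and "--" for no-calls. The function name and the sample rows are made up for the example.

```python
import csv
import io


def extract_rsids(handle):
    """Collect rs identifiers with a called genotype from 23andme-style rows.

    ASSUMPTION: columns are rsid, chromosome, position, genotype.
    """
    rsids = []
    data_lines = (line for line in handle if not line.startswith("#"))
    for row in csv.reader(data_lines, delimiter="\t"):
        if len(row) < 4:
            continue  # skip blank or malformed lines
        rsid, _chrom, _pos, genotype = row[:4]
        # Keep dbSNP-style ids only; drop internal ids and no-calls.
        if rsid.startswith("rs") and genotype != "--":
            rsids.append(rsid)
    return rsids


# Made-up rows for illustration, not real genotype data.
SAMPLE = (
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs111\t1\t1000\tAA\n"
    "rs222\t1\t2000\t--\n"
    "i333\t2\t3000\tCT\n"
    "rs444\tX\t4000\tG\n"
)

if __name__ == "__main__":
    print("\n".join(extract_rsids(io.StringIO(SAMPLE))))
```

Note that a bare list of identifiers throws away your own genotype calls, so it tells you what is known about each site rather than about your alleles; using the genotypes would mean converting to one of the allele-aware formats instead.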

Read the rest…

What You're Doing Is Rather Desperate

Notes from the life of a [data] scientist

genomics
