If you must send me an Excel spreadsheet…

…please, try to follow these simple guidelines.

1. Don’t bother to format the cells
Where possible, I will not open your spreadsheet in a spreadsheet application. If I do, it will be only to marvel at the horror, then export it as rapidly as possible to a delimited text file. I do not care about the font, the font size or the font weight. I do not care whether there are grid lines around the cells. I especially do not care about cells which you have highlighted using some arbitrary (and unexplained) colour scheme.

2. No multiple tables
If you include multiple “tables” on one sheet, separated by blank rows, there is a good chance that I will not notice them. If you include multiple tables on multiple “sheets”, there is an excellent chance that I will not notice them.

3. Be consistent
If you must use confusing, abbreviated terms for your row and column names, at least keep them consistent. When you suddenly switch from “Patient ID” to “MCO_ID”, or from “Tissue Bank ID” to “TB ID” but leave everything else the same, I (and my software) assume that you’re talking about something different.

4. Yes/No = 1/0
Would it kill you to think as hard about the type and structure of your data as the data itself? If your variable takes one of two values in a “yes/no” fashion, the best representation is 1 or 0. That goes for “wt/mut” too. If you must use “Y/N”, don’t suddenly switch to “Yes/No” (or case-sensitive variations thereof) just because you feel like it.

5. If it doesn’t exist, it shouldn’t be there
Just leave the cell blank. I don’t want to see “n/a”, “NA”, “?”, “-” or anything else.
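
Guidelines 4 and 5 can be applied mechanically once a sheet has been exported to delimited text. Here is a minimal Python sketch, using only the standard library. The token sets, the function names and the example columns are all illustrative assumptions, and so is the choice to code “mut” as 1 and “wt” as 0.

```python
import csv
import io

# Assumed token sets; extend them to match whatever actually turns up.
TRUE_TOKENS = {"y", "yes", "mut", "1"}    # assumption: mutant is coded as 1
FALSE_TOKENS = {"n", "no", "wt", "0"}     # assumption: wild type is coded as 0
MISSING_TOKENS = {"", "n/a", "na", "?", "-"}


def normalise_binary(value):
    """Map a yes/no style cell to '1' or '0'; missing markers become ''."""
    token = value.strip().lower()
    if token in MISSING_TOKENS:
        return ""
    if token in TRUE_TOKENS:
        return "1"
    if token in FALSE_TOKENS:
        return "0"
    raise ValueError(f"unrecognised binary value: {value!r}")


def clean_rows(rows, binary_columns):
    """Yield cleaned copies of dict rows.

    Columns named in binary_columns are coerced to 1/0; in every other
    column, missing-value markers are replaced by an empty string.
    """
    for row in rows:
        cleaned = {}
        for name, value in row.items():
            if name in binary_columns:
                cleaned[name] = normalise_binary(value)
            elif value.strip().lower() in MISSING_TOKENS:
                cleaned[name] = ""
            else:
                cleaned[name] = value.strip()
        yield cleaned


# Hypothetical tab-delimited export, used only to exercise the functions.
EXAMPLE = (
    "patient_id\tsmoker\tkras\tage\n"
    "P1\tYes\twt\t54\n"
    "P2\tn\tMUT\tn/a\n"
    "P3\t?\tY\t-\n"
)
reader = csv.DictReader(io.StringIO(EXAMPLE), delimiter="\t")
cleaned = list(clean_rows(reader, binary_columns={"smoker", "kras"}))
```

Raising an error on an unrecognised value is deliberate: a new spelling of “yes” should be noticed and added to the token set, not silently guessed at.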

6. What belongs with what?
Have you noticed that certain bits of your data belong with other bits? For example, you can take several samples from a patient and do several experiments using those samples. Perhaps you’ve heard the term “relational data”? Well, that’s what it means.

If you could find a way to highlight those relations in your spreadsheet (no, not using coloured cells please), it would really help. On second thoughts: why don’t you come and see me before collecting your data? We’ll design a database together. You might even realise why I hate your stupid spreadsheets so much.
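
To make the patient/sample/experiment relation concrete, here is a rough sketch using Python’s built-in sqlite3 module. Every table and column name is hypothetical; a real design would depend on the study and would come out of that conversation.

```python
import sqlite3

# Hypothetical schema: one patient has many samples, and one sample has
# many experiments. None of these names come from a real study.
SCHEMA = """
CREATE TABLE patient (
    patient_id TEXT PRIMARY KEY
);
CREATE TABLE sample (
    sample_id  TEXT PRIMARY KEY,
    patient_id TEXT NOT NULL REFERENCES patient(patient_id),
    tissue     TEXT
);
CREATE TABLE experiment (
    experiment_id INTEGER PRIMARY KEY,
    sample_id     TEXT NOT NULL REFERENCES sample(sample_id),
    assay         TEXT NOT NULL,
    result        REAL            -- NULL when there is no measurement
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce them by default
conn.executescript(SCHEMA)

# Made-up records: one patient, two samples, three experiments.
conn.execute("INSERT INTO patient VALUES ('P1')")
conn.executemany(
    "INSERT INTO sample VALUES (?, ?, ?)",
    [("S1", "P1", "tumour"), ("S2", "P1", "normal")],
)
conn.executemany(
    "INSERT INTO experiment (sample_id, assay, result) VALUES (?, ?, ?)",
    [("S1", "expression", 2.5), ("S1", "methylation", None), ("S2", "expression", 1.1)],
)

# The relations are now explicit, so "what belongs with what" is a join.
rows = conn.execute(
    """
    SELECT p.patient_id, s.sample_id, e.assay, e.result
    FROM patient AS p
    JOIN sample AS s ON s.patient_id = p.patient_id
    JOIN experiment AS e ON e.sample_id = s.sample_id
    ORDER BY s.sample_id, e.assay
    """
).fetchall()
```

With foreign keys switched on, a sample that refers to a patient who is not in the patient table is rejected at insert time, which is exactly the kind of check a spreadsheet never makes.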

23 and me – yes, me – part 2

Sample journey and arrival


Spitting across the Pacific

My tube of spit arrived at the lab on May 19. Six days door-to-door via Guangzhou, Anchorage and Memphis to LA.


23andMe raw data menu

On arrival, a confirmatory email: “The spit sample you recently submitted to 23andMe for the person listed above has been received by the laboratory and is now pending analysis; the process usually takes 6-8 weeks. You will receive another email notification from us as soon as the data for this sample are ready to be accessed through your 23andMe account.”

In the meantime, there’s plenty to explore at the 23andMe website. Anyone can create a demo account, which allows you to explore anonymous sample data to get a feel for what you’ll see when your own sample is processed. Naturally, I’m most excited by the options to browse and download raw data. You can also participate in around 20 health and genetics surveys, which are a good way to kill time, although not many of them provide instant personal gratification.

Next update – some time in July.

23 and me – yes, me – part 1

Until recently, I was not even aware that there is a DNA day. Nor can I tell you exactly when and where I noticed that 23andMe, the personal genomics company, had launched a sale to celebrate the day; I imagine it flashed by on Twitter or FriendFeed. I can tell you that, like many others, I decided that I could finally justify the expense and signed up (with around 15 minutes to spare, thanks to the 17-hour Sydney/California time difference). I’m now waiting for sample arrival and processing.

I thought it might be interesting to blog the experience and, provided that I don’t discover anything disturbing, I’ll share some of my data here. Related posts will be tagged with “23andme”, and here is part 1, which covers sign-up, delivery, sample collection and return.

Beware of rogue header files (Bioconductor installation)

Just a short note concerning a “gotcha”.

As I have many times before, I opened an R console on my newly-upgraded (to lucid 10.04) Ubuntu machine, typed source("http://bioconductor.org/biocLite.R") and began a Bioconductor install with biocLite(). Only this time, I saw this:

Error in dyn.load(file, DLLpath = DLLpath, ...) : unable to load shared library 
  /home/sau103/R/i486-pc-linux-gnu-library/2.11/affyio/libs/affyio.so: undefined symbol: egzread
ERROR: loading failed
* removing ‘/home/sau103/R/i486-pc-linux-gnu-library/2.11/affyio’

A quick email to the Bioconductor mailing list put me in touch with the very helpful Martin Morgan, who suggested that I check my zlib libraries. Sure enough, the rogue “egzread” was found in /usr/local/include/zlibemboss.h, which sat next to a second zlib.h in /usr/local/include, in addition to the system copy at /usr/include/zlib.h.

grep egz /usr/local/include/zlibemboss.h
> #define gzread egzread

I moved the rogue zlib.h out of /usr/local/include and order was restored.

So in summary, watch out when installing EMBOSS on Ubuntu – it seems to mess with things that it should not.
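
The check described above (look for a header that exists in more than one include directory, then grep the suspect file for a redefinition) can be sketched in Python. The function names are hypothetical, and the directory order passed in is an assumption that should match your compiler’s actual search path.

```python
import os
from collections import defaultdict


def find_shadowed_headers(include_dirs, suffix=".h"):
    """Return {header_name: [paths]} for headers found in more than one directory.

    Directories are scanned non-recursively in the order given, so the first
    path listed for a header is the one an earlier search-path entry supplies.
    """
    seen = defaultdict(list)
    for directory in include_dirs:
        if not os.path.isdir(directory):
            continue  # skip search-path entries that do not exist
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            if name.endswith(suffix) and os.path.isfile(path):
                seen[name].append(path)
    return {name: paths for name, paths in seen.items() if len(paths) > 1}


def find_defines(header_path, symbol):
    """Return the '#define' lines in header_path that mention symbol."""
    hits = []
    with open(header_path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            stripped = line.strip()
            if stripped.startswith("#define") and symbol in stripped:
                hits.append(stripped)
    return hits
```

On a machine like the one described here, calling find_shadowed_headers(["/usr/local/include", "/usr/include"]) would be expected to report zlib.h in both directories, and find_defines("/usr/local/include/zlibemboss.h", "egz") would play the role of the grep shown earlier.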

MongoDB and Ubuntu 10.04

Fans of MongoDB and Ubuntu, rejoice. Installation just got easier, with the appearance of mongodb in the Ubuntu repositories.

However – the latest version in lucid is 1.2.2, whereas you want the very latest, 1.4.2. All the instructions are here. As usual, it’s just a case of adding a line to your sources list, importing a GPG key and then:

sudo apt-get update
sudo apt-get install mongodb-stable

Configuration lives in /etc/mongodb.conf, databases live in /var/lib/mongodb and logs go to /var/log/mongodb/*.log.

Poor reporting: the anti-freeze that wasn’t

Generally, I don’t cover “mainstream” science reporting, but this is too poor to let pass.

Nature Genetics features a fascinating article about the properties of haemoglobin from the extinct woolly mammoth. Briefly, the researchers sequenced DNA encoding haemoglobin subunits from a sample of mammoth bone and compared it with that of modern elephants. They then altered the modern elephant DNA sequence to match that of the mammoth, expressed mammoth and elephant protein in E. coli and compared the oxygen affinity of each protein. Their conclusion: the amino acid substitutions in mammoth haemoglobin result in an enhanced ability to release oxygen to tissues at low temperature.

You will not find the words “anti-freeze” anywhere in the article. Bear that in mind, as we survey the reporting of this story by various news outlets:
