Hell is other people’s data

“Take a look at the TP53 mutation database“, my colleague suggested. “OK then, I will”, I replied.

I present what follows as “a typical day in the life of a bioinformatician”.

I click through to the Download page. At first sight, it’s reasonably promising. Excel files – not perfect, but beats a PDF – and ooh, TXT files. OK, I grab the “US version” (which uses decimal points, not commas for decimal numbers) of the uncurated database:

wget http://p53.fr/TP53_database_download/TP53_tumor_database/TP53_US/UMDTP53_all_2012_R1_US.txt.zip
unzip UMDTP53_all_2012_R1_US.txt.zip
#  inflating: UMDTP53_all_2012_R1_US.txt  
#  inflating: ._UMDTP53_all_2012_R1_US.txt

Wait now. What’s that weird hidden file, starting with a dot?

less ._UMDTP53_all_2012_R1_US.txt 
# "._UMDTP53_all_2012_R1_US.txt" may be a binary file.  See it anyway? 
# no, I don't want to do that
strings ._UMDTP53_all_2012_R1_US.txt 

At this point, alarm bells are ringing. I’m thinking “Mac user”. Don’t get me wrong; I own a Macbook Air myself, I see smart people doing good work with Macs. But in my experience, biologist + Mac + text files = trouble.

I push on and check the number of records in the main text file.

wc -l UMDTP53_all_2012_R1_US.txt
# 0 UMDTP53_all_2012_R1_US.txt

line endings

Mmm – oh, ^M

Zero lines. That can’t be right. We look at the file using less.

Aha, the old “line endings” issue. Easy enough to fix with the classic utility dos2unix – in this case, called as mac2unix:

mac2unix UMDTP53_all_2012_R1_US.txt > /tmp/UMDTP53_all_2012_R1_US.txt
# dos2unix: Skipping binary file UMDTP53_all_2012_R1_US.txt

Well that’s odd. Even odder is that this works.

cat UMDTP53_all_2012_R1_US.txt | mac2unix > /tmp/UMDTP53_all_2012_R1_US.txt
# on we go; overwriting the original since we can always get it from the zip
mv /tmp/UMDTP53_all_2012_R1_US.txt UMDTP53_all_2012_R1_US.txt
wc -l UMDTP53_all_2012_R1_US.txt
# 36248

The web page said there should be 36 249 records…but the file looks OK. Right, let’s get on and parse this file. It’s tab-delimited, I like Ruby, so I try to do the right thing and use a library – in this case, the Ruby CSV library. First, just a run-through to see that it all works as expected:

require 'csv'
f = "UMDTP53_all_2012_R1_US.txt"
CSV.foreach(f, :headers => true, :col_sep => "\t") do |row|
    puts row['Name']
# prints lots of names and then...
# ArgumentError: invalid byte sequence in UTF-8
# 	from /usr/lib/ruby/1.9.1/csv.rb:1855:in `sub!'
# 	from /usr/lib/ruby/1.9.1/csv.rb:1855:in `block in shift'
# 	from /usr/lib/ruby/1.9.1/csv.rb:1849:in `loop'
# 	from /usr/lib/ruby/1.9.1/csv.rb:1849:in `shift'
# 	from /usr/lib/ruby/1.9.1/csv.rb:1791:in `each'
# 	from /usr/lib/ruby/1.9.1/csv.rb:1208:in `block in foreach'
# 	from /usr/lib/ruby/1.9.1/csv.rb:1354:in `open'
# 	from /usr/lib/ruby/1.9.1/csv.rb:1207:in `foreach'

Holy Molès

Holy Molès

Not all as expected, then. Colour me astonished. Frankly, this is why people resort to splitting on the delimiter instead of using libraries…anyway, search the web for that error and you’ll find a lot of confusing, conflicting and sometimes, just plain wrong explanations and advice. Let me save you the trouble. The Ruby CSV library expects UTF-8 encoding. This file is not encoded in UTF-8. So what is it? A surprisingly tricky question to answer.

First step: another visual examination using less, which shows some odd characters. The file contains a “Medline” column, so I search PubMed with the UID and see that author Mol<8F>s is supposed to be author Molès. I am not an expert in character encoding but when I post my frustration to Twitter, I find someone who is:

So tell CSV to transcode from macRoman to UTF-8:

CSV.foreach(f, :headers => true, :col_sep => "\t", :encoding => "macRoman:UTF-8") do |row|
    puts row['Name']

All good! I can parse the file and start using the fields to do something useful. Once again the interesting part (analysis) takes minutes, getting to the analysis takes hours or days.

It’s always tempting to say “well all of these problems could be avoided if people only did [insert better approach here]”. The thing is, they don’t. Dealing with it is just part of the job and having the skills to deal with it, I think, deserves more recognition than it gets.

12 thoughts on “Hell is other people’s data

  1. A haiku I wrote not that long ago:

    “Damn that DOS file!
    I must fix the newline
    For it breaks the script”

    Although DOS file might not have been that accurate as my frustration was due to ^M line separators as well. I am new to poetry and haiku’s so be gentle :-)

    • Well, some would say we should all use UTF-8 encoded text files on Linux. Problem solved! But people are free to choose how they approach problems, so we have to deal with the consequences.

  2. Pingback: Hell is other people's data | Bioinformatics So...

  3. This sounds like a good use case for a Tabular Data Package file, published alongside the data files:

    Using the CSV Dialect Description Format, they could describe the format of each data file:

    delimiter: “\t”,
    lineTerminator: “\r”,
    encoding: “MacRoman” (note: not currently in CSVDDF)

    then the parser would know how to read in the data.

    That’s how it’s supposed to work in theory, anyway…

  4. Simply describing what the data is would be a big step forward. Just saying “text” doesn’t cut it. You need to say “Text encoded with Mac Roman, lines ended with Carriage Return, fields separate by tabs, tabs within fields quoted by a preceding backslash”. Or similar.

    This of course is why we need more explicit, self-describing formats. I think XML is pretty horrible but it is at least a foundation for creating formats that express data more cleanly.

  5. Pingback: Links 8/4/14 | Mike the Mad Biologist

  6. I just downloaded the file on my mac powerbook, unzipped it, and dragged it onto the excel icon in the doc. It popped right up. 36249 lines (first line is a header). Took less than 3 minutes from start to finish. :-)

  7. But in my experience, biologist + Mac + text files = trouble.

    This is not a Mac problem. It’s very specifically an “Excel on Mac” problem, and its buggy CSV export. But it’s not like this problem is limited to Mac, on the contrary (in fact, Mac line endings are the usual Unix-y LF, except in the aforementioned buggy Excel export). On Windows, the file endings would likewise have been wrong (CR+LF). The encoding would likewise have been non-Unicode (Windows-125*).

    As for `mac2unix` not working as expected – that’s quite simply a gotcha in the utility, because it ignores binary files (and it thinks the file is binary due to an invalid character in line 418, where this comes from is not obvious). Incidentally, the missing last record is due to a missing line termination in the last line. Again, a deficiency of the specific tool (Excel) rather than the operating system.

    So why am I making a big point out of this? Because I don’t really see what the alternatives are. Windows? Same issue. Linux? Too complicated for the biologist to learn, and lacking some tools (Illustrator …). A Mac, on the other hand, actually offers all the tools, both competent graphical tools and the necessary command line utilities to perform the grunt work. There’s simply no substitute for learning the tools, and this is where criticism should be levelled. But if we want to blame tools – and I love doing that – let’s blame Excel, and get its use for biologists banned: it’s simply irredeemably broken.

    • Good points. Not saying it’s a “Mac problem”, though. It’s a “people using Mac” problem :) or more generally, a “people using computers” problem. As you suggest, not something we can easily fix.

      To expand a little: in “my day” (the 1990s), the Mac was the favoured tool of molecular biologists because, they said, it was “easy to use”. Unfortunately, “easy to use” is often synonymous with “we don’t know what we’re doing but we can create things anyway”. And that is why I tend to roll my eyes when I see “created by biologists on a Mac”.

      In fact in many ways, I see bioinformatics as a response to 1990s molecular biology which was computer-illiterate, innumerate and devoid of statistics – but I think that is another blog post.

      (also, pretty sure the missing record is just down to them forgetting to subtract 1 for the header row :) )

Comments are closed.