The curse of amino acid alphabets

So, I’m debugging some Perl that uses weight matrices to score sites in a protein sequence. The matrix is stored as a hash of arrays, with each $row of $weights{‘row’} being a row number and $weights{‘row’}{$row}{$aa} being a weight for amino acid $aa in row $row. I’ve tested a few sequences without incident and moved onto whole genomes, when suddenly the script begins to spew:

Use of uninitialized value in addition (+) at MyModule.pm line 819


The offending line is the one that increments the score with the appropriate value of $aa in $row:

$score += $weights{‘row’}{$row}{$aa};

Alright – so the warning tells us that we’re missing either $score, $weights{‘row’}{$row} or $weights{‘row’}{$row}{$aa}. I try out a few if() statements to test for the existence of each of these and to tell me what’s happening should they not exist. Tell me why you’re struggling with $aa, is what I say. And I get lines like this:

PROBLEM: gi|15839443|ref|NP_334480.1| T 904 RVATXPG X

All becomes clear. The weight matrix is built using the 20 standard amino acid characters. The sequence that we want to score contains characters other than these 20. Hence, $weights{‘row’}{$row}{$aa} doesn’t exist if $aa is B, J, O, U, X or Z. All of which by the way are now standard IUPAC amino acid symbols.

The solution for now is simply not to score these troublesome sequences. For me this highlights a big problem when writing code for biology. You can write good, clean code, but it has to be able to handle many variations in the input data – most of which you become aware of by trial and error. When you grab stuff from databases, you’re just never quite sure what you’re going to get.
We’re back to data standards again, aren’t we.

3 thoughts on “The curse of amino acid alphabets

  1. Heh, I can sympathize … I remember when I got tripped up for a while by not expecting ‘X’ residues in SwissProt. I had essentially the same problem but using a Python dictionary. It’s a recurring theme for bioinformatics databases, web services (eg validating CGI or Javascript user inputs) etc … the moment you assume something about the sanity of your input data, something will break :).

    On a related but mildly tangential note … anyone known if the Clustal alignment file format has any strict definition somewhere ? Lot’s of software other than ClustalW claims to produce “Clustal format”, but usually it is output with different headers (eg “MUSCLE” instead of “CLUSTAL”) .. and some parsers can’t handle that gracefully (eg Biopython’s Bio.Clustalw). I guess whether the format has a published definition or not, these ‘pseudo-Clustal’ formats aren’t going away, so it should be up to Biopython to make their parser (optionally) more forgiving. Hmmm .. eventually I’ll get this to the Biopython mailing list (once I’ve actually got around to writing the patch).

  2. There sure are a lot of “Clustal” variants. I’ve seen with/without the line count at the end of each line and with/without the consensus line.
    The EBI has a page of formats but I don’t know if these are strict definitions. Bioperl’s Align::IO::clustalw looks for “CLUSTAL” at the start of the first line, then parses alignment lines either with or without the trailing number or consensus line.

  3. Pingback: The curse of amino acid alphabets revisited « What You’re Doing Is Rather Desperate

Comments are closed.