So, I'm debugging some Perl that uses weight matrices to score sites in a protein sequence. The matrix is stored as a nested hash: each key $row of $weights{'row'} is a row number, and $weights{'row'}{$row}{$aa} is the weight for amino acid $aa in row $row. I'd tested a few sequences without incident and moved on to whole genomes, when suddenly the script began to spew:
The offending line is the one that increments the score with the appropriate value of $aa in $row:
Alright - so the warning tells us that we're missing either $score, $weights{'row'}{$row} or $weights{'row'}{$row}{$aa}. I try out a few if() statements to test for the existence of each of these and to tell me what's happening should they not exist. Tell me why you're struggling with $aa, is what I say. And I get lines like this:
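For illustration, here's roughly what that kind of check could look like. This is a minimal sketch rather than the actual script: the two-row matrix, the missing_piece() helper and the @site array are all made up for the example, and only the $weights{'row'}{$row}{$aa} layout comes from the code described above.

```perl
use strict;
use warnings;

# Hypothetical two-row matrix; a real one would cover all 20 standard residues.
my %weights = (
    row => {
        0 => { A => 0.5, C => -1.2 },
        1 => { A => 0.1, C => 0.8 },
    },
);

# Describe the first missing piece, or return nothing if all three are present.
# Checking the row with exists() first avoids autovivifying an empty row hash.
sub missing_piece {
    my ($score, $row, $aa) = @_;
    return 'score is undefined'            unless defined $score;
    return "no row $row in matrix"         unless exists $weights{row}{$row};
    return "no weight for $aa in row $row" unless defined $weights{row}{$row}{$aa};
    return;
}

my $score = 0;
my @site  = ('A', 'X');    # hypothetical site: residue for row 0, residue for row 1

for my $row (0 .. $#site) {
    my $aa = $site[$row];
    if (my $problem = missing_piece($score, $row, $aa)) {
        warn "Cannot score: $problem\n";
        next;
    }
    $score += $weights{row}{$row}{$aa};
}

print "score = $score\n";
```

With checks like these in place, the uninformative warning turns into a message that names the row and the residue that could not be looked up.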
All becomes clear. The weight matrix is built using the 20 standard amino acid characters. The sequence that we want to score contains characters other than these 20. Hence, $weights{'row'}{$row}{$aa} doesn't exist if $aa is B, J, O, U, X or Z, all of which, by the way, are now standard IUPAC amino acid symbols.
The solution for now is simply not to score these troublesome sequences. For me this highlights a big problem when writing code for biology. You can write good, clean code, but it has to be able to handle many variations in the input data - most of which you become aware of by trial and error. When you grab stuff from databases, you're just never quite sure what you're going to get.
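Skipping the troublesome sequences might look something like the sketch below. The nonstandard_residues() helper, the $STANDARD string and the two example sequences are assumptions made for illustration, not the code from the script itself.

```perl
use strict;
use warnings;

# One-letter codes for the 20 standard amino acids the matrix was built from.
my $STANDARD = 'ACDEFGHIKLMNPQRSTVWY';

# Return the distinct non-standard characters in a sequence, sorted.
sub nonstandard_residues {
    my ($seq) = @_;
    my %seen;
    $seen{$_}++ for grep { index($STANDARD, $_) < 0 } split //, uc $seq;
    my @odd = sort keys %seen;
    return @odd;
}

# Hypothetical input: one clean sequence, one with ambiguity codes.
my %seqs = (
    seq1 => 'MKTAYIAKQR',
    seq2 => 'MKXAUZ',
);

for my $id (sort keys %seqs) {
    my @odd = nonstandard_residues($seqs{$id});
    if (@odd) {
        warn "Skipping $id: contains non-standard residues @odd\n";
        next;
    }
    print "Scoring $id\n";    # the real scoring routine would be called here
}
```

Reporting which characters caused the skip, rather than dropping the sequence silently, at least leaves a record of how much of the genome went unscored and why.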
We're back to data standards again, aren't we?


