The curse of amino acid alphabets revisited

I’ve discussed previously the problem of writing Perl to deal with rogue sequences. I think this is what makes writing biological software challenging; you never know what you’re going to encounter in an input file until it throws an error message, at which point you can devise a fix for that particular anomaly. Recently I’ve encountered a problem with the way in which Bioperl deals with alphabets.

Here’s the scenario: I have several hundred fasta files containing predicted protein sequences from both completed and draft microbial genomes. I’d like to tear through them and find (amongst other things), the molecular weight of each protein. Easy enough using Bio::SeqIO and Bio::Tools::SeqStats:

use strict;
use Bio::SeqIO;
use Bio::Tools::SeqStats;

my $infile = shift || die("Usage = mw.pl\n");
my $seqio = Bio::SeqIO->new('-file' => $infile, '-format' => 'fasta');

while(my $seqobj = $seqio->next_seq) { 
    my $seqstats = Bio::Tools::SeqStats->new($seqobj);
                                     }

Once in a while(), Bioperl will thrown an exception which looks like this:

------------- EXCEPTION: Bio::Root::Exception -------------
MSG:  Alphabet not OK for Bio::Seq=HASH(0x86312b8)

This means that it doesn’t like some characters in the protein sequence. Realising that our sequences might contain “*”, “X”, “-” or almost anything other than an amino acid character, we can write this little fix:

while(my $seqobj = $seqio->next_seq) { 
    my $seq = $seqobj->seq;
    $seq = uc($seq);
    next if($seq =~/[^ACDEFGHIKLMNPQRSTVWY]/);
                                     }

This says: convert the sequence to upper case and if it matches anything other than the standard 20 amino acid characters, ignore it and move to the next sequence.

Rather strangely, even having done this at least one fasta file is still throwing the exception. It’s time to “catch that exception” using eval{} and see what’s going on:

while(my $seqobj = $seqio->next_seq) { 
    my $seq = $seqobj->seq;
    $seq = uc($seq);
    next if($seq =~/[^ACDEFGHIKLMNPQRSTVWY]/);

my $seqstats;
eval { $seqstats = Bio::Tools::SeqStats->new($seqobj); };
print "Problem:\t$infile\t", $seqobj->id, "\n" if($@);
                                     }

Running the script on all of the fasta files prints:

Problem:	fasta/genomes/Burkholderia_ambifaria_AMMD.500160000.faa	500030040

We can now examine the offending fasta file for protein ID 500030040. Here’s its sequence:

>500030040 BambDRAFT_0964  lipoprotein, putative [Burkholderia ambifaria AMMD]
GGGNGGGNGGGNGGGHGGGSGGHGNGNGGHGNGNGNGNGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGSGGSGGGGGSGGSGGSGG
AGGSGGQGNGGGSTGGHGIGGHGNGGGNGNGNGNGGAGSGGANGVGNGSG
GGGGGGSGGSAGGHGH

No illegal characters there. An awful lot of ‘G’ and a few ‘N’ though. You’d be surprised if that was a real protein. More importantly, why is Bioperl complaining about the alphabet? The answer is that with all those ‘G’ characters, it has assumed the sequence to be DNA – and you can’t pass a DNA sequence to Bio::Tools::SeqStats. We can confirm this using “print $seqobj->alphabet”.

One way around this would be to extract the id, description and sequence from the fasta file and use them to construct a new Bio::SeqI object with the correct alphabet:

while(my $seqobj = $seqio->next_seq) {
    my $id   = $seqobj->display_id;
    my $desc = $seqobj->desc;
    my $seq  = $seqobj->seq;
    $seq     = uc($seq);
    next if($seq =~/[^ACDEFGHIKLMNPQRSTVWY]/);
    my $newobj   = Bio::Seq->new('-display_id' => $id, '-desc' => $desc, '-seq' => $seq, '-alphabet' => 'protein');
                                     }

Frankly though, that’s quite a bit of work just to deal with a sequence that in all likelihood isn’t even genuine.

10 thoughts on “The curse of amino acid alphabets revisited

  1. Andrew Perry

    Sounds like a case of Bio::Seq trying to be too smart for it’s own good. Looking at the Bioperl docs, there doesn’t seem to be any optional ‘-alphabet’ argument that can be passed to it either.

    Biopython’s Bio.SeqIO.FASTA.FastaReader on the other hand does have an ‘alphabet’ arguement, so you can explicitly tell it in advance whether you are expecting DNA, RNA or protein sequence.

    (Not trying to start a “Bioperl vs Biopython” flamewar here, I was just curious so I checked the docs).

  2. nsaunders Post author

    Looking at the Bioperl docs, there doesn’t seem to be any optional ‘-alphabet’ argument that can be passed to it either

    Unfortunately not – it seems that you can only “get”, not “set’ -alphabet, unless creating a new Bio::SeqI object as in the last bit of code. Something of a Bioperl flaw, I agree.

  3. Joerg_H

    Hi,

    great, I ran into the same problem yesterday ;-)

    This looks indeed like a bug, since you can specify the alphabet for a Bio::SeqIO object (below), but it is ignored. The following code snippet is based on an example from the BioPerl website, and it will bail out on a sequence such as yours – when you print the alphabet, it’s DNA!

    my $infile, $outfile;
    my $seq_in = Bio::SeqIO->new(‘-file’ => ” “fasta”,
    ‘-alphabet’ => ‘protein’);
    my $seq_out = Bio::SeqIO->new(‘-file’ => “>$outfile”,
    ‘-format’ => “swiss”);

    while (my $inseq = $seq_in->next_seq) {
    $seq_out->write_seq($inseq);
    }

  4. nsaunders Post author

    There’s some discussion in the Bioperl mailing lists about whether Bio::SeqIO should validate. A resounding “yes” from me. It would make dealing with user input to CGI scripts, for example, much easier.

    I didn’t know that Nestlé had a bioinformatics division! Do they pay well? Are there many openings? Is there an Australian branch?

  5. Chris Fields

    This was posted as a bug and has been fixed in CVS. The alphabet wasn’t set in sequence stream object initialization. This should be included in the final 1.5.2 developer release this week.

    I would like to point out in our defense that, once this bug was posted on the BioPerl Bugzilla, the problem was addressed, tested, and committed to CVS within 24 hours of the original bug notification (if only all of them were this simple). Unless we are notified of a bug or a problem, we can’t fix it!

    Sequence validation: this has been requested many times before. The current thought on this is to have a set of validation classes capable of checking the sequences prior to parsing, the idea being this would keep the parsers focused on parsing issues, while validation classes could deal with validation issues. Validation could then be implemented within SeqIO using these new classes. However, no one has had the time to implement this scheme. We are always looking for help if anyone wants to chip in!

  6. nsaunders Post author

    Thanks for all the comments Chris. I’m always impressed by the responses of the Bioperl community.

    Now that I think about validation after opening my mouth, I appreciate the problems. It would make code slower, for one. Also, would we validate the whole file or just look for indications that the format is most likely correct? In some ways, the current “read first, ask questions later” approach makes sense – in other words, don’t throw errors until you try to do something with the sequence and find that you can’t.

    I guess I was thinking along the lines of e.g. a CGI form which asks for a particular format such as fasta – validation here would be useful to check that the user hasn’t uploaded swissprot, or junk, or code to hack the server.

  7. Chris Fields

    I think, if validation were separated into another set of classes, this would be pretty straightforward. The sequence could pass through a validator first to check format, then be parsed or used as input for downstream apps once validation passed. If you were interested only in making sure a record is valid (such as for input for a downstream app), you could use just the validator and not pass through SeqIO.

    Putting these together is not a trivial task, but possible.

  8. Joerg_H

    @Chris: I have to contradict … you fixed it in less that 12 hours :-)) … now I just hope that the next “packaged” release will come soon, since we do not – never! – use cvs versions for production systems.

    > I didn’t know that Nestlé had a bioinformatics division!

    Yes they have, and a quite active one ;-). A little bit of info: http://www.research.nestle.com/

    > Do they pay well?

    “We get much less than we deserve” :-) Probably less than in Pharma industry anyway, but OTOH we do not need a second job either ;-)

    > Are there many openings? Is there an Australian branch?

    Not many – currently we have exactly one job opening, but these are all in Lausanne.

    Cheers!

  9. Chris Fields

    From the Bioperl mail list:

    “I am proud to announce the final release of Bioperl 1.5.2.”

    http://www.bioperl.org/wiki/Release_1.5.2

    I’ll note we’re still in a developer cycle, but this release is quite stable. There are a few experimental modules that need to be worked out a bit more but the core modules work well.

Comments are closed.