Validating sequence formats

I’ve been using BioPerl for years, and I’ve just discovered that it doesn’t actually validate sequence formats. Go on, try it yourself. Grab any file on your hard drive, say “feedlist.opml” from your feed reader, and try this:

use strict;
use warnings;
use Bio::SeqIO;
my $seqio = Bio::SeqIO->new('-file' => "feedlist.opml", '-format' => 'fasta');
while (my $seq = $seqio->next_seq) {
    print $seq->id, "\n";
}

No complaints. Not an error, a warning or an exception. Just “<?xml”, taken from the first line of the file and printed as if it were a sequence ID.

My faith in the universe is quite shaken by this discovery. It’s just…wrong.
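
One cheap guard is to sanity-check the file yourself before handing it to Bio::SeqIO. What follows is only a minimal sketch, assuming plain FASTA in which letters, “*” and “-” are the only legal sequence characters. The name looks_like_fasta is a hypothetical helper of my own, not a BioPerl function, and a real validator would need to be more forgiving (numbered lines, embedded spaces and so on).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): returns 1 if the file looks
# like FASTA, 0 otherwise. Assumes sequence lines hold only letters, '*'
# and '-', which is stricter than some real-world files.
sub looks_like_fasta {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    my $headers   = 0;    # number of '>' header lines seen
    my $seq_lines = 0;    # number of sequence lines seen
    while (my $line = <$fh>) {
        $line =~ s/\s+\z//;          # strip line ending and trailing whitespace
        next if $line eq '';         # ignore blank lines
        if ($line =~ /^>\S/) {       # header: '>' followed directly by an ID
            $headers++;
        }
        elsif ($headers && $line =~ /^[A-Za-z*\-]+\z/) {
            $seq_lines++;            # sequence line belonging to some record
        }
        else {
            close $fh;
            return 0;                # data before the first header, or an illegal character
        }
    }
    close $fh;
    # Require at least one header and at least one line of sequence
    return ($headers && $seq_lines) ? 1 : 0;
}
```

Calling looks_like_fasta("feedlist.opml") before Bio::SeqIO->new would let the script die with a sensible message instead of printing “<?xml”, because the first line is neither a header nor preceded by one.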

3 thoughts on “Validating sequence formats”

  1. Maybe you should try blasting your opml list to see what comes up. Maybe you’ll get some of the sequences coming out of the metagenomics projects ;)
    It is a bit scary, since I don’t think most of us actually check all the sequences by eye to see if there is something funny with them. There could be a “validate sequence” function to test sequences before using them.

    No error either for passing a zip file to make_index:
    use Bio::Index::Fasta;
    my $Index_File_Name = "test";
    my $inx = Bio::Index::Fasta->new('-filename' => $Index_File_Name, '-write_flag' => 'WRITE');
    $inx->make_index("archive.zip");  # any zip file will do; this file name is a placeholder

  2. “Maybe you should try blasting your opml list to see what comes up”

    Ha. Reminds me of that project where someone BLASTed the Oxford English Dictionary to find the longest protein “word”, which was “ensilists” at the time, as I remember. Update: the relevant bionet post is archived – showing my age!

    This issue hangs on Bio::Tools::GuessSeqFormat. Obviously, guessing a sequence format by parsing is difficult. Normally, we download sequences and know in advance what format they should be in. Validating formats has only become of interest to me now that I’m developing a web interface. You can’t rely on web users to enter sequences in the correct format, so how do you check them? It’s a tricky problem.
