I’m a big fan of EMBOSS and I’m always finding new uses for it. Here’s a really simple fix that you might call “clean-up”.
Let’s say that you have a fasta file with a header like “>gi|15232491|ref|NP_188759.1|”. Run that through any EMBOSS application (e.g. iep) and you’ll get output such as:
IEP of NP_188759.1 from 1 to 348 Isoelectric Point = 8.8631
Hmm. The application has decided to strip down the fasta ID. What if we want to parse the output, grab the ID and match it to the original fasta sequence? Well, we could try some regex matching and string processing but that’s error-prone, especially if we don’t know in advance which ID formats we might be dealing with.
Seqret to the rescue. Seqret is a deceptively simple-looking EMBOSS app that can retrieve, read and write sequences. We can feed our fasta file to seqret like so:
seqret -sequence myfasta.fa -outseq myfasta2.fa
All that we’ve done is read in a fasta sequence and write it out again. However, because all EMBOSS apps strip fasta headers in the same way, the ID of our sequence in myfasta2.fa will read “>NP_188759.1”. Now when we pass myfasta2.fa to other EMBOSS apps, the IDs in their output will match the IDs in the file. If you wanted, it wouldn’t be hard to create e.g. a Perl hash mapping the original IDs in myfasta.fa to the stripped IDs in myfasta2.fa.
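Here’s a rough sketch of that mapping idea, written in Python rather than Perl. It assumes that seqret writes the sequences to myfasta2.fa in the same order as they appear in myfasta.fa, so that headers can be paired by position; I haven’t checked that for every input format, so treat it as an assumption. The function names (read_fasta_ids, map_ids) are made up for this example and are not part of EMBOSS.

```python
def read_fasta_ids(path):
    """Return the first whitespace-delimited token of each FASTA header, in file order."""
    ids = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                # An empty header (a bare ">") yields an empty ID.
                ids.append(fields[0] if fields else "")
    return ids


def map_ids(original_path, cleaned_path):
    """Map each stripped ID to its original ID, pairing headers by position.

    Assumes both files hold the same sequences in the same order.
    """
    original = read_fasta_ids(original_path)
    cleaned = read_fasta_ids(cleaned_path)
    if len(original) != len(cleaned):
        raise ValueError("files contain different numbers of sequences")
    return dict(zip(cleaned, original))
```

Calling map_ids("myfasta.fa", "myfasta2.fa") gives a dictionary keyed on the stripped IDs, which is the direction you want when looking up an ID parsed from EMBOSS output. If two original headers strip down to the same ID, the later one wins, so it’s worth checking that the dictionary has as many entries as the file has sequences.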