Spent much of the day pondering HMMER, weak motifs and how to extract information from HMMER alignments to Pfam models. Many proteins contain short, weakly conserved motifs. The one I’m working on at the moment contains a “GEL” motif: actually six residues that often, but not always, begin with “GEL”.
The Pfam profile for this protein is a little different – “GLrlldL”. In the case where my protein does contain GEL, it tends to align to the profile like so:
There are other cases that anchor quite nicely:
But then there are cases where a regex fails to detect GEL, yet the region still aligns to the profile:
GLrlldL ++ r+ TVyRVSR
One question would be: in the latter case, do we trust the Pfam alignment over the established regex?

Another problem: it would be useful to be able to pull these short regions out of the alignment. Unfortunately, BioPerl’s Bio::SimpleAlign object doesn’t extend to HMMER alignments yet. Extracting with string functions is also quite tricky, as gaps may appear at the site in either the profile (where they show as “.”) or the query (where they show as “-”). So: stick with messy regexps, or parse the HMMER alignment and rewrite it as something usable? You guessed it, the latter.
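To make the gap problem concrete, here is a minimal sketch in Python (not BioPerl, and not the parser I ended up writing). It assumes the profile and query lines of one alignment block have already been pulled out of the HMMER output as two equal-length strings; the function name `extract_region` and the example strings are made up for illustration. The idea is to locate the motif in the profile line with its “.” columns ignored, map that back to alignment columns, and then slice the same columns out of the query.

```python
def extract_region(model_line, query_line, motif):
    """Return the query residues aligned to `motif` in the profile line.

    Hypothetical helper: model_line and query_line are the profile and
    query rows of one alignment block, already the same length.
    Returns (aligned, ungapped) or None if the motif is not found.
    """
    if len(model_line) != len(query_line):
        raise ValueError("aligned lines must be the same length")

    # Alignment columns where the profile has a real consensus character,
    # i.e. everything except the '.' columns (insertions in the query).
    model_cols = [i for i, c in enumerate(model_line) if c != "."]
    ungapped_model = "".join(model_line[i] for i in model_cols)

    # Search the gap-free profile string (case-sensitive, as printed).
    start = ungapped_model.find(motif)
    if start == -1:
        return None

    # Map motif start/end back to alignment columns.
    first = model_cols[start]
    last = model_cols[start + len(motif) - 1]

    # Slice the query over the same columns; '-' marks query deletions.
    aligned = query_line[first:last + 1]
    return aligned, aligned.replace("-", "")


# Invented toy strings, not real HMMER output.
model = "aaGLr.lldLbb"
query = "xxGELQ-RKVyy"
print(extract_region(model, query, "GLrlldL"))
```

The function returns both the aligned slice (gaps kept) and the ungapped residues, so the same call covers cases where the query has an insertion or a deletion inside the motif.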