I’ve been employing an equation in one of my Perl modules. It’s used to convert a matrix of frequencies at positions in a sequence to a matrix of weights; you see this called a position-weight matrix (PWM) or position site-specific matrix (PSSM).

The equation looks like this (behold, the WordPress LaTeX plugin):

where f(b,i) is frequency of base at position i, N is the sum of frequencies for a column and p(b) is the prior probability of the base in the sequence. Those terms with N are a statistical fudge called a pseudocount.

The thing is – I’ve been using it somewhat empirically – which means that it seems to do what I want, but I’m not confident that my usage is justifiable in terms of the theory. So if you’re a PSSM expert, here are 3 questions for you:

- If I were to apply this equation to protein sequences, rather than DNA, I’d assume that simply replacing ‘4’ with ’20’ is all that’s required?
- Nowhere have I read that N must be equal for all columns. Normally it is, because a frequency matrix is derived from an alignment and so N is just the number of sequences. But suppose that each column can be derived independently from a different number of sequences? Is there any objection to non-equal N?
- I’m not finding it easy to track down literature that describes this or similar equations, via PubMed or Google (“Google equation search” would be nice). Anyone care to recommend a review or even some documentation? I’m sure I saw this in an R-package once.

If anyone comments this post I’ll be (a) amazed and (b) eternally grateful.

### Like this:

Like Loading...

*Related*

No, replacing 4 with 20 is not sufficient.

In contrast to nucleotides, residues have positively-scoring matches with different values.

The equivalent of the BLOSUM62 or PAM90 matrix

for nucleotides has the same value for all diagonal

and for all off-diagonal elements.

You might be interested in

Nucleic Acids Res. 2001 Jul 15;29(14):2994-3005

Amazed and eternally grateful.

And still confused. The sqrt(N)/4 term is a pseudocount, right? It just adds something to the frequency count to “correct” low frequencies. And the assumption in this case is that the pseudcount is shared equally between all 4 bases. I can see that using prior probabilities rather than 1/4 might be better, but I’m not clear what this has to do with substitution matrices such as BLOSUM.

Depending on your broader purposes, you might consider using the HMMER suite to compute these kinds of scores, instead. It has quite an elaborate and effective method for computing them, particularly from protein alignments. For instance, in order to avoid overrepresentation from phylogenetically close homologs, it clusters the sequences in the alignment by sequence similarity, and weights their contributions to the score accordingly. I’m happy to explain how to parse the scores out of HMMER files, if it would help.

Full disclosure: I’m doing a postdoc with Sean Eddy, the author of HMMER.

Thanks for the HMMER info. I’m a big fan of HMMER – I use it a lot as part of this project, running it through BioPerl wrappers and parsers. Another group in the field has used it for a similar purpose to mine so there’s a bit of an originality issue there.

I should probably expand a little on my “sequences”. They are actually heptapeptides, representing motifs from a set of proteins that are related by their interaction with different protein kinase families. So from an alignment (fasta format) of heptapeptides I want to ask the question: how “significant” is the amino acid frequency at each position 1 – 7? In other words, p(b) in the equation is the frequency of an amino acid in that set of sequences, f(b,i) the frequency at positions 1 – 7 in the heptapeptide motif. All I really need to know is whether the equation above can be adapted for proteins rather than nucleic acids and if so how, particularly with respect to pseudocounts.