I’ve been using an equation in one of my Perl modules. It converts a matrix of frequencies at positions in a sequence to a matrix of weights; you’ll see the result called a position weight matrix (PWM) or position-specific scoring matrix (PSSM).
The equation looks like this (behold, the WordPress LaTeX plugin):
where f(b,i) is the frequency of base b at position i, N is the sum of the frequencies in a column and p(b) is the prior probability of base b in the sequence. The terms involving N are a statistical fudge called a pseudocount.
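In case the rendered equation doesn’t show up for you, here is a rough sketch of the calculation for a single column, in Python rather than Perl. It assumes one commonly used form, w(b,i) = log2(((f(b,i) + sqrt(N)/4) / (N + sqrt(N))) / p(b)), which may differ in detail from the equation in the post. The function name `column_weights` and the example counts are made up for illustration.

```python
import math


def column_weights(counts, priors=None):
    """Convert one column of base counts into log2 weights.

    Assumed form (not confirmed against the original equation):
        w(b) = log2(((f(b) + sqrt(N)/k) / (N + sqrt(N))) / p(b))
    where k is the alphabet size (4 for DNA).
    """
    bases = sorted(counts)
    k = len(bases)
    if priors is None:
        # Assumption: uniform background, p(b) = 1/k.
        priors = {b: 1.0 / k for b in bases}

    n = sum(counts.values())  # N: the column total
    if n == 0:
        raise ValueError("column has no observations")

    pseudo = math.sqrt(n)  # total pseudocount mass, shared equally by all bases
    weights = {}
    for b in bases:
        corrected = (counts[b] + pseudo / k) / (n + pseudo)
        weights[b] = math.log2(corrected / priors[b])
    return weights


# Hypothetical column from an alignment of 9 sequences.
column = {"A": 8, "C": 0, "G": 1, "T": 0}
weights = column_weights(column)
for base, w in weights.items():
    print(f"{base}: {w:.3f}")
```

The point of the pseudocount is visible in the bases with a count of zero: they get a finite negative weight rather than log2(0).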
The thing is, I’ve been using it somewhat empirically: it seems to do what I want, but I’m not confident that my usage is justifiable in terms of the theory. So if you’re a PSSM expert, here are three questions for you:
- If I were to apply this equation to protein sequences rather than DNA, am I right to assume that simply replacing ‘4’ with ‘20’ is all that’s required?
- Nowhere have I read that N must be equal for all columns. Normally it is, because a frequency matrix is derived from an alignment and so N is just the number of sequences. But suppose that each column can be derived independently from a different number of sequences? Is there any objection to non-equal N?
- I’m not finding it easy to track down literature that describes this or similar equations, via PubMed or Google (“Google equation search” would be nice). Would anyone care to recommend a review, or even some documentation? I’m sure I saw this in an R package once.
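To make the first two questions concrete, here is a hedged Python sketch of what I mean. It assumes the equation has the form log2(((f + sqrt(N)/k) / (N + sqrt(N))) / p), with the ‘4’ replaced by the alphabet size k and N computed separately for each column. Whether that is justified in theory is exactly what I’m asking; the code only shows that nothing breaks mechanically. The function name `pwm_from_counts` and the counts are invented for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pwm_from_counts(columns, alphabet, priors=None):
    """Build a weight matrix where each column uses its own N.

    columns  : list of dicts mapping symbol -> count (missing symbols count as 0)
    alphabet : string of allowed symbols; its length k replaces the '4'
    priors   : optional dict of background probabilities (assumed uniform if None)
    """
    k = len(alphabet)
    if priors is None:
        priors = {s: 1.0 / k for s in alphabet}

    matrix = []
    for counts in columns:
        n = sum(counts.get(s, 0) for s in alphabet)  # N for this column only
        if n == 0:
            raise ValueError("column has no observations")
        pseudo = math.sqrt(n)
        matrix.append({
            s: math.log2(((counts.get(s, 0) + pseudo / k) / (n + pseudo)) / priors[s])
            for s in alphabet
        })
    return matrix


# Hypothetical columns derived from different numbers of sequences (16 and 4).
columns = [{"L": 16}, {"K": 3, "R": 1}]
pwm = pwm_from_counts(columns, AMINO_ACIDS)
print(round(pwm[0]["L"], 3), round(pwm[1]["K"], 3))
```

One visible consequence of unequal N: the column built from more sequences can reach a higher maximum weight, because its pseudocount is smaller relative to the observed counts.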
If anyone comments on this post I’ll be (a) amazed and (b) eternally grateful.