Bacteria and Alzheimer’s disease: I just need to know if ten patients are enough

You can guarantee that when scientists publish a study titled:

Determining the Presence of Periodontopathic Virulence Factors in Short-Term Postmortem Alzheimer’s Disease Brain Tissue

a newspaper will publish a story titled:

Poor dental health and gum disease may cause Alzheimer’s

Without access to the paper, it’s difficult to assess the evidence. I suggest you read Jonathan Eisen’s analysis of the abstract. Essentially, it makes two claims:

  • that cultured astrocytes (a type of brain cell) can adsorb and internalize lipopolysaccharide (LPS) from Porphyromonas gingivalis, a bacterium found in the mouth
  • that LPS was also detected in brain tissue from 4/10 Alzheimer’s disease (AD) cases, but not in tissue from 10 matched normal brains

Regardless of the biochemistry – which does not sound especially convincing to me[1] – how about the statistics?

LPS was detected in 0/10 normal brains, compared with 4/10 AD brains. The “tl;dr” version of this discussion – if you think that those look like rather small numbers, you’re correct.

We can set up a matrix in R to contain those values.

ad <- matrix(c(4, 6, 0, 10), nrow = 2)
colnames(ad) <- c("AD", "Norm")
rownames(ad) <- c("lps+", "lps-")

#      AD Norm
# lps+  4    0
# lps-  6   10

Are those proportions significantly different? Or to put it another way: “I just need to know if 3 (10) patients are enough.” Let’s talk about statistical power.

Without going too deeply into the mathematics, the power of a statistical test is a number between 0 and 1: the probability of detecting a difference that really exists, which is 1 minus the probability of a type II (false negative) error. For example, when power = 0.8, the probability of a false negative (concluding no difference between groups when in fact, there is one) is 0.2.

We’re looking at proportions in two groups (a two-proportion test), where the power depends on several parameters:

  • sample size (n, per group)
  • the proportions in each group (p1 and p2)
  • probability of type I (false positive) error (sig.level)
  • whether the test is one- or two-sided (alternative)

R provides us with the function power.prop.test():


     power.prop.test(n = NULL, p1 = NULL, p2 = NULL, sig.level = 0.05,
                     power = NULL,
                     alternative = c("two.sided", "one.sided"),
                     strict = FALSE)

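How it works: exactly one of the parameters n, p1, p2, sig.level or power must be NULL, and it is calculated from the others. Since sig.level defaults to 0.05, in practice you simply leave out whichever of n, p1, p2 or power you want calculated.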
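
As a small illustration of that interface, here is a sketch using made-up values (n = 50, p1 = 0.5, p2 = 0.75 are purely illustrative and have nothing to do with the study). It shows that the result is a list-like object whose elements can be extracted with $, and that solving for sig.level requires setting it to NULL explicitly.

```r
# Illustrative values only, not taken from the paper
res <- power.prop.test(n = 50, p1 = 0.5, p2 = 0.75)

class(res)   # an object of class "power.htest"
names(res)   # includes n, p1, p2, sig.level and power
res$power    # the parameter that was left as NULL, now filled in

# Solving for sig.level instead: it defaults to 0.05,
# so it has to be set to NULL by hand
alpha_needed <- power.prop.test(n = 50, p1 = 0.5, p2 = 0.75,
                                power = 0.9, sig.level = NULL)$sig.level
alpha_needed
```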

To get started – what’s the power of the study in the publication, using sig.level = 0.05?

ppt <- power.prop.test(n = 10, p1 = 0, p2 = 0.4)
ppt$power
# [1] 0.6250675

Effectively, what that means is that the probability of a false negative (concluding no difference in LPS detection between normal and AD brains when there is a difference) is about 0.375. That’s rather high.
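
power.prop.test() relies on a normal approximation, which is shaky with only 10 samples per group. As a rough cross-check, the sketch below enumerates every possible pair of counts and adds up the probability of those outcomes where Fisher's exact test gives p < 0.05. The helper exact_power() is hypothetical (written for this post, not part of R), and using Fisher's test is an assumption, since we don't know which test the paper used.

```r
# Exact power of a two-sided Fisher's exact test for two groups of size n,
# assuming true proportions p1 and p2 (hypothetical helper, not a base R function)
exact_power <- function(n, p1, p2, alpha = 0.05) {
  outcomes <- expand.grid(x1 = 0:n, x2 = 0:n)
  # probability of each (x1, x2) outcome under the assumed proportions
  prob <- dbinom(outcomes$x1, n, p1) * dbinom(outcomes$x2, n, p2)
  # Fisher p-value for each possible 2 x 2 table
  pval <- mapply(function(x1, x2) {
    fisher.test(matrix(c(x1, n - x1, x2, n - x2), nrow = 2))$p.value
  }, outcomes$x1, outcomes$x2)
  # power = total probability of the outcomes that reject at level alpha
  sum(prob[pval < alpha])
}

pw <- exact_power(n = 10, p1 = 0, p2 = 0.4)
pw  # about 0.37 under these assumptions
```

Under these assumptions the exact figure is lower than the approximate 0.625, because with 0/10 in one group Fisher's test only rejects once the other group reaches 5/10. That only strengthens the point about small numbers.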

Most researchers set power = 0.8 as an acceptable threshold. So – how many samples per group do we need to achieve power = 0.8 at sig.level = 0.05?

ppt <- power.prop.test(p1 = 0, p2 = 0.4, power = 0.8)
ppt$n
# [1] 14.45958

About 15 samples. Not many more than 10 – but more than 10, nevertheless. How about something more stringent: power = 0.9, sig.level = 0.01?

ppt <- power.prop.test(p1 = 0, p2 = 0.4, power = 0.9, sig.level = 0.01)
ppt$n
# [1] 27.16856
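
Since you can't recruit a fraction of a brain, the calculated n should be rounded up. A minimal sketch that tabulates the required group size for a few combinations of power and sig.level (the particular grid of values is my choice, for illustration):

```r
# Required samples per group for p1 = 0, p2 = 0.4
req <- expand.grid(power = c(0.8, 0.9), sig.level = c(0.05, 0.01))
req$n <- mapply(function(pw, a) {
  power.prop.test(p1 = 0, p2 = 0.4, power = pw, sig.level = a)$n
}, req$power, req$sig.level)

# round up to whole samples
req$n_rounded <- ceiling(req$n)
req
```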

Note that with larger sample sizes (e.g. 100 per group), the proportion of normal brains containing LPS can be quite high and the study would still have power = 0.8 to detect a difference from AD brains (at sig.level = 0.05):

ppt <- power.prop.test(p2 = 0.4, power = 0.8, n = 100)
ppt$p1
# [1] 0.2180086

Finally, a plot showing the increase of power with group sample size at sig.level = 0.05 and sig.level = 0.01:

Power versus sample size at p = 0.01, p = 0.05

# quick, dirty, ugly, but it works
library(ggplot2)

# power for group sizes 10 to 30 at sig.level = 0.05 (the default)
df1 <- data.frame(n = 10:30,
                  p = sapply(10:30, function(x) power.prop.test(p1 = 0, p2 = 0.4, n = x)$power),
                  s = 0.05)
# the same at sig.level = 0.01
df2 <- data.frame(n = 10:30,
                  p = sapply(10:30, function(x) power.prop.test(p1 = 0, p2 = 0.4, n = x, sig.level = 0.01)$power),
                  s = 0.01)
df3 <- rbind(df1, df2)
ggplot(df3) + geom_point(aes(n, p, color = factor(s))) + theme_bw()

So much for power. How about testing for a difference between the groups?

You might be tempted to reach for the two-proportion test, implemented in R as prop.test(). You should not – but here’s the result of prop.test(c(0, 4), c(10, 10)) anyway:


	2-sample test for equality of proportions with continuity correction

data:  c(0, 4) out of c(10, 10)
X-squared = 2.8125, df = 1, p-value = 0.09353
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.803636315  0.003636315
sample estimates:
prop 1 prop 2 
   0.0    0.4 

Warning message:
In prop.test(c(0, 4), c(10, 10)) :
  Chi-squared approximation may be incorrect

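Note the warning message. That’s telling you that the counts are too small for the chi-squared approximation behind the test to be reliable. The so-called Cochran conditions stipulate that every cell should have an expected count of at least 1 and that at least 80% of cells should have expected counts of at least 5. Here the expected count in each LPS-positive cell is only 2 (4 positives spread evenly over two groups of 10), so half of the cells fall short. Some power calculators, such as this one, will tell you when these assumptions have been violated.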
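
To see where the approximation breaks down, the expected counts under the null hypothesis can be worked out from the margins of the table (row total × column total / grand total). A short sketch, rebuilding the ad matrix from earlier so that it runs on its own:

```r
ad <- matrix(c(4, 6, 0, 10), nrow = 2,
             dimnames = list(c("lps+", "lps-"), c("AD", "Norm")))

# expected count for each cell if LPS detection were independent of group
expected <- outer(rowSums(ad), colSums(ad)) / sum(ad)
expected

# fraction of cells with an expected count of at least 5
frac_ok <- mean(expected >= 5)
frac_ok
```

Only half of the cells reach an expected count of 5, well short of the 80% that the rule of thumb asks for.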

The alternative is Fisher’s exact test, fisher.test(ad):


	Fisher's Exact Test for Count Data

data:  ad
p-value = 0.08669
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.7703894       Inf
sample estimates:
odds ratio 
       Inf 
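
For the curious, the Fisher p-value can be reproduced by hand from the hypergeometric distribution: given that 4 of the 20 brains are LPS-positive, how likely is it that all 4 fall in the AD group? This is a sketch of the usual two-sided definition (sum the probabilities of all tables no more likely than the observed one); the small tolerance factor is my own guard against floating-point ties.

```r
# 20 brains: 10 AD, 10 normal; 4 are LPS-positive in total.
# Probability that x of those 4 positives are AD brains, for x = 0..4
probs <- dhyper(0:4, m = 10, n = 10, k = 4)

# probability of the observed table (all 4 positives in the AD group)
p_obs <- dhyper(4, m = 10, n = 10, k = 4)

# two-sided p-value: sum over tables at least as unlikely as the observed one
p_two_sided <- sum(probs[probs <= p_obs * (1 + 1e-7)])
p_two_sided
```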

In summary then: not only is the conclusion that “LPS from periodontal bacteria can access the AD brain during life” rather premature, but this study “lacks power”. We cannot say whether LPS detection in the AD brains is significantly different to that in normal brains.

[1] and I used to be a biochemist (D. Phil., 1997)

4 thoughts on “Bacteria and Alzheimer’s disease: I just need to know if ten patients are enough”

  1. Actually, the test of proportions is not bad even when the Cochran conditions are violated — Fisher’s exact test is pretty conservative in that setting — but the conclusion is pretty much the same.

    With a bit more work in R you can look at the complete sampling distribution of the test statistic under the hypothesis p1=p0=p for a dense grid of values of p between 0 and 1, and take the most pessimistic p-value. That’s also an exact (ie, conservative) test and it turns out to agree better with the ordinary prop.test() than it does with fisher.test(). The advantage of fisher.test() is not that the false positive rate is closer to 0.05 but that it guarantees the false positive rate is never greater than 0.05.
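
One possible reading of the procedure described in this comment, as a sketch only: the choice of the pooled z statistic, the grid spacing and the tolerance are all assumptions of mine, not details given by the commenter.

```r
n <- 10
outcomes <- expand.grid(x1 = 0:n, x2 = 0:n)

# pooled two-proportion z statistic (absolute value); 0 when undefined
z_stat <- function(x1, x2, n) {
  pooled <- (x1 + x2) / (2 * n)
  se <- sqrt(pooled * (1 - pooled) * 2 / n)
  ifelse(se == 0, 0, abs(x1 - x2) / n / se)
}

stat <- z_stat(outcomes$x1, outcomes$x2, n)
obs  <- z_stat(4, 0, n)

# outcomes at least as extreme as the observed 4/10 versus 0/10
extreme <- stat >= obs - 1e-9

# p-value for each common proportion p on a dense grid
p_grid <- seq(0.001, 0.999, by = 0.001)
p_values <- sapply(p_grid, function(p) {
  sum(dbinom(outcomes$x1[extreme], n, p) * dbinom(outcomes$x2[extreme], n, p))
})

# the most pessimistic p-value over the grid
p_max <- max(p_values)
p_max
```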

  2. Thanks for this nice refresh of power analysis. I mainly agree with you. Nevertheless I have a few comments.
    I didn’t read the paper, but I guess the authors showed a statistically significant p-value. Looking for statistical significance leads to the choice of the alternative hypothesis. Because your computation does not state it, the alternative hypothesis is taken as “two.sided”. When using the prop.test with an alternative hypothesis “greater”, the 5 percent threshold is crossed [a]. I guess this allowed the referees to accept the result.
    Given the ad matrix, ad must be transposed in the prop.test formula. This leads to the result you show. prop.test(ad) does not.
    Using “greater” as alternative hypothesis gives more acceptable figures. Power is about 0.75 and n is now 12. [b]
    The Cochran conditions are interestingly discussed elsewhere. They concern the **expected** counts, not the observed ones.
    Of course, whether the test should be two sided or not is still controversial: What is more clear is the misuse of p-value:

    	2-sample test for equality of proportions with continuity correction
    data:  t(ad)
    X-squared = 2.8125, df = 1, p-value = 0.04677
    alternative hypothesis: greater
    95 percent confidence interval:
     0.04518037 1.00000000
    sample estimates:
    prop 1 prop 2 
       0.4    0.0 
    Warning message:
    In prop.test(t(ad), alternative = "greater") :
      Chi-squared approximation may be incorrect
    [1] 0.7525941
    [1] 11.26908
    • Thanks for the comment; yes, t(ad) is required for prop.test.

      Not many people have read the paper – it does not seem to be readily-accessible for most. They may, as you suggest, have used one-sided greater.
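
The one-sided figures quoted in the second comment can be reproduced along these lines. The calls are inferred from the output shown there (the warning message names prop.test(t(ad), alternative = "greater")); the power.prop.test() arguments are my assumption about what produced the last two numbers.

```r
ad <- matrix(c(4, 6, 0, 10), nrow = 2,
             dimnames = list(c("lps+", "lps-"), c("AD", "Norm")))

# one-sided test of proportions; t(ad) puts groups in rows,
# with successes (lps+) and failures (lps-) in columns
one_sided <- suppressWarnings(prop.test(t(ad), alternative = "greater"))
one_sided$p.value

# one-sided power at n = 10, and n needed for power = 0.8
pw_one <- power.prop.test(n = 10, p1 = 0, p2 = 0.4,
                          alternative = "one.sided")$power
n_one  <- power.prop.test(p1 = 0, p2 = 0.4, power = 0.8,
                          alternative = "one.sided")$n
c(power = pw_one, n = n_one)
```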
