The real meaning of spurious correlations

Like many data nerds, I’m a big fan of Tyler Vigen’s Spurious Correlations, a humorous illustration of the old adage “correlation does not equal causation”. Technically, I suppose it should be called “spurious interpretations”, since the correlations themselves are quite real, but then good marketing is everything.

There is, however, a more formal definition of the term spurious correlation, or more specifically, as the excellent Wikipedia page is now titled, spurious correlation of ratios. It describes the following situation:

  1. You take a bunch of measurements X1, X2, X3…
  2. And a second bunch of measurements Y1, Y2, Y3…
  3. There’s no correlation between them
  4. Now divide both of them by a third set of measurements Z1, Z2, Z3…
  5. Guess what? Now there is correlation between the ratios X/Z and Y/Z

It’s easy to demonstrate for yourself, using R to create something like the chart in the Wikipedia article.

First, create 500 observations for each of x, y and z.

library(dplyr)    # provides the %>% pipe used below
library(ggplot2)

set.seed(123)
spurious_data <- data.frame(x = rnorm(500, 10, 1),
                            y = rnorm(500, 10, 1),
                            z = rnorm(500, 30, 3))

Next, convince yourself that x and y are uncorrelated.

cor(spurious_data$x, spurious_data$y)
# [1] -0.05943856
spurious_data %>% ggplot(aes(x, y)) + 
  geom_point(alpha = 0.3) + 
  theme_bw() + 
  labs(title = "Plot of y versus x for 500 observations with N(10, 1)")

[Figure: plot of y versus x for 500 observations with N(10, 1)]

Finally, repeat the correlation and the plot after dividing both x and y through by z.

cor(spurious_data$x / spurious_data$z, spurious_data$y / spurious_data$z)
# [1] 0.4517972
spurious_data %>% ggplot(aes(x/z, y/z)) + 
  geom_point(aes(color = z), alpha = 0.5) +
  theme_bw() + 
  geom_smooth(method = "lm") + 
  scale_color_gradientn(colours = c("red", "white", "blue")) + 
  labs(title = "Plot of y/z versus x/z for 500 observations with x,y N(10, 1); z N(30, 3)")

[Figure: plot of y/z versus x/z for 500 observations with x,y N(10, 1); z N(30, 3)]

This effect is reasonably intuitive: because each pair of values shares the same divisor, larger values of z push both x/z and y/z towards lower values, while smaller values of z push both towards higher values (and since z is much larger than x or y here, the ratios all fall between 0 and 1). Visually, though, it’s quite a striking effect, and it becomes even more pronounced if we increase the standard deviation of z.
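In fact, Karl Pearson, who first described this effect, also gave an approximation for the size of the induced correlation in terms of the coefficients of variation of x, y and z, assuming the three variables are mutually uncorrelated. Here is a quick sketch of that approximation in R (the function name is my own):

ratio_cor_approx <- function(cv_x, cv_y, cv_z) {
  # Pearson's approximation for cor(x/z, y/z) when x, y and z are uncorrelated;
  # each argument is a coefficient of variation (sd / mean)
  cv_z^2 / sqrt((cv_x^2 + cv_z^2) * (cv_y^2 + cv_z^2))
}

ratio_cor_approx(1/10, 1/10, 3/30)
# [1] 0.5    close to the 0.45 observed above
ratio_cor_approx(1/10, 1/10, 6/30)
# [1] 0.8    the prediction for the larger standard deviation used next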

spurious_data$z <- rnorm(500, 30, 6)
cor(spurious_data$x / spurious_data$z, spurious_data$y / spurious_data$z)
# [1] 0.8424597
spurious_data %>% ggplot(aes(x/z, y/z)) + 
  geom_point(aes(color = z), alpha = 0.5) + 
  theme_bw() + 
  geom_smooth(method = "lm") + 
  scale_color_gradientn(colours = c("red", "white", "blue")) + 
  labs(title = "Plot of y/z versus x/z for 500 observations with x,y N(10, 1); z N(30, 6)")

[Figure: plot of y/z versus x/z for 500 observations with x,y N(10, 1); z N(30, 6)]
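To convince yourself that the shared divisor is doing all the work, divide x and y by two independent divisors instead and the correlation disappears. A quick check (z2 is a new variable introduced here just for this illustration; it draws fresh random numbers, so your exact value will differ):

z2 <- rnorm(500, 30, 6)   # a second, independent divisor, unrelated to z
cor(spurious_data$x / spurious_data$z, spurious_data$y / z2)
# close to zero: the induced correlation vanishes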

That plot of y/z versus x/z looks an awful lot like a chart I found recently in a published article that I was reading at work. Note that both axes are rates between 0 and 100 (percentages), with each point representing a school. Were the x and y values for each point divided by the total number of pupils at that school? I’ll have to go back and check the methodology.

[Figure: predicted versus actual rates (percentages) for schools, from the published article]

Should you wish to compare ratios or “relative” measurements, consult this reference and take a look at the R package propr, which implements methods for proportionality.
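To give a flavour of propr, here is a minimal sketch, assuming the interface described in its vignette, where perb() computes the rho proportionality metric; the count matrix here is entirely made up:

library(propr)

# made-up example data: 20 samples (rows) by 50 features (columns) of counts
counts <- matrix(rpois(20 * 50, lambda = 100), nrow = 20)

pr <- perb(counts)    # rho proportionality for every pair of features
pr@matrix[1:5, 1:5]   # pairwise rho values for the first five features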

11 thoughts on “The real meaning of spurious correlations”

  1. Hi, thanks for a great post.

    Maybe you could recommend some articles about the use of proportionality versus causality, with examples from the social sciences rather than biology?

    • I wish I could, but technically I’ve only been in social science for 4 weeks :) I’m sure examples abound in the literature, though; the example I posted was from a study on predicting school examination performance.

  2. Sorry for not fitting this into one comment. An additional question: in the vignette, the authors state that you should switch to ‘perb’ or ‘phi’ when dealing with proportions, but they don’t discuss the interpretation of these new metrics. Suppose I know that the correlation of X and Y is 0.95; then I’ll probably say that these two variables are highly correlated (in other words, connected). But if I get a perb of 0.95, I’d probably say that those variables are highly proportional. Will their interconnection (interdependence) follow from that (I’m not speaking about causality here)?

    • Konstantin, thanks for checking out the `propr` package. You raise a great question, “How proportional is proportional enough?” Unfortunately, this is hard to answer. Without a hypothesis testing framework, I use the `smear` function to try to visually identify an agreeable cut-off. As a collaborator joked, “Just keep lowering the cut-off until you get bored or run out of grant money.” Keep in mind that, like correlation, the false discovery rate will depend on the number of samples and features in your data. We recommend > 0.98 as very safe. As Neil hinted, you can reach me on Twitter (@tpq__) if you have any questions.

  3. Interesting post,

    and not something I had thought of before. One thing, though: on the figure of predicted vs actual that you cite, you subsequently state “… Note that both axes are rates between 0-100 (percentages), suggesting that values were divided by a common divisor….” The figure caption states these are rates, and therefore I think it’s safe to assume the values were divided by a common divisor. So there’s no reason to wonder. As such, it’s perhaps a little less relevant to your observations, which involve dividing not by a constant, as in the case of a rate, but by a variable. Regardless, still an interesting post.

    • I’d need to refer back to the publication, but I believe that each point is a school, the divisor being total students for that school. So they probably are dividing by a variable.

      • That makes it a doubly interesting post ;-). You would *hope* that standardized rates would be used for something like this, but you may well be right. If they didn’t standardize the rates, then the divisor would vary. And depending on where this was done, if the divisor was total students and the points represent individual schools, there could be wide variability in that divisor, which, as you point out, makes the effect you’ve highlighted even more dramatic.

        • As I say, I’ll need to go back and check the methodology. I was just struck initially by the high similarity between that chart and the simulated correlated ratios data.

  4. Neil, very interesting post on a very interesting effect!
    While I am not the greatest data science expert, I kind of disagree with you on the school data example. Even if they plot percentages as you assume (and I think they do), this is probably the correct approach. The reason is that, by contrast with your randomized examples, the data are the *success rates*, not the numbers of successful students. The prediction is not for a particular number of students to pass, but for a particular success rate.
    If (as you suggest) the authors plotted the number of students rather than the success rates, you would get a rather meaningless graph: there would be a trivial positive correlation, because bigger schools will have more students passing and also more students predicted to pass. I think that in this situation, plotting the percentages is the right thing to do.

    • You may well be correct! I rather regret including this example, since I have not fully investigated their methodology. I just thought at first glance, the high similarity to the simulated data chart plus the use of percentages was interesting and worth following up. Whether it is a genuine example of spurious correlation, I do not yet know.
