The y-axis: to zero or not to zero

I don’t “do politics” at this blog, but I’m always happy to do charts. Here’s one that’s been doing the rounds on Twitter recently:

What’s the first thing that comes into your mind on seeing that chart?

It seems that there are two main responses to the chart:

  1. Wow, what happened to all those Democrat voters between 2008 and 2016?
  2. Wow, that’s misleading: it makes it look like Democrat support almost halved between 2008 and 2016.

The question then is: when (if ever) is it acceptable to start a y-axis at a non-zero value?

1. What would ggplot2 say?

Let’s get some data into ggplot2 and find out. There’s lots of publicly-available election data; I’m using Wikipedia pages such as this one for the 2008 US election.

library(tidyr)
library(ggplot2)
library(scales)

# popular vote, electoral college votes and turnout 1980-2016
elections <- data.frame(
  year    = c(1980, 1984, 1988, 1992, 1996, 2000, 2004, 2008, 2012, 2016),
  Rep.pop = c(43903230, 54455472, 48886097, 39104550, 39197469, 50456002, 62040610, 59948323, 60933504, 61201031),
  Dem.pop = c(35480115, 37577352, 41809074, 44909806, 47401185, 50999897, 59028444, 69498516, 65915794, 62523126),
  Rep.ec  = c(489, 525, 426, 168, 159, 271, 286, 173, 206, 306),
  Dem.ec  = c(49, 13, 111, 370, 379, 266, 251, 365, 332, 232),
  turnout = c(52.6, 53.3, 50.2, 55.2, 49.0, 51.2, 56.7, 58.2, 54.9, 53.7))

Let’s tidy that up a little so that the data are in “long format”, with one variable per column and one observation per row:

elections.1 <- gather(elections, key, value, -year)
elections.2 <- separate(elections.1, key, into = c("variable", "vote"), sep = "\\.")
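
To make the reshaping concrete, here is a minimal sketch of the same two steps on a three-election subset of the data. The name `small` is purely illustrative and not part of the original code.

```r
library(tidyr)

# Illustrative subset: popular vote only, 2008-2016 (values copied from above)
small <- data.frame(year    = c(2008, 2012, 2016),
                    Rep.pop = c(59948323, 60933504, 61201031),
                    Dem.pop = c(69498516, 65915794, 62523126))

# gather() stacks the two vote columns into key/value pairs;
# separate() then splits a key such as "Rep.pop" into "Rep" and "pop"
long <- separate(gather(small, key, value, -year),
                 key, into = c("variable", "vote"), sep = "\\.")

# One row per year/party combination: 3 years x 2 parties
print(long)
```

Each row now holds a single year, party, vote type and value, which is the shape ggplot2 needs in order to map `variable` to a fill or line colour.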

Now we try to replicate the chart seen on Twitter:

# Dodged bars of the popular vote, 2008-2016, with the default y-axis.
# Each "+" must end a line, otherwise R treats the next line as a new expression.
ggplot(subset(elections.2, vote == "pop" & year > 2004), aes(year, value)) +
  geom_bar(aes(fill = variable), stat = "identity", position = "dodge") +
  scale_x_continuous(breaks = seq(1980, 2016, 4)) +
  scale_fill_manual(values = c("blue", "red")) +
  scale_y_continuous(labels = comma) + theme_bw() +
  labs(title = "US Election Popular vote 2008 - 2016")

ggplot2 bar chart with default y-axis scale

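By default, ggplot2 includes zero on the y-axis of a bar chart, because each bar is drawn from zero up to its value. To replicate the Twitter chart, we add some extra options to scale_y_continuous: explicit limits, plus oob = rescale_none so that the bars, which start at zero and therefore fall partly outside those limits, are kept rather than dropped.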

# Same bars, but with the y-axis forced to run from 52 to 72 million
ggplot(subset(elections.2, vote == "pop" & year > 2004), aes(year, value)) +
  geom_bar(aes(fill = variable), stat = "identity", position = "dodge") +
  scale_x_continuous(breaks = seq(1980, 2016, 4)) +
  scale_fill_manual(values = c("blue", "red")) +
  scale_y_continuous(labels = comma, limits = c(52000000, 72000000), oob = rescale_none) +
  theme_bw() + labs(title = "US Election Popular vote 2008 - 2016")

ggplot2 bar chart with forced y-axis scale
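
A hedged aside on the truncation itself: an alternative that should give much the same picture is to leave the y scale alone and zoom in with `coord_cartesian()`, which changes the visible window without discarding any data. The sketch below assumes a small illustrative data frame, `votes`, holding the 2008-2016 popular vote in long format; it is not part of the original code.

```r
library(ggplot2)
library(scales)

# Illustrative data: popular vote 2008-2016, already in long format
votes <- data.frame(year     = rep(c(2008, 2012, 2016), times = 2),
                    variable = rep(c("Rep", "Dem"), each = 3),
                    value    = c(59948323, 60933504, 61201031,
                                 69498516, 65915794, 62523126))

# coord_cartesian() zooms the view rather than filtering the data,
# so no oob argument is needed on the y scale
p <- ggplot(votes, aes(year, value)) +
  geom_bar(aes(fill = variable), stat = "identity", position = "dodge") +
  scale_x_continuous(breaks = seq(2008, 2016, 4)) +
  scale_fill_manual(values = c("blue", "red")) +
  scale_y_continuous(labels = comma) +
  coord_cartesian(ylim = c(52000000, 72000000)) +
  theme_bw()
print(p)
```

Either route produces a truncated bar chart; the choice only affects how the truncation is expressed in code, not whether it is a good idea.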

Reactions to this chart split along the two lines described earlier. One is that it’s effective in showing the decline in the Democrat popular vote since 2008, whilst the Republican vote has stayed relatively stable. The other is that by truncating the y-axis, the chart misleads people into thinking that the Democrat vote in 2016 was around 60% of the 2008 figure, when it was actually around 90%. To be honest, I can see both points of view. Personally, my eye is drawn to the absolute values on the y-axis, but perhaps that is just me (and others like me).
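
The “around 60%” figure can be checked with a little arithmetic. The following is a rough sketch using the Democrat totals from the data above and the 52 million axis floor from the truncated chart; the variable names are illustrative.

```r
# Democrat popular vote, taken from the elections data above
dem_2008 <- 69498516
dem_2016 <- 62523126

# Lower y-axis limit used in the truncated bar chart
axis_floor <- 52000000

# Ratio of the actual vote totals
true_ratio <- dem_2016 / dem_2008

# Ratio of the visible bar heights once the axis starts at axis_floor
apparent_ratio <- (dem_2016 - axis_floor) / (dem_2008 - axis_floor)

print(round(c(true = true_ratio, apparent = apparent_ratio), 2))
```

With these numbers the true ratio comes out at roughly 0.90 and the apparent ratio at roughly 0.60, so the bars suggest a drop about four times larger than the one in the data.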

This article tells us that “it’s OK not to start your y-axis at zero”, but then states that “column and bar charts should always have zeroed axes”. They use a chart from the Twitter IPO as an example.

If you were waiting for the obligatory bad-mouthing of Excel, look no further than a follow-up Tweet by the chart author.

Onwards. What if we use a line chart instead?

# Same data as a line chart, with the default y-axis
ggplot(subset(elections.2, vote == "pop" & year > 2004), aes(year, value)) +
  geom_line(aes(color = variable)) + geom_point() +
  scale_x_continuous(breaks = seq(1980, 2016, 4)) +
  scale_color_manual(values = c("blue", "red")) +
  scale_y_continuous(labels = comma) +
  theme_bw() + labs(title = "US Election Popular vote 2008 - 2016")

ggplot2 line chart with default y-axis scale

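Now ggplot2 thinks that it’s fine to use a non-zero y-axis: with no bars anchored at zero, the default limits simply span the range of the data. The eye follows the slope of each line and no longer compares absolute heights.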
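
One way to see where the different defaults come from is to look at the data ggplot2 computes for each layer. This is a minimal sketch, again assuming an illustrative `votes` data frame rather than the full `elections.2`; `layer_data()` is ggplot2’s helper for inspecting a built layer.

```r
library(ggplot2)

# Illustrative data: popular vote 2008-2016, already in long format
votes <- data.frame(year     = rep(c(2008, 2012, 2016), times = 2),
                    variable = rep(c("Rep", "Dem"), each = 3),
                    value    = c(59948323, 60933504, 61201031,
                                 69498516, 65915794, 62523126))

bar_plot <- ggplot(votes, aes(year, value)) +
  geom_bar(aes(fill = variable), stat = "identity", position = "dodge")
line_plot <- ggplot(votes, aes(year, value)) +
  geom_line(aes(color = variable))

# Data as computed for the first (and only) layer of each plot
bar_data  <- layer_data(bar_plot)
line_data <- layer_data(line_plot)

# Bars are rectangles with a base (ymin) as well as a top (ymax)
print(unique(bar_data$ymin))

# Lines have only a y value, so nothing pulls the axis down to zero
print("ymin" %in% names(line_data))
```

The bar layer carries a `ymin` of zero for every bar, and the y scale has to cover it; the line layer has no such column, so its scale covers only the observed values.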

How does the line chart look if we force the y-axis back to starting at zero?

# Line chart with the y-axis forced to start at zero
ggplot(subset(elections.2, vote == "pop" & year > 2004), aes(year, value)) +
  geom_line(aes(color = variable)) + geom_point() +
  scale_x_continuous(breaks = seq(1980, 2016, 4)) +
  scale_color_manual(values = c("blue", "red")) +
  scale_y_continuous(labels = comma, limits = c(0, 72000000)) +
  theme_bw() + labs(title = "US Election Popular vote 2008 - 2016")

ggplot2 line chart with forced y-axis scale

I think the blue decline is still apparent. The main issue with this one for me is not any attempt to mislead, just a lot of wasted white space.
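
As a possible alternative to hard-coding both limits, `expand_limits(y = 0)` asks ggplot2 only to make sure zero is included and leaves the upper end to be worked out from the data. A short sketch, once more on an illustrative `votes` data frame that is not part of the original code:

```r
library(ggplot2)
library(scales)

# Illustrative data: popular vote 2008-2016, already in long format
votes <- data.frame(year     = rep(c(2008, 2012, 2016), times = 2),
                    variable = rep(c("Rep", "Dem"), each = 3),
                    value    = c(59948323, 60933504, 61201031,
                                 69498516, 65915794, 62523126))

# expand_limits() stretches the y scale to include 0 without fixing the top
p <- ggplot(votes, aes(year, value)) +
  geom_line(aes(color = variable)) + geom_point() +
  scale_color_manual(values = c("blue", "red")) +
  scale_y_continuous(labels = comma) +
  expand_limits(y = 0) +
  theme_bw()

# Range the y scale was trained on, before any padding is added
y_range <- layer_scales(p)$y$range$range
print(y_range)
```

The wasted white space is the same either way; the difference is that the upper limit no longer needs updating by hand when the data change.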

2. What would Tufte say?

A common response is to ask what Tufte would say. You can read what he says here. In that particular quote he says the y-axis should reflect the range of the data, and he has nothing specific to say regarding bar charts. His last sentence is telling:

Instead, for context, show more data horizontally!

# Line chart of the popular vote for every election from 1980 to 2016
ggplot(subset(elections.2, vote == "pop"), aes(year, value)) +
  geom_line(aes(color = variable)) + geom_point() +
  scale_x_continuous(breaks = seq(1980, 2016, 4)) +
  scale_color_manual(values = c("blue", "red")) +
  scale_y_continuous(labels = comma) + theme_bw() +
  labs(title = "US Election Popular vote 1980 - 2016")

ggplot2 line chart starting from year 1980

And indeed, it is interesting to add more elections starting from the year 1980.

3. In summary

It is interesting to see people react differently to the same chart. A cynic might say “often in a manner that reflects their beliefs.” However, the current collective wisdom seems to be:

  • it’s OK to start your y-axis at a non-zero value
  • unless it’s a bar/column chart
  • listen to Tufte

2 thoughts on “The y-axis: to zero or not to zero”

  1. samsamg

    Thanks for the post and for Tufte’s way of thinking. Although there is no statistical test here, I just want to add a connection to hypothesis testing. Comparing averages ignores the zero (or any reference value). Human interpretation needs a stated reference, or may question whether a reference is needed at all, as you did here. Best.
