Thursday, January 31, 2013

Why Inference Matters

This is a post inspired by a recent exchange on Twitter between myself and friend, colleague, and TNR writer Nate Cohn (@electionate). The initial exchange was pretty superfluous, but it got me thinking about a broader question: how should writers about politics approach quantitative data? That spiraled into this rather long-winded post. Indeed, the original conversation gets lost, but I think a more important broader point gets made. As political journalists and analysts incorporate more data into their writing, they would benefit from thinking in terms of statistical inference. It's just not possible to meaningfully talk about "real-world" data - electoral, or otherwise - without also talking about probability.

First, a quick summary of the discussion that led me to write this: Gary King tweeted a link to this post by Andrew C. Thomas covering a paper written by himself, King, Andrew Gelman and Jonathan Katz that examined "bias" in the Electoral College system. The authors found that despite the winner-take-all allocation of votes, there was no statistically significant evidence of bias in recent election years. Essentially, they modeled the vote share of each candidate at the district level and estimated the Electoral College results for a hypothetical election where the popular vote was tied. Andrew made a graph of their "bias" estimates - all of the 95% confidence intervals in recent years contain 0.

Nate shot back a quick tweet response asking why the results suggest no bias in 2012 when the electorally "decisive" states were 1.6 percentage points more Democratic than the national popular vote.

This led to a longer and rather interesting discussion between Nate and myself on how to evaluate Electoral College bias (for what it's worth, my twitter arguments were pretty bad). Nate made some good points about differences in Obama's margin-of-victory in the states needed to win the Electoral College versus his overall popular vote margin-of-victory. He notes that Obama won the states needed to reach 272 by a minimum of 5.4% while his popular vote margin was only 3.9%. If the existence of the Electoral College exaggerated Obama's vote advantage relative to a popular vote system, then it's possible to conclude that the college was "biased" in Obama's favor.

Nate has since turned this into part of an article, arguing:
The easiest way to judge the Democrats’ newfound Electoral College advantage is by comparing individual states to the popular vote. Last November, Obama won states worth 285 electoral votes by a larger margin than the country as a whole, suggesting that Obama would have had the advantage if the popular vote were tied. But the current system appears even more troubling for Republicans when you consider the wide gap between the “tipping point” state and the national popular vote. Obama’s 270th electoral vote came from Colorado, which voted for the president by nearly 5.4 points—almost 1.8 points more than his popular vote victory. Simply put, the GOP is probably better off trying to win the national popular vote the state contests in Pennsylvania or Colorado, since the national popular vote was much closer in 2012 than the vote in those tipping point states. Obama enjoyed a similar Electoral College advantage in 2008. 
Here the wording is slightly different - the term is Electoral College "advantage" rather than Electoral College "bias" - but the argument is essentially the same: the current electoral geography disproportionately benefits the Democrats, so a shift to a popular vote system would help the Republicans.

The argument is interesting (and could in fact be true), but the evidence presented doesn't really say anything. The big problem is that it ignores probability. You can't make a credible argument about the "nature" of an election cycle just by comparing election results data points without a discussion of uncertainty in the data. The results of an election are not the election itself - they are data. The data are what we use to make inferences about certain aspects of one election (or many elections). This distinction is essential. We don't observe whether the Democrats have an Electoral College advantage in 2012 or whether Colorado was more favorable to the Democrats than Virginia or Ohio. We can't observe these things because all we have is the output - the results.

Studying elections is like standing outside an automotive plant and watching cars roll off the assembly line.  We never get to see how the car is put together; we only see the final product.

If we knew exactly what went into the final vote tally - that is, if we were the plant managers and not just passive observers - then we wouldn't need statistics. But reality is complicated and we're not omniscient. This is what makes statistical inference so valuable - it lets us quantify our uncertainty about complex processes.

Just for curiosity, I decided to grab the 2012 election data to take a closer look at Nate's argument. I estimated the number of electoral votes that each candidate would receive for a given share of the two-party popular vote. I altered the state-level vote shares assuming a uniform swing - each state was shifted by the difference between the "true" and "hypothesized" popular vote - and computed the corresponding "electoral vote" that the candidate would receive. For time/simplicity, I ignored the Maine/Nebraska allocation rules (including them doesn't affect the conclusions).

To clarify, I measure the Democratic candidate's share of the two-party vote as opposed to share of the total vote. That is, %Dem/(%Dem + %Rep). This measure is common in the political science literature on elections and allows us to make better comparisons by accounting for uneven variations in the third-party vote. 
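The two-party share and the uniform-swing calculation described above can be sketched in a few lines of Python. This is only a minimal illustration, not the code behind the graphs: the function names and the three-state map are hypothetical, and the numbers are invented rather than real election returns.

```python
def two_party_share(dem_votes, rep_votes):
    """Democratic share of the two-party vote: Dem / (Dem + Rep)."""
    return dem_votes / (dem_votes + rep_votes)


def electoral_votes_under_swing(states, observed_national, hypothetical_national):
    """Electoral votes won by the Democrat after a uniform swing.

    states maps a state name to (Democratic two-party share, electoral votes).
    Every state's share is shifted by the same amount: the gap between the
    hypothesized and the observed national two-party share.
    """
    swing = hypothetical_national - observed_national
    return sum(ev for share, ev in states.values() if share + swing > 0.5)


# Hypothetical three-state map, NOT real election returns.
toy_states = {
    "A": (0.56, 10),
    "B": (0.52, 6),
    "C": (0.47, 8),
}
toy_national = 0.51  # hypothetical observed national two-party share

for hypothetical in (0.51, 0.50, 0.48):
    ev = electoral_votes_under_swing(toy_states, toy_national, hypothetical)
    print(f"national share {hypothetical:.2f} -> {ev} electoral votes")
```

Sweeping the hypothesized national share over a fine grid and plotting the result gives the kind of step-shaped curve shown in the graphs.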

Here's what the 2012 election looks like for President Obama. The dotted vertical line represents the observed vote total, the blue line marks the "tie" point and the horizontal red line represents 270 electoral votes.



This is consistent with Nate's argument. There is a space of potential popular vote outcomes where Obama loses the two-party vote but still wins the Electoral College - the "272" firewall.

How about 2008?

Same thing - a loss of the popular vote by about 1 percentage point would still have meant an Electoral College victory for Barack Obama - again Colorado is key here. Indeed, the advantage appears to be even larger here than in 2012.

2004?

Here the advantage is less perceptible - barely a fifth of a percentage point. Compared to 2008 and 2012, this would suggest a "trend" towards greater pro-Democratic bias in the Electoral College.

2000?


Bush obviously has the advantage here (and as we know, he lost the popular vote).

How about jumping to 1992?


Again the elder Bush has a slight advantage.

I could keep going. One could infer a story of a growing Democratic advantage in the Electoral College from these five data points - it's there in the most recent two cycles and it wasn't there before. But in the end, these graphs are not at all meaningful.

The problem with what I've done above is that it's at best a convoluted summary of the data - the implied inferences about "advantage" are absurd absent a discussion of probability. Consider the assumptions behind the argument. It implicitly assumes that if we re-ran the 2012 election from the beginning, we would get the exact same results. It follows that if we re-ran the election and posited that Pres. Obama received only 50% or 49% of the two-party vote, then the results in each state would shift exactly by 1-2%.

Of course this is crazy. We easily observe that from year to year, changes in two-party vote shares are not constant across all states (this is why the uniform swing assumption is a statistical modeling assumption and not a statement of fact). Without probability our counterfactual observations of the electoral vote are nonsense, since we implicitly assume that there is zero variation in the vote share. This is certainly not the case. If we could hypothetically re-run the election, we would not expect the vote share to be exactly the same. There are a host of elements, from the campaign to the weather on election day, that could shift the results. We would expect them to be close, but we are inherently uncertain about any counterfactual scenario.

However, we cannot reason about the 2012 election without considering counterfactuals - what would the result have been had A happened instead of B. The problem is that we only get to observe the election once - we have to estimate the counterfactual, and all estimates are uncertain.

This is where thinking in terms of inference becomes useful. Political analysts want to move beyond summarizing the data (election returns) and make some meaningful explanatory argument about the election itself. It's difficult to do this in a quantitative sense without accounting for uncertainty in the counterfactuals.

Here is one way of evaluating Nate's argument using the same election result data that incorporates probability. It's very much constrained by the data, but that's kind of the point - looking just at a couple of election results doesn't tell us much.

The core question is: could the vote share gap that Nate identifies be due to chance?

Let's imagine that the Democratic party candidate's two-party vote share in any given state is modeled by the following process

$$v_i = \mu + \delta_i + \epsilon_i$$

$v_i$ represents the two-party vote share in state $i$ received by the Democratic party candidate. $\mu$ is the 'fixed' "national" component of the vote - the component of the vote accounted for by national-level factors like economic growth. It does not vary from state-to-state. $\delta_i$ is the 'fixed' state component of the vote share - it reflects the static attributes of the state like demographics. For a state like California, it would be positive. For Wyoming, negative. This is the attribute that we're interested in. In particular, can we say with confidence from the evidence that Colorado, Obama's "firewall" state, structurally favors the Democrats more than a state like Virginia where the President's electoral performance roughly matched the popular vote? Or is the gap that we observe attributable to random variation?

That's where $\epsilon_i$ comes in. $\epsilon_i$ represents the component of error that's unique to the election year. That is, if we were to re-run the election, the differences in the observed $v_i$ would be a function of $\epsilon_i$ taking on a different value. This represents the "idiosyncrasies" of the election - weather, turnout, etc. $\epsilon_i$ is what introduces probability into the analysis.

For the sake of the model, we assume that $\epsilon_i$ is distributed normally with mean $0$ and variance $\sigma_i^2$. I'll allow each $\sigma_i$ to be different - that is, some states might exhibit more random variation than others.
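To make "re-running the election" concrete, here is a minimal Python sketch that draws repeated vote shares for one state from the model above. The function name and all three parameter values are hypothetical placeholders, not estimates of anything.

```python
import random


def rerun_election(mu, delta, sigma, n_reruns, seed=0):
    """Draw n_reruns hypothetical vote shares v = mu + delta + epsilon,
    where epsilon is normal with mean 0 and standard deviation sigma."""
    rng = random.Random(seed)
    return [mu + delta + rng.gauss(0.0, sigma) for _ in range(n_reruns)]


# Hypothetical values: national component, state component, error spread.
shares = rerun_election(mu=0.52, delta=0.0, sigma=0.01, n_reruns=5)
for share in shares:
    print(f"{share:.4f}")
```

Each printed value is one imagined replay of the same election in the same state: the structural pieces stay fixed and only the error term changes.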

Normally the analyst would then estimate the parameters of the model. However, I have neither the data nor the time to gather it (if you're interested in data, see the Gelman et al. paper). My goal with this toy example is to show that the observed difference in the 2012 vote share between the "270th" state, Colorado, and a comparable state like Virginia, which has mirrored the popular vote share, might be reasonably attributed to random fluctuations within each state. To do so, I generate rough estimates of some parameters while making sensible assumptions about others.

As an aside, I could also have done a comparison between Colorado and a national popular vote statistic. However, since I'm only working with vote shares, I would have to make even more assumptions about the distribution of voters across states in order to correctly weight the national vote. Additionally, I would have to make even more assumptions about the data generating process in the other 48 states + D.C. This approach is a bit easier and demonstrates the same point.

I assume that $\mu$ is equal to President Obama's share of the two-party vote in 2012. The estimate of $\mu$ itself is irrelevant, since we're interested in differences in $\delta_i$ between Colorado and a comparable state (the $\mu$s cancel). But for the sake of the model it's helpful since it allows me to use $0$ as a reasonable "null" estimate of $\delta_i$.

The question is whether Colorado's $\delta$ is statistically different from that of Virginia. If it is, then it might make sense to talk about an electoral "advantage" in 2012. Obviously this assumes that the $\delta$ values for all of the other states in the 272 elector coalition are at least as large as Colorado's, which I'll grant is true for the sake of the model (it only works against me). Colorado is the "weakest link."

The way to answer this question is to test the null hypothesis of no difference. Suppose that $\delta_i$ equals $0$ for Virginia and Colorado - that there is no substantive difference between Virginia (the baseline state) and Colorado (the "firewall" state). What's the probability that we would observe a gap of $.78\%$ between their state-level electoral results?
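A short derivation makes the null hypothesis concrete. It adds one assumption that is not stated explicitly above, namely that the two error terms are independent of each other. With $\delta_{CO} = \delta_{VA} = 0$, the national component cancels in the difference:

$$v_{CO} - v_{VA} = (\mu + \epsilon_{CO}) - (\mu + \epsilon_{VA}) = \epsilon_{CO} - \epsilon_{VA}$$

Since each error term is normal with mean $0$, and assuming independence, the difference is also normal:

$$v_{CO} - v_{VA} \sim N(0, \sigma_{CO}^2 + \sigma_{VA}^2)$$

So the probability in question is the chance that a mean-zero normal draw with this variance lands at least as far from zero as the observed gap.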

Estimating the probability requires making a reasonable estimate for the variance of $\epsilon_i$. How much variation in the electoral results can we attribute to random error and how much can we attribute to substantive features of the electoral map?

Unfortunately, we can't re-play the 2012 election, so we have to look at history to calculate variance - in this case the 2008 election. As Nate notes in the post, the party coalitions are relatively stable and demographic changes/realignments typically take many election cycles to complete. As such, we may be able to assume that the state-level "structural" effects are the same in 2012 as they are in 2008. That is, the year-to-year variation in the gap between the popular vote and the state-level vote can be used to estimate the variance of the error terms $\epsilon_{CO}$ and $\epsilon_{VA}$.

But it's hard to get a good estimate of a variance with only two data points. Four is slightly better, but to do so we have to assume that the error terms of Colorado and Virginia's vote shares are identically distributed - that  $\sigma_{CO}^2 = \sigma_{VA}^2$. That's not to say that the observed "error" values are the same, just that they're drawn from the same distribution. Is this a reasonable assumption? The residuals don't appear to be dramatically different, but we just don't know - again, another statistical trade-off. Ultimately the point of this exercise is to demonstrate how one-election-cycle observations can be reasonably explained by random variation, so the aim is plausibility versus absolute precision.

So I estimate the pooled error variance (shared by both states) as the sample variance of the 2008 and 2012 residuals for Colorado and Virginia, where each residual is the state's Democratic two-party share subtracted from the two-party share of the popular vote. In theory, we could add more electoral cycles to the estimate, but the assumption that $\sigma_i$ does not change from year to year becomes weaker. If I restrict myself to just looking at electoral results, then I have to accept data limitations. This is a general problem with any statistical study of elections - if we look only at percentage aggregates at high levels, there just isn't a lot of data to work with.
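One possible reading of this pooling step, sketched in Python: collect the four residuals (two states, two elections) and take their sample variance. The residual values below are hypothetical placeholders, not the actual 2008 and 2012 figures, and treating all four as draws from a single distribution is exactly the equal-variance assumption discussed above.

```python
import statistics

# Hypothetical residuals: national two-party share minus state two-party
# share, in vote-share units (.01 = 1%). NOT real election data.
residuals = {
    ("CO", 2008): 0.008,
    ("CO", 2012): 0.002,
    ("VA", 2008): -0.004,
    ("VA", 2012): 0.006,
}

# Sample variance (n - 1 denominator) of the four pooled residuals.
pooled_variance = statistics.variance(list(residuals.values()))
pooled_sd = pooled_variance ** 0.5

print(f"pooled variance: {pooled_variance:.2e}")
print(f"pooled standard deviation: {pooled_sd:.4f}")
```

With only four observations the estimate is very noisy, which is part of the point being made here.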

The next step is simulation. Suppose we repeated the 2012 election a large number of times on the basis of this model. How often would we see a gap of at least .78% between the vote shares in Colorado and Virginia? The histogram below plots the simulated distribution of that difference. The x-axis gives the percentage differences (.01 = 1%). The red dotted line represents a difference of .78 percentage points. Also relevant is the blue dotted line, which represents a difference of -.78 percentage points (Virginia's vote share is greater than Colorado's).



What's the probability of seeing a difference as extreme as the gap seen in 2012? Turns out, it's about 40% - certainly not incredibly high, but also not unlikely.

Suppose we only care about positive differences, that is, we are absolutely sure that it's impossible for Virginia to be structurally more favorable to the Democrats than Colorado. There is either no difference or $\delta_{CO} > \delta_{VA}$. What's the probability of seeing a difference equal to or greater than .78 percentage points? Well, it's still roughly 20%. Statisticians (as a norm, not as a hard rule) tend to use 5% as the cut-off for rejecting the "null hypothesis" and accepting that there is likely some underlying difference in the parameters not attributable to random variation - in this case, we fail to reject.
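The simulation and both tail probabilities can be sketched as follows. This is a rough illustration under stated assumptions rather than the code behind the histogram: the function names are hypothetical, the error terms are assumed independent, and the value of `sigma` is a placeholder, not the variance estimate used in this post.

```python
import math
import random


def simulate_gap_probabilities(sigma, gap, n_sims=100_000, seed=0):
    """Simulate the Colorado-minus-Virginia difference under the null
    (equal structural components, independent normal errors with a
    common standard deviation sigma) and return the share of
    simulations with |diff| >= gap and with diff >= gap."""
    rng = random.Random(seed)
    two_sided = 0
    one_sided = 0
    for _ in range(n_sims):
        diff = rng.gauss(0.0, sigma) - rng.gauss(0.0, sigma)
        if abs(diff) >= gap:
            two_sided += 1
        if diff >= gap:
            one_sided += 1
    return two_sided / n_sims, one_sided / n_sims


def exact_two_sided(sigma, gap):
    """Closed-form check: diff ~ N(0, 2 * sigma**2), so
    P(|diff| >= gap) = erfc(gap / (2 * sigma))."""
    return math.erfc(gap / (2.0 * sigma))


sigma = 0.0066  # hypothetical error standard deviation (vote-share units)
gap = 0.0078    # the .78 percentage point gap

two_sided, one_sided = simulate_gap_probabilities(sigma, gap)
print(f"two-sided: {two_sided:.3f} (exact {exact_two_sided(sigma, gap):.3f})")
print(f"one-sided: {one_sided:.3f}")
```

Shrinking `sigma` makes both probabilities fall, which is how a smaller error variance could make the same gap look surprising.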

If the error variance were smaller, that is, if the amount of year-to-year variation that can be ascribed to randomness were lowered, it's possible that a gap of .78% would be surprising. This would lead us to conclude that there indeed may be a structural difference - that Colorado is more advantageous to the Democratic candidate relative to the baseline. The details of the model parameters are really not the point of this exercise. Any estimate of the variance from the sparse data used here is not very reliable. The fact that we are using two elections where the same candidate stood for office suggests non-independence and error correlation which would likely downwardly bias our variance estimates. We could look at the gap in 2008 - two observations are better than one, but in that case, why not include the whole of electoral history - the data exist. Moreover, to make our counterfactual predictions more precise, we need some covariates - independent variables that are good predictors of vote share. In short, we need real statistical models.

This is what the Gelman et al. paper does. While a quick comparison of the data points hints at a structural bias in 2012, a more in-depth modeling approach suggests that difference is not statistically significant. That is, it's likely that any apparent structural advantage in the Electoral College in 2012 is nothing more than noise.

The problem with Nate's argument, ultimately, is that it posits a counterfactual (Romney winning the popular vote by a slight margin) without describing the uncertainty around that counterfactual. Without talking about uncertainty, it is impossible to discern whether the observed phenomenon is a result of chance or something substantively interesting.

It does appear that I'm spending a lot of time dissecting a rather trivial point, which is true. My goal in this post was not to settle whether the "Electoral College advantage" is real - the 2012/2008 election results alone aren't enough data to make that determination. Rather, I wanted to walk through a simple example of inference to demonstrate why it's important to pay attention to probability and randomness when talking about data - to think about data in a statistical manner.

We cannot, just by looking at a set of data points, immediately explain which differences are due to structural factors and which ones are due to randomness. No matter what, we make assumptions to draw inferences about the underlying structure. Ignoring randomness doesn't make it go away - it just means making extremely tenuous assumptions about the data (namely, zero variance).

To summarize, if you get anything out of this post, it should be these three points:
1) The data are not the parameters (the things we're interested in).
2) To infer the parameters from the data, you need to think about probability.
3) Statistical models are a helpful way of understanding the uncertainty inherent in the data.

I'm not suggesting that political writers need to become statisticians. Journalism isn't academia. It doesn't have the luxury of time spent carefully analyzing the data. I'm not expecting to see regressions in every blog post I read, nor do I want to. The Atlantic, WaPo, TNR, NYT are neither Political Analysis nor the Journal of the Royal Statistical Society.

But conversely, if there is increasingly a demand for "quantitative" or "numbers-oriented" analysis of political events, then writers should make some effort to use those numbers correctly. At the very least, it's valuable to think of any empirical claim - whether retrospective or predictive - in terms of inference. We have things we know and we want to make arguments about things we don't know or cannot observe. At its core, argumentation is about reasoning from counterfactuals and counterfactuals always carry uncertainty with them.

Even if the goal is just to describe one election cycle, one cannot get very far just comparing electoral returns at various levels of geographic aggregation. Again, election returns are just data - using them to make substantive statements, even if it's only about a single election, relies on implicit inferences which typically ignore the role of uncertainty. And if we want to go beyond just describing an election and identifying trends or features of the electoral landscape, quantitative "inference" without probability is just dart-throwing.
