Thursday, January 31, 2013

Why Inference Matters

This is a post inspired by a recent exchange on Twitter between myself and friend, colleague, and TNR writer Nate Cohn (@electionate). The initial exchange was pretty superficial, but it got me thinking about a broader question: how should writers about politics approach quantitative data? That spiraled into this rather long-winded post. Indeed, the original conversation gets lost along the way, but I think a more important broader point gets made. As political journalists and analysts incorporate more data into their writing, they would benefit from thinking in terms of statistical inference. It's just not possible to meaningfully talk about "real-world" data - electoral or otherwise - without also talking about probability.

First, a quick summary of the discussion that led me to write this: Gary King tweeted a link to this post by Andrew C. Thomas covering a paper he wrote with King, Andrew Gelman, and Jonathan Katz that examined "bias" in the Electoral College system. The authors found that, despite the winner-take-all allocation of votes, there was no statistically significant evidence of bias in recent election years. Essentially, they modeled the vote share of each candidate at the district level and estimated the Electoral College results for a hypothetical election where the popular vote was tied. The post includes a graph of their "bias" estimates - all of the 95% confidence intervals in recent years contain 0.

Nate shot back a quick tweet asking why the results suggest no bias in 2012 when the electorally "decisive" states were 1.6 percentage points more Democratic than the national popular vote.

This led to a longer and rather interesting discussion between Nate and myself on how to evaluate Electoral College bias (for what it's worth, my Twitter arguments were pretty bad). Nate made some good points about the difference between Obama's margin of victory in the states needed to win the Electoral College and his overall popular vote margin. He notes that Obama won the states needed to reach 272 electoral votes by a minimum of 5.4%, while his popular vote margin was only 3.9%. If the existence of the Electoral College exaggerated Obama's vote advantage relative to a popular vote system, then it's possible to conclude that the college was "biased" in Obama's favor.

Nate has since turned this into part of an article, arguing:
The easiest way to judge the Democrats’ newfound Electoral College advantage is by comparing individual states to the popular vote. Last November, Obama won states worth 285 electoral votes by a larger margin than the country as a whole, suggesting that Obama would have had the advantage if the popular vote were tied. But the current system appears even more troubling for Republicans when you consider the wide gap between the “tipping point” state and the national popular vote. Obama’s 270th electoral vote came from Colorado, which voted for the president by nearly 5.4 points—almost 1.8 points more than his popular vote victory. Simply put, the GOP is probably better off trying to win the national popular vote than the state contests in Pennsylvania or Colorado, since the national popular vote was much closer in 2012 than the vote in those tipping point states. Obama enjoyed a similar Electoral College advantage in 2008. 
Here the wording is slightly different - the term is Electoral College "advantage" rather than Electoral College "bias" - but the argument is essentially the same: the current electoral-geographic landscape is such that the Democrats benefit disproportionately, so a shift to a popular vote system would help the Republicans.

The argument is interesting (and could in fact be true), but the evidence presented doesn't really tell us anything. The big problem is that it ignores probability. You can't make a credible argument about the "nature" of an election cycle just by comparing election-result data points without a discussion of uncertainty in the data. The results of an election are not the election itself - they are data. The data are what we use to make inferences about certain aspects of one election (or many elections). This distinction is essential. We don't observe whether the Democrats had an Electoral College advantage in 2012 or whether Colorado was more favorable to the Democrats than Virginia or Ohio. We can't observe these things because all we have is the output - the results.

Studying elections is like standing outside an automotive plant and watching cars roll off the assembly line.  We never get to see how the car is put together; we only see the final product.

If we knew exactly what went into the final vote tally - that is, if we were the plant managers and not just passive observers - then we wouldn't need statistics. But reality is complicated and we're not omniscient. This is what makes statistical inference so valuable - it lets us quantify our uncertainty about complex processes.

Just out of curiosity, I decided to grab the 2012 election data and take a closer look at Nate's argument. I estimated the number of electoral votes that each candidate would receive for a given share of the two-party popular vote. I altered the state-level vote shares assuming a uniform swing - each state was shifted by the difference between the "true" and "hypothesized" popular vote - and computed the corresponding number of electoral votes that the candidate would receive. For simplicity (and time), I ignored the Maine/Nebraska allocation rules (including them doesn't affect the conclusions).

To clarify, I measure the Democratic candidate's share of the two-party vote as opposed to share of the total vote. That is, %Dem/(%Dem + %Rep). This measure is common in the political science literature on elections and allows us to make better comparisons by accounting for uneven variations in the third-party vote. 
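For concreteness, here is a minimal sketch of the uniform-swing calculation in Python. This is only an illustration of the procedure: the state names, two-party shares, and electoral-vote counts below are placeholders to be swapped for the actual returns, and the Maine/Nebraska split is ignored, as noted above.

```python
import numpy as np

# Placeholder inputs -- replace with the actual returns for the year of interest.
dem_share = {"CO": 0.528, "VA": 0.520, "OH": 0.515}  # Dem two-party share (hypothetical)
ev        = {"CO": 9,     "VA": 13,    "OH": 18}     # electoral votes per state
national_dem = 0.520                                 # observed national two-party share

def electoral_votes(hypothetical_national):
    """Democratic electoral votes under a uniform swing.

    Every state's two-party share is shifted by the difference between the
    hypothesized and the observed national share; a state's electors go to
    the Democrat whenever the shifted share exceeds 50%.
    """
    swing = hypothetical_national - national_dem
    return sum(votes for state, votes in ev.items() if dem_share[state] + swing > 0.5)

# Trace out the curves plotted below: electoral votes as a function of the national vote.
grid = np.linspace(0.45, 0.55, 201)
curve = [electoral_votes(x) for x in grid]
```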

Here's what the 2012 election looks like for President Obama. The dotted vertical line represents the observed vote total, the blue line marks the "tie" point and the horizontal red line represents 270 electoral votes.



This is consistent with Nate's argument. There is a space of potential popular vote outcomes where Obama loses the two-party vote but still wins the Electoral College - the "272" firewall.

How about 2008?

Same thing - a loss of about one percentage point in the popular vote would still have meant an Electoral College victory for Barack Obama, and again Colorado is the key state. Indeed, the advantage appears to be even larger here than in 2012.

2004?

Here the advantage is less perceptible - barely a fifth of a percentage point. Compared to 2008 and 2012, this would suggest a "trend" toward greater pro-Democratic bias in the Electoral College.

2000?


Bush obviously has the advantage here (and, as we know, he lost the popular vote).

How about jumping to 1992?


Again the elder Bush has a slight advantage.

I could keep going. One could infer a story of a growing Democratic advantage in the Electoral College from these five data points - it's there in the two most recent cycles and it wasn't there before. But in the end, these graphs are not at all meaningful.

The problem with what I've done above is that it's at best a convoluted summary of the data - the implied inferences about "advantage" are absurd absent a discussion of probability. Consider the assumptions behind the argument. It implicitly assumes that if we re-ran the 2012 election from the beginning, we would get exactly the same results. It follows that if we re-ran the election and posited that President Obama received only 50% or 49% of the two-party vote, then the results in each state would shift by exactly 1-2 percentage points.

Of course this is crazy. We can easily observe that, from year to year, changes in two-party vote shares are not constant across all states (this is why the uniform swing is a statistical modeling assumption and not a statement of fact). Without probability, our counterfactual observations of the electoral vote are nonsense, since we implicitly assume that there is zero variation in the vote share. This is certainly not the case. If we could hypothetically re-run the election, we would not expect the vote shares to be exactly the same. There are a host of elements, from the campaign to the weather on election day, that could shift the results. We would expect them to be close, but we are inherently uncertain about any counterfactual scenario.

However, we cannot reason about the 2012 election without considering counterfactuals - what would the result have been had A happened instead of B? The problem is that we only get to observe the election once - we have to estimate the counterfactual, and all estimates are uncertain.

This is where thinking in terms of inference becomes useful. Political analysts want to move beyond summarizing the data (election returns) and make some meaningful explanatory argument about the election itself. It's difficult to do this in a quantitative sense without accounting for uncertainty in the counterfactuals.

Here is one way of evaluating Nate's argument that uses the same election-result data but incorporates probability. It's very much constrained by the data, but that's kind of the point - looking just at a couple of election results doesn't tell us much.

The core question is: could the vote share gap that Nate identifies be due to chance?

Let's imagine that the Democratic candidate's two-party vote share in any given state is generated by the following process:

$$v_i = \mu + \delta_i + \epsilon_i$$

$v_i$ represents the two-party vote share received by the Democratic candidate in state $i$. $\mu$ is the 'fixed' "national" component of the vote - the part accounted for by national-level factors like economic growth. It does not vary from state to state. $\delta_i$ is the 'fixed' state-level component of the vote share - it reflects static attributes of the state, like demographics. For a state like California it would be positive; for Wyoming, negative. This is the component we're interested in. In particular, can we say with confidence from the evidence that Colorado, Obama's "firewall" state, structurally favors the Democrats more than a state like Virginia, where the President's electoral performance roughly matched the popular vote? Or is the gap that we observe attributable to random variation?

That's where $\epsilon_i$ comes in. $\epsilon_i$ is the random component of the vote that is unique to a particular election. That is, if we were to re-run the election, the differences in the observed $v_i$ would be a function of $\epsilon_i$ taking on a different value. It represents the "idiosyncrasies" of the election - weather, turnout, and so on. $\epsilon_i$ is what introduces probability into the analysis.

For the sake of the model, we assume that $\epsilon_i$ is distributed normally with mean $0$ and variance $\sigma_i^2$. I'll allow each $\sigma_i$ to be different - that is, some states might exhibit more random variation than others.

Normally the analyst would then estimate the parameters of the model. However, I have neither the data nor the time to gather them (if you're interested in data, see the Gelman et al. paper). My goal with this toy example is to show that the observed difference in the 2012 vote share between the "270th" state, Colorado, and a comparable state like Virginia, which has mirrored the popular vote share, might reasonably be attributed to random fluctuations within each state. To do so, I generate rough estimates of some parameters while making sensible assumptions about others.

As an aside, I could also have compared Colorado to a national popular vote statistic. However, since I'm only working with vote shares, I would have to make even more assumptions about the distribution of voters across states in order to correctly weight the national vote. I would also have to make further assumptions about the data-generating process in the other 48 states + D.C. The state-to-state comparison is a bit easier and demonstrates the same point.

I assume that $\mu$ is equal to President Obama's share of the two-party vote in 2012. The particular value of $\mu$ is irrelevant, since we're interested in differences in $\delta_i$ between Colorado and a comparable state (the $\mu$s cancel). But it's a convenient choice, since it lets me treat $0$ as the "null" expectation of each state's deviation from the national popular vote, $v_i - \mu$.

The question is whether Colorado's $\delta$ is statistically distinguishable from Virginia's. If it is, then it might make sense to talk about an electoral "advantage" in 2012. Obviously this assumes that the $\delta$ values for all of the other states in the 272-elector coalition are at least as large as Colorado's, which I'll grant is true for the sake of the model (it only works against me). Colorado is the "weakest link."

The way to answer this question is to test the null hypothesis of no difference. Suppose that $\delta_i$ equals $0$ for both Virginia and Colorado - that there is no structural difference between Virginia (the baseline state) and Colorado (the "firewall" state). What's the probability that we would observe a gap of $.78\%$ between their state-level electoral results?

Estimating that probability requires a reasonable estimate of the variance of $\epsilon_i$: how much of the variation in electoral results can we attribute to random error, and how much to substantive features of the electoral map?

Unfortunately, we can't re-play the 2012 election, so we have to look at history to estimate the variance - in this case, the 2008 election. As Nate notes in his post, the party coalitions are relatively stable, and demographic changes/realignments typically take many election cycles to complete. As such, we may be able to assume that the state-level "structural" effects were the same in 2012 as in 2008. That is, the year-to-year variation in the gap between a state's vote share and the national popular vote can be used to estimate the variance of the error terms $\epsilon_{CO}$ and $\epsilon_{VA}$.

But it's hard to get a good estimate of a variance with only two data points. Four is slightly better, but to get there we have to assume that the error terms of Colorado's and Virginia's vote shares are identically distributed - that $\sigma_{CO}^2 = \sigma_{VA}^2$. That's not to say that the observed "error" values are the same, just that they're drawn from the same distribution. Is this a reasonable assumption? The residuals don't appear to be dramatically different, but we just don't know - again, another statistical trade-off. Ultimately the point of this exercise is to demonstrate that one-election-cycle observations can be reasonably explained by random variation, so the aim is plausibility rather than absolute precision.

So I estimate the pooled error variance as the sample variance, within each state, of the difference between the state's Democratic two-party share and the national two-party share in 2008 and 2012, pooled across the two states. In theory, we could add more electoral cycles to the estimate, but the assumption that $\sigma_i$ does not change from year to year becomes weaker. If I restrict myself to just looking at electoral results, then I have to accept the data limitations. This is a general problem with any statistical study of elections - if we look only at percentage aggregates at high levels, there just isn't a lot of data to work with.
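As a sketch of that calculation (again with placeholder vote shares standing in for the actual 2008 and 2012 returns), the pooled estimate is just the average of the two states' one-degree-of-freedom sample variances of the state-minus-national deviation:

```python
import numpy as np

# Placeholder Democratic two-party shares -- substitute the actual returns.
#                    2008    2012
state_share = {"CO": (0.545, 0.528),   # hypothetical values
               "VA": (0.533, 0.520)}
national = (0.537, 0.520)              # national two-party share (hypothetical)

# Each state's deviation from the national popular vote, by year.
dev = {s: np.array(v) - np.array(national) for s, v in state_share.items()}

# Pool the two one-degree-of-freedom sample variances; this is where the
# assumption sigma_CO^2 == sigma_VA^2 does its work.
pooled_var = np.mean([np.var(d, ddof=1) for d in dev.values()])
pooled_sd = np.sqrt(pooled_var)
```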

The next step is simulation. Suppose we repeated the 2012 election a large number of times on the basis of this model. How often would we see a gap of at least .78% between the vote shares in Colorado and Virginia? The histogram below plots the simulated distribution of that difference. The x-axis gives the difference as a proportion (.01 = 1%). The red dotted line represents a difference of +.78 percentage points. Also relevant is the blue dotted line, which represents a difference of -.78 percentage points (Virginia's vote share greater than Colorado's).



What's the probability of seeing a difference as extreme as the gap observed in 2012? It turns out to be about 40% - certainly not overwhelming, but far from unlikely.

Suppose we only care about positive differences - that is, we are absolutely sure that it's impossible for Virginia to be structurally more favorable to the Democrats than Colorado: either there is no difference or $\delta_{CO} > \delta_{VA}$. What's the probability of seeing a difference equal to or greater than .78 percentage points? It's still roughly 20%. Statisticians (as a norm, not a hard rule) tend to use 5% as the cut-off for rejecting the "null hypothesis" and accepting that there is likely some underlying difference in the parameters not attributable to random variation - in this case, we fail to reject.
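For readers who want to trace the logic, here is a bare-bones version of the simulation. The pooled standard deviation is a placeholder to be filled in from the step above, and the .78-point gap is the observed 2012 Colorado-Virginia difference discussed earlier.

```python
import numpy as np

rng = np.random.default_rng(2013)

pooled_sd = 0.01       # placeholder: plug in the pooled estimate from above
observed_gap = 0.0078  # Colorado minus Virginia two-party share in 2012 (~.78 points)

n_sims = 100_000
# Under the null hypothesis both states share the same mu and delta, so the
# Colorado-Virginia gap is just the difference of two independent mean-zero shocks.
gap = rng.normal(0.0, pooled_sd, n_sims) - rng.normal(0.0, pooled_sd, n_sims)

two_sided = np.mean(np.abs(gap) >= observed_gap)  # compare with the ~40% figure above
one_sided = np.mean(gap >= observed_gap)          # compare with the ~20% figure
print(two_sided, one_sided)
```

(Since the difference of two independent normal draws is itself normal, the same probabilities could be computed analytically from a normal CDF; the simulation just makes the logic explicit.)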

If the error variance were smaller - that is, if less of the year-to-year variation could be ascribed to randomness - a gap of .78% might well be surprising. That would lead us to conclude that there may indeed be a structural difference - that Colorado is more advantageous to the Democratic candidate relative to the baseline. But the details of the model parameters are really not the point of this exercise. Any estimate of the variance from the sparse data used here is not very reliable. The fact that we are using two elections in which the same candidate stood for office suggests non-independence and error correlation, which would likely bias the variance estimate downward. We could look at the gap in 2008 as well - two observations are better than one - but in that case, why not include the whole of electoral history? The data exist. Moreover, to make our counterfactual predictions more precise, we need covariates - independent variables that are good predictors of vote share. In short, we need real statistical models.

This is what the Gelman et al. paper does. While a quick comparison of the data points hints at a structural bias in 2012, a more in-depth modeling approach suggests that the difference is not statistically significant. That is, any apparent structural advantage in the Electoral College in 2012 is likely nothing more than noise.

The problem with Nate's argument, ultimately, is that it posits a counterfactual (Romney winning the popular vote by a slight margin) without describing the uncertainty around that counterfactual. Without talking about uncertainty, it is impossible to discern whether the observed phenomenon is a result of chance or something substantively interesting.

It may seem that I'm spending a lot of time dissecting a rather trivial point, and that's true. My goal in this post was not to settle whether the "Electoral College advantage" is real - the 2012/2008 election results alone aren't enough data to make that determination. Rather, I wanted to walk through a simple example of inference to demonstrate why it's important to pay attention to probability and randomness when talking about data - to think about data in a statistical manner.

We cannot, just by looking at a set of data points, immediately tell which differences are due to structural factors and which are due to randomness. No matter what, we make assumptions when drawing inferences about the underlying structure. Ignoring randomness doesn't make it go away - it just means making extremely tenuous assumptions about the data (namely, zero variance).

To summarize, if you get anything out of this post, it should be these three points:
1) The data are not the parameters (the things we're actually interested in).
2) To infer the parameters from the data, you need to think about probability.
3) Statistical models are a helpful way of understanding the uncertainty inherent in the data.

I'm not suggesting that political writers need to become statisticians. Journalism isn't academia. It doesn't have the luxury of time spent carefully analyzing the data. I'm not expecting to see regressions in every blog post I read, nor do I want to. The Atlantic, WaPo, TNR, NYT are neither Political Analysis nor the Journal of the Royal Statistical Society.

But conversely, if there is a growing demand for "quantitative" or "numbers-oriented" analysis of political events, then writers should make some effort to use those numbers correctly. At the very least, it's valuable to think of any empirical claim - whether retrospective or predictive - in terms of inference. We have things we know, and we want to make arguments about things we don't know or cannot observe. At its core, argumentation is about reasoning from counterfactuals, and counterfactuals always carry uncertainty with them.

Even if the goal is just to describe one election cycle, one cannot get very far just comparing electoral returns at various levels of geographic aggregation. Again, election returns are just data - using them to make substantive statements, even if it's only about a single election, relies on implicit inferences which typically ignore the role of uncertainty. And if we want to go beyond just describing an election and identifying trends or features of the electoral landscape, quantitative "inference" without probability is just dart-throwing.

Wednesday, January 23, 2013

Public Opposition to Drones in Pakistan - A Question of Wording

Professors C. Christine Fair, Karl Kaltenthaler, and William J. Miller have a new article in the Atlantic on public attitudes toward U.S. drone attacks in Pakistan. The piece is a shortened version of a longer working paper that looks at the factors affecting knowledge of and opposition to the drone program among Pakistanis. They argue that Pakistani citizens are not as universally opposed to the drone program as is commonly believed and that public opinion is fragmented. According to Pew Research, only a slim majority report that they know anything about the drone program, and of those who do know "a lot" or "a little," only 44% say that they oppose the attacks.

Fair, Kaltenthaler, and Miller valuably point out that Pakistani attitudes are not homogeneous and that only a minority of Pakistanis even know about the drone program. Making broad, sweeping claims about Pakistani opinion on the drone program is difficult because the average citizen tends to know little about foreign policy issues (this is just as true in the United States as it is in Pakistan).

However, I think Fair, Kaltenthaler, and Miller may be going too far in asserting that "the conventional wisdom is wrong." Their claim that only 44% of Pakistanis oppose the drone program is highly sensitive to the choice of survey question. While Pew asks a series of questions on attitudes toward drones, Fair et al. choose to focus on only one of them. In doing so, they place a lot of faith in the reliability of that question as an indicator of respondents' opposition to the drone program. This is a persistent issue in all forms of survey research: scholars are interested in some unobservable quantity (public opinion) and have to use proxies (survey responses) to infer what cannot be directly seen. They must assume that their proxy is a good one.

Looking at the other survey questions suggests that this faith may be misplaced. Although only 44% of respondents say that they "oppose" drone attacks, 74% of respondents think that the drone attacks are "very bad" and 97% think that they are either "bad" or "very bad." If both questions were proxies for the same latent attitude, we would not expect such an extreme gap. If 17% of respondents "support" drone strikes but only 2% think that drone attacks are a good thing, then a substantial majority of those who say they "support" strikes must also think that they are "bad" or "very bad" - a strange puzzle. While it's not inconceivable for people to say they support policies that they think are bad, a more likely explanation is that respondents' answers are strongly affected by the way the questions are worded and that the question used by Fair et al. may not be a good proxy for the quantity of interest.
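To put rough numbers on the puzzle (treating the two items as if they were answered by the same pool of respondents, which is an assumption): even if every respondent who rated the attacks "good" or "very good" were also a "supporter," at most 2 of the 17 percentage points of supporters could view the attacks favorably, so at least

$$\frac{17\% - 2\%}{17\%} \approx 88\%$$

of self-described supporters would simultaneously be calling the attacks "bad" or "very bad."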

So what's the problem with the question? The main issue is that it asks respondents to evaluate a hypothetical future scenario rather than reflect on the existing drone program.
I'm going to read you a list of things the US might do to combat extremist groups in Pakistan. Please tell me whether you would support or oppose...conducting drone attacks in conjunction with the Pakistani government against leaders of extremist groups. 
The question is not about what the US is currently doing to combat extremist groups; it asks instead whether respondents would support a course of action that the US might take. Importantly, this course is framed in a way that is likely more appealing than the status quo.

First, it states that these drone attacks will be conducted in "conjunction" with the Pakistani government. While I don't know how this term was translated, it certainly suggests a lot more involvement by the Pakistani government than currently exists. The Pakistani government may "tacitly approve" the existing US drone program, but it is difficult to characterize drone attacks as being conducted in "conjunction" with Islamabad. Pew's survey respondents seem to agree - a significant majority (69%) believe that the U.S. is exclusively "conducting" the drone attacks, but a plurality (47%) believe that the attacks carry the approval of the government of Pakistan. Given that the unilateral nature of the strikes is often cited as a reason for their unpopularity, a respondent may support (or at least not oppose) the proposal in the question while still opposing the drone program as it is currently conducted.

Second, the question makes no mention of possible civilian casualties or other drawbacks while highlighting the benefit of combating extremist groups, a threat that many Pakistanis are concerned about. This may seem minor, but it matters a lot. As Fair et al. point out, the drone debate is a low-information environment. When respondents' attitudes about a policy are not well crystallized, subtle differences in question wording that highlight different costs or benefits can have a major impact on the responses given. For example, Michael Hiscox found [gated/ungated draft] that question framing had a sizable effect on Americans' support for or opposition to international trade liberalization - another case where the average respondent is typically not well informed. Respondents exposed to a brief anti-trade introduction were 17% less likely to support free trade.

Certainly, the question that Fair et al. use does not explicitly present respondents with a pro-drone viewpoint. However, it is still very likely that framing effects are skewing the responses. Consider another question asked by Pew that is simpler and more direct:

Do you think these drone attacks are a very good thing, good thing, bad thing, or very bad thing?
1% said "very good", 1% said "good", 23% said "bad" and 74% said "very bad."

I'm not arguing that this question is a superior measure. For example, it may not be measuring approval of drone strikes themselves, but rather a general sentiment toward the US (among the heavily male/internet-savvy subset interviewed). It may overstate "true" opposition, suggesting discontent with the way the program has been conducted but support for the idea of drone attacks. It might even be a consequence of social desirability bias. The point is that we don't know from the data that we have. Question framing has a substantial impact on survey responses and researchers should be careful about drawing conclusions without clarifying their assumptions about what the survey questions are measuring.

The question used in the study by Fair et al. to measure support for or opposition to the US drone program is not simply asking whether Pakistanis support or oppose the US drone program. It is asking whether Pakistanis would support a hypothetical drone program coordinated with the Pakistani government. Moreover, it exclusively highlights the benefits of such a program rather than the costs. This would not be a problem if support were invariant to question wording, but that is decidedly not the case. And even with this rather favorable framing, only 17% of respondents said that they would approve of drone strikes against extremists.

So I'm a little skeptical of the claim that commentators have been grossly overestimating the level of opposition to drone strikes. I certainly would not call Pakistani opposition to the drone program a "vocal plurality." This isn't to say that improved transparency and better PR on the part of the US would do nothing to improve perceptions, but it is a very difficult task that is constrained at multiple levels. Even if the Pakistani government were to become directly involved in the drone program (very unlikely, given the political consequences), the survey results suggest that it would garner the support of only about 17% of the subset of Pakistanis who are currently informed about the program.

The most important takeaway is that we simply don't know enough from the survey data. It's just too blunt. However, it does point to issue framing and elite rhetoric as important elements of opinion-formation on drones, suggesting interesting avenues for survey experiment work in the future.

h/t to Phil Arena for tweeting the link to the Atlantic article.