Friday, May 16, 2014

How Bad are Duplication Problems in GDELT Events Data? Very!

Edit: Neal Caren beat me to it! He comes to the same conclusion as well - GDELT appears to de-duplicate only when records are exactly the same, which misses a LOT.

Caerus Associates' Dr. Erin Simpson recently took FiveThirtyEight to task over an article about the recent kidnapping of 276 school girls in Nigeria. The problem: An appallingly poor and incredibly misleading use of the GDELT event dataset to analyze kidnapping "trends" in Nigeria. The full conversation as it unfolded on Twitter can be found preserved here.

This tweet in particular hit on an extremely important point
The disconnect that the original FiveThirtyEight article failed to take into account is that GDELT, and machine-coded events datasets in general, are not counts of events; they're counts of news reports. Assuming a one-to-one mapping between the two is extremely problematic. Increases in overall media volume or duplicate articles will inevitably give an inflated estimate of the underlying true event count.

Even more troublesome is that there's really no good fix for the duplication problem that has been developed. Normalizing by overall media volume doesn't do anything because we still don't know how many individual news reports = one event. When FiveThirtyEight writes "the database records 151 kidnappings on [April 15] and 215 the next," we don't know whether those kidnapping reports are all primarily talking about the single kidnapping in Borno State that garnered so much media attention, or 366 distinct kidnapping incidents (of course, it's the former).

How problematic is duplication? I decided to take a look at the data on Nigerian kidnappings myself to figure out just what percentage of these GDELT reports are all discussing the same exact event. Here's what I found:

First, multiple GDELT entries are often sourced from the same URL. It is very likely that these are duplicates since news reports, particularly wire articles, tend to focus on single events. I removed all entries with duplicate URLs and was left with only 98 kidnapping-related articles on April 15th and 123 on April 16th.
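A minimal sketch of this first pass, assuming each event record is a dict carrying its source address in a field named SOURCEURL (the column name GDELT uses in its daily files). The rows and example.com URLs below are made up for illustration and are not the actual Nigerian data.

```python
def dedupe_by_url(records, url_field="SOURCEURL"):
    """Keep only the first record seen for each distinct source URL."""
    seen = set()
    unique = []
    for rec in records:
        url = rec.get(url_field, "").strip()
        if not url:
            # No URL to compare on, so keep the record rather than guess.
            unique.append(rec)
            continue
        if url in seen:
            continue
        seen.add(url)
        unique.append(rec)
    return unique


# Hypothetical rows: two events coded from one article, one from another.
rows = [
    {"EventCode": "181", "SOURCEURL": "http://example.com/news/schoolgirls-abducted"},
    {"EventCode": "181", "SOURCEURL": "http://example.com/news/schoolgirls-abducted"},
    {"EventCode": "181", "SOURCEURL": "http://example.com/news/kogi-abduction"},
]
unique_rows = dedupe_by_url(rows)
print(len(rows), "records ->", len(unique_rows), "unique URLs")
```

This only collapses entries coded from the same article. It does nothing about different outlets reporting the same event.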

Second, for a story as prominent as the kidnapping in Chibok, it is very likely that multiple news sources will report on the same event. Detecting this type of duplication is a major challenge and an open area of research in the event data field. One way of going about this is to look at the URLs for a hint as to the content of the article - news website URLs often contain the entire headline. Presuming that any article about Nigerian girls on April 15th and 16th that gets coded as a kidnapping is likely talking about the same incident, I searched for all URLs in the kidnapping dataset that contained "girl". 61 of the unique URLs on April 15th and 56 of the addresses on April 16th matched the query.
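The URL search can be sketched as a plain substring match. The helper name split_by_keyword and the example.com addresses are hypothetical. A match like this is crude: it misses articles whose URLs carry no headline and can catch unrelated stories that happen to contain the keyword.

```python
def split_by_keyword(urls, keyword="girl"):
    """Split URLs into those containing the keyword (case-insensitive) and the rest."""
    needle = keyword.lower()
    matched = [u for u in urls if needle in u.lower()]
    unmatched = [u for u in urls if needle not in u.lower()]
    return matched, unmatched


# Hypothetical unique URLs left over after removing exact duplicates.
urls = [
    "http://example.com/world/nigeria-schoolgirls-abducted-from-school",
    "http://example.org/africa/Missing-Girls-Borno",
    "http://example.net/news/article-48213.html",
]
matched, unmatched = split_by_keyword(urls)
print(len(matched), "likely about the same story;", len(unmatched), "left to check by hand")
```

Whatever lands in the unmatched list is not necessarily a different event. It is just the set of articles the URL alone cannot classify.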

So that left 37 potential kidnapping events on April 15th unrelated to the Chibok kidnapping. How many of them actually were?

One way of answering this question would be to train a classifier on the set of articles (and in fact, I think this is one way forward for larger-scale de-duplication tasks). However, 37 is not that many articles, so reading them would suffice for this post. All but two articles were either about the Chibok kidnapping or unrelated to Nigeria. These two articles both referenced the same event: a kidnapping in Kogi State.

151 GDELT reports of kidnappings in Nigeria on April 15th - 2 actual kidnapping events.

This is just one example of a general problem with GDELT - its inability to do more than extremely basic de-duplication. Now this would be fine if it were selling itself as a tool for roughly monitoring what the news media is writing about, but GDELT wants to be an
"initiative to construct a catalog of human societal-scale behavior and beliefs across all countries of the world, connecting every person, organization, location, count, theme, news source, and event across the planet into a single massive network that captures what's happening around the world, what its context is and who's involved, and how the world is feeling about it, every single day."
This simply isn't going to happen until we can reliably use media reports to extract distinct international events.

A good de-duplication algorithm is essential for using machine-coded events data for more than just monitoring media attention. Training classifiers to detect similar phrasing and word usage in the texts coded by TABARI seems like a promising means of removing duplicates, particularly because a sizable amount of work has been done on this problem in computer science and computational linguistics. However, given that so many of the texts from which GDELT is coded are unavailable for public review due to licensing restrictions, this approach is unlikely to be feasible in the short term. This makes GDELT in its current form a rather limited dataset (except perhaps for more recent time periods where references to the sources are included). More generally, access to the source texts is a must for any effective de-duplication method - there's just too much information lost going from the text to the event code.
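To give a flavor of what text-based de-duplication looks like, here is a rough sketch that compares two sentences by the overlap of their three-word sequences (Jaccard similarity on word shingles). This is a much simpler stand-in for a trained classifier, not anything GDELT or TABARI does. The sentences, function names, and the 0.5 threshold are all assumptions chosen for illustration.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word sequences in the text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Share of shingles the two sets have in common (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text1, text2, k=3, threshold=0.5):
    """Flag two texts as likely reports of the same thing if their overlap is high."""
    return jaccard(shingles(text1, k), shingles(text2, k)) >= threshold


# Hypothetical lead sentences from three different articles.
s1 = "Gunmen kidnapped schoolgirls from a school in Borno State on Monday night"
s2 = "Gunmen kidnapped schoolgirls from a school in Borno State late on Monday"
s3 = "The central bank held interest rates steady this week"

print(is_near_duplicate(s1, s2))  # rewritten wire copy
print(is_near_duplicate(s1, s3))  # unrelated story
```

Note that this only works with the article text in hand, which is exactly what the licensing restrictions rule out for much of GDELT.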

One last comment:

FiveThirtyEight decided to double down on GDELT and posted another article trying to map the geographic distribution of kidnapping reports in Nigeria. This still obviously suffers from many of the problems I mentioned above - these reports are all mostly talking about the same single event - but there's an additional issue introduced by using geocodes. The problem is that the geocoding algorithm works by extracting references to geographic locations and often many locations are mentioned in a single article. An article on the kidnapping may simply say that it occurred in Nigeria while others specify the sub-unit (Borno State). Moreover, if the article mentions a statement by the government in Abuja (or even has a dateline where the reporter happens to be stationed in the capital), the event derived from that article can easily get coded to "Abuja, Abuja Federal Capital Territory, Nigeria." This is likely why the FCT is bright red on FiveThirtyEight's map - it's the near-default location for ALL events in Nigeria (particularly when the federal government is involved).

Monday, March 3, 2014

Why a Nuclear Ukraine is an Empty Counterfactual

In light of Russia's military invasion of Crimea, an action that is in complete violation of its security assurances to Ukraine under the Budapest Memorandum, a number of commentators have unearthed John Mearsheimer's 1993 article in Foreign Affairs arguing that Ukraine would have been better off keeping a nuclear deterrent after the fall of the Soviet Union in order to check Russian expansion.

As Walter Russell Mead succinctly put it
If President Obama does this, however, and Ukraine ends up losing chunks of territory to Russia, it is pretty much the end of a rational case for non-proliferation in many countries around the world. If Ukraine still had its nukes, it would probably still have Crimea. It gave up its nukes, got worthless paper guarantees, and also got an invasion from a more powerful and nuclear neighbor.
Indeed, this directly echoes Mearsheimer's argument for why Ukraine should have kept its arsenal
A nuclear Ukraine makes sense for two reasons. First, it is imperative to maintain peace between Russia and Ukraine. That means ensuring that the Russians, who have a history of bad relations with Ukraine, do not move to reconquer it. Ukraine cannot defend itself against a nuclear-armed Russia with conventional weapons, and no state, including the United States, is going to extend to it a meaningful security guarantee. Ukrainian nuclear weapons are the only reliable deterrent to Russian aggression. If the U.S. aim is to enhance stability in Europe, the case against a nuclear-armed Ukraine is unpersuasive. 
Second, it is unlikely that Ukraine will transfer its remaining nuclear weapons to Russia, the state it fears most. The United States and its European allies can complain bitterly about this decision, but they are not in a position to force Ukraine to go nonnuclear. Moreover, pursuing a confrontation with Ukraine over the nuclear issue raises the risks of war by making the Russians more daring, the Ukrainians more fearful, and the Americans less able to defuse a crisis between them. 
But in retrospect, Mearsheimer's second point was clearly wrong. Transferring its weapons to Russia is precisely what Ukraine did in 1994, in exchange for a weakly enforceable negative security assurance from Russia, the U.S., and the U.K. The puzzle then is why Ukraine failed to listen to Mearsheimer's sage advice and did the unthinkable - transfer its nuclear weapons to Russia.

In order to ask what would have happened had Ukraine opted to retain its arsenal, it is important to think through the entire counterfactual. The problem with the "if only Ukraine had nukes" line of argument is that it assumes that Russia would have tolerated a nuclear weapons state on its border in the first place. If we were to hold the world in 2014 constant and by magic turn Ukraine into a stable nuclear power, then perhaps Russia would have been deterred from occupying Crimea. But this is not the counterfactual we're interested in.

What would have happened had Ukraine decided not to return its nuclear weapons arsenal to Russia in the 1990s? Rather than allow a nuclear Ukraine on its doorstep, it is much more likely that Russia would have chosen to preempt Ukraine and secure its arsenal by force. We would have seen something very similar to the current situation in Crimea, but on a much grander scale. The bargain struck by Ukraine and Russia in 1994 was a way of avoiding such an outcome.

The logic behind this conclusion follows from a variation on Robert Powell's famous bargaining model of war analyzed in In the Shadow of Power.

To summarize, consider a scenario between two states with varying levels of power (ability to win in a war) bargaining over the division of some good. So long as the distribution of power between states is constant and war is costly, a mutually beneficial bargain should always be reached that reflects the underlying distribution of power. If one state is dissatisfied with the distribution of the good, then the satisfied state still finds it beneficial to make a concession such that the dissatisfied state is indifferent between peace and war. Assuming that striking a bargain is cheaper than fighting a war (a reasonable assumption), states will make a deal instead of going to war. This is the conventional "inefficiency puzzle" of war - why does war occur if states incur costs in fighting and can reach a Pareto-optimal bargain that reflects the post-war outcome without having to fight?
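The logic can be written out in a standard textbook form. The notation here is my own shorthand, not Powell's: the good is normalized to size 1, x is the share going to state A, p is the probability that A wins a war, and c_A and c_B are the two states' costs of fighting.

```latex
\begin{align*}
u_A(\text{war}) &= p - c_A, \\
u_B(\text{war}) &= (1 - p) - c_B, \\
\text{A prefers a deal } x \text{ to war} &\iff x \ge p - c_A, \\
\text{B prefers a deal } x \text{ to war} &\iff 1 - x \ge (1 - p) - c_B \iff x \le p + c_B.
\end{align*}
% Both conditions hold for any x in the interval
%   [p - c_A, \; p + c_B],
% which has width c_A + c_B > 0 whenever war is costly.
```

For example, with p = 0.7 and c_A = c_B = 0.1, any division giving A between 0.6 and 0.8 of the good leaves both sides at least as well off as fighting.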

One potential source of war that Powell identifies is a commitment problem that arises when the distribution of power between states shifts rapidly over time. Suppose that "now," state A is much stronger than state B, but in the "future," state B's power will increase relative to A. A knows that in the future, in order to avoid war with B, it will have to concede much more than it does now (since future B is stronger and in equilibrium, the distribution of goods reflects the distribution of power). If the size of the power shift is sufficiently large, A may be better off choosing to fight B when it is weaker and risk claiming the good through war, preventing the shift in power that would force it into a weaker bargaining position in the future. The cause of this war is the commitment problem facing B. B would certainly be better off preventing A from going to war in the "now," since in its weak state, it will likely lose. If it could credibly constrain itself from using its future bargaining leverage against A, then both B and A would be better off (avoiding war). However, in the absence of some third party mechanism, any promises to not exploit its future bargaining leverage made by B "now," are irrelevant once it gains power in the "future." A and B face a situation akin to a prisoner's dilemma. They are jointly better off when A does not go to war and B does not exploit its bargaining leverage, but if A does not go to war, B is best off "defecting" and using its newly acquired bargaining power to extract more concessions from A. Knowing that B will do this, A's best response is to fight and prevent B from rising.
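A stylized two-period version makes the condition concrete. This is a simplification for illustration, not Powell's full model, and every name and number below is an assumption: A wins a war with probability p_now today and p_future tomorrow, fighting costs A cost_a per period, delta is A's discount factor, a war today settles the division for both periods, and in the future B makes the offers and holds A to its war payoff. The most B can hand over today is the whole good (a payoff of 1).

```python
def war_payoff_now(p_now, cost_a, delta):
    """A's payoff from fighting today, with the outcome locked in for both periods."""
    return (1 + delta) * (p_now - cost_a)


def best_peace_payoff(p_future, cost_a, delta):
    """The most A can get peacefully: the whole good today, its war payoff tomorrow."""
    return 1.0 + delta * (p_future - cost_a)


def prefers_preventive_war(p_now, p_future, cost_a, delta=1.0):
    """True if no offer B can make today is enough to keep A from fighting.

    Equivalent to: delta * (p_now - p_future) > 1 - p_now + cost_a.
    """
    return war_payoff_now(p_now, cost_a, delta) > best_peace_payoff(p_future, cost_a, delta)


# Hypothetical numbers: a large shift in power versus a small one.
print(prefers_preventive_war(p_now=0.9, p_future=0.4, cost_a=0.1))  # large shift
print(prefers_preventive_war(p_now=0.9, p_future=0.8, cost_a=0.1))  # small shift
```

With the large shift, A gets 1.6 from fighting but at most 1.3 from any peaceful path, so it attacks. With the small shift the best peaceful path is worth 1.7 and a deal is possible. Anything that shrinks the expected shift in power can therefore move the pair from war back to bargaining.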

So how does this apply to Russia and Ukraine? In the post-Soviet period, Ukraine was rising relative to Russia. It had gone from a constituent part of the Russian-dominated Soviet Union to a sovereign state in its own right. However, in the early 90s, it still remained comparatively weaker (both in size and in military capability). It had yet to fully consolidate its "inherited" military capability - and what precisely it would inherit remained up for debate. Ukraine was in a peculiar situation where it could, to some extent, "choose" how much power it would have for itself - that is, how much of the Soviet military remaining on its territory would be returned to Russia.

Ukraine's choice to give up its nuclear arsenal can be understood in the context of the model as an attempt to credibly commit to not exploit future bargaining leverage by "smoothing" its rise relative to Russia. By giving up some of its future power (and future access to benefits), Ukraine made it more likely that Russia and Ukraine could reach the Pareto-optimal "no war" outcome in the 90s. Russia was better off choosing not to fight Ukraine and Ukraine would be better off not being invaded. Given the option to credibly commit to constrain its rise, Ukraine chose to do so. Additionally, because of substantially diminishing marginal returns to nuclear capabilities, Ukraine's "self-constraining" choice to give up its nuclear capability did not appreciably increase Russia's leverage over Ukraine. A Ukraine with 1,000 nuclear weapons is much more threatening to Russia than one with zero. Conversely, a Russia with 7,000 nuclear weapons is relatively comparable to one with 8,000.

Under the Powell model, had Ukraine chosen to not give up its nuclear weapons, Russia would have been much more likely at the time to take preemptive action to secure its arsenal by force. The deterrence argument is moot. If nuclear weapons had any meaningful deterrent effect on Russia, then Russia would likely have acted militarily in the 90s to prevent a nuclear Ukraine rather than let Ukraine wield its leverage in the future. This might have taken on a much larger character than the current action in Crimea, particularly as the loyalties of the formerly Soviet "Ukrainian" military forces at the time were much less clear than they are now. Given the relative "newness" of an independent Ukrainian state, an occupation would not be out of the question. While Mearsheimer's original article briefly considers the possibility of a pre-emptive Russian attack, it dismisses it much too quickly and easily. Although it argues that a pre-emptive war between Ukraine and Russia in 1993 would be risky, it fails to run the clock back even further and consider why Russia chose to not pre-empt Ukraine at an even earlier time (when the Ukrainian command and control was less established) unless it was convinced of Ukrainian denuclearization. Perhaps Ukrainian nuclear weapons were simply not a sufficient cause for preemption in this counterfactual 1990s. But if this is the case, then it is unlikely that they would be a credible constraint on Russia now, particularly for the type of war we are seeing right now in Crimea (as opposed to a full occupation). The Kargil crisis between India and Pakistan illustrates that a conventional, limited conflict between two nuclear powers over a disputed territory is a distinct possibility - an example of the classic "stability-instability paradox." Either way, the case for a Ukrainian deterrent falls flat.

Denuclearization was a near-necessity for state survival. It is unlikely that Ukraine could have retained its nuclear weapons arsenal in a world where it refused to bargain with Russia over their removal. This explains why Ukraine settled for a relatively toothless negative security assurance in exchange for the cost of transferring its nuclear arsenal - it would not have been able to keep them either way. The weakness of its security assurance reflected the bargaining facts on the ground. Russia simply did not have to offer Ukraine much to secure its arsenal since giving up its nukes was a Pareto-improving move. At the time, the nuclear arsenal was more of a curse than a blessing for Ukraine.

I certainly condemn Russia's actions and think the rest of the world should do what it can (which is admittedly not much at the moment) to return the situation to status quo ante. But in our rush to figure out how the Crimean invasion could have been prevented, we should not imagine that a simple reversal of a 20-year-old decision could have so easily solved the current crisis. It is better to understand the reasons why states may have chosen to behave the way they did rather than attributing foreign policy decisions to "error." Nor should we conclude that the non-proliferation agenda is somehow doomed because leaders will learn from Ukraine's example. States are already fully cognizant of the utility (and, in many cases, disutility) of nuclear weapons - another case isn't going to magically change their minds. The question we should be asking is why states still refrain from proliferating despite cases like Libya and Ukraine. And indeed, what the full story of Ukraine's denuclearization tells us is that Ukraine had very logical reasons for giving up its inherited deterrent. The scenario of a nuclear Ukraine in the 1990s would have been much, much worse - both for Ukraine and likely the world.

Edit: Phil Arena reminded me that this argument is very similar to a point William Spaniel makes in his dissertation. You can find a paper version of that argument here ("The Theory of Butter-for-Bombs Agreements").