Friday, May 16, 2014

How Bad are Duplication Problems in GDELT Events Data? Very!

Edit: Neal Caren beat me to it! He comes to the same conclusion as well - GDELT appears to de-duplicate only when records are exactly the same...which misses a LOT.

Caerus Associates' Dr. Erin Simpson recently took FiveThirtyEight to task over an article about the recent kidnapping of 276 school girls in Nigeria. The problem: An appallingly poor and incredibly misleading use of the GDELT event dataset to analyze kidnapping "trends" in Nigeria. The full conversation as it unfolded on Twitter can be found preserved here.

This tweet in particular hit on an extremely important point
The disconnect that the original FiveThirtyEight article failed to take into account is that GDELT, and machine-coded events datasets in general, are not counts of events; they're counts of news reports. Assuming a one-to-one mapping between the two is extremely problematic. Increases in overall media volume or duplicate articles will inevitably give an inflated estimate of the underlying true event count.

Even more troublesome is that no good fix for the duplication problem has been developed. Normalizing by overall media volume doesn't help because we still don't know how many individual news reports correspond to one event. When FiveThirtyEight writes "the database records 151 kidnappings on [April 15] and 215 the next," we don't know whether those kidnapping reports are all primarily talking about the single kidnapping in Borno State that garnered so much media attention, or 366 distinct kidnapping incidents (of course, it's the former).

How problematic is duplication? I decided to take a look at the data on Nigerian kidnappings myself to figure out just what percentage of these GDELT reports are all discussing the same exact event. Here's what I found:

First, multiple GDELT entries are often sourced from the same URL. It is very likely that these are duplicates since news reports, particularly wire articles, tend to focus on single events. I removed all entries with duplicate URLs and was left with only 98 kidnapping-related articles on April 15th and 123 on April 16th.
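The URL-based pass described above can be sketched in a few lines of Python. This is a minimal illustration, not the exact procedure used for the numbers in this post: the field names `SQLDATE` and `SOURCEURL` are assumed to match the GDELT event export, the records are invented toy data, and de-duplicating within each day (rather than across the whole file) is my assumption.

```python
from collections import Counter

def dedupe_by_url(records, date_key="SQLDATE", url_key="SOURCEURL"):
    """Keep only the first record seen for each (day, source URL) pair."""
    seen = set()
    unique = []
    for rec in records:
        key = (rec[date_key], rec[url_key].strip())
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

def count_by_day(records, date_key="SQLDATE"):
    """Count how many records fall on each day."""
    return Counter(rec[date_key] for rec in records)

# Toy records, invented for illustration; these are not real GDELT rows.
toy_records = [
    {"SQLDATE": "20140415", "SOURCEURL": "http://example.com/nigeria-girls-abducted"},
    {"SQLDATE": "20140415", "SOURCEURL": "http://example.com/nigeria-girls-abducted"},
    {"SQLDATE": "20140415", "SOURCEURL": "http://example.org/kogi-kidnapping"},
    {"SQLDATE": "20140416", "SOURCEURL": "http://example.com/nigeria-girls-abducted"},
]

unique_records = dedupe_by_url(toy_records)
daily_counts = count_by_day(unique_records)
print(dict(daily_counts))
```

Note that this only catches exact URL matches; the same wire story republished at a different address passes straight through.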

Second, for a story as prominent as the kidnapping in Chibok, it is very likely that multiple news sources will report on the same event. Detecting this type of duplication is a major challenge and an open area of research in the event data field. One way of going about this is to look at the URLs for a hint as to the content of the article - news website URLs often contain the entire headline. Presuming that any article about Nigerian girls on April 15th and 16th that gets coded as a kidnapping is likely talking about the same incident, I searched for all URLs in the kidnapping dataset that contained "girl". 61 of the unique URLs on April 15th and 56 of the addresses on April 16th matched the query.
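The keyword search amounts to a simple substring filter over the unique URLs. Here is a rough sketch, assuming a case-insensitive match (my choice, not necessarily what the original query did) and using made-up example addresses:

```python
def split_by_url_keyword(urls, keyword="girl"):
    """Split URLs into those containing the keyword and the rest."""
    needle = keyword.lower()
    matched = [u for u in urls if needle in u.lower()]
    remaining = [u for u in urls if needle not in u.lower()]
    return matched, remaining

# Invented example URLs; real news URLs often embed the headline like this.
toy_urls = [
    "http://example.com/news/nigeria-schoolgirls-abducted-in-borno",
    "http://example.org/world/Girls-Kidnapped-From-School",
    "http://example.net/africa/businessman-kidnapped-in-kogi",
]

matched, remaining = split_by_url_keyword(toy_urls)
print(len(matched), "matched;", len(remaining), "left to check by hand")
```

Applied to the real data for April 15th, this is the step that takes the 98 unique URLs down to the 98 - 61 = 37 that still need a closer look.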

So that left 37 potential kidnapping events on April 15th unrelated to the Chibok kidnapping. How many of them actually were?

One way of answering this question would be to train a classifier on the set of articles (and in fact, I think this is one way forward for larger-scale de-duplication tasks). However, 37 is not that many articles, so reading them would suffice for this post. All but two articles were either about the Chibok kidnapping or unrelated to Nigeria. These two articles both referenced the same event: a kidnapping in Kogi State.

151 GDELT reports of kidnappings in Nigeria on April 15th - 2 actual kidnapping events.

This is just one example of a general problem with GDELT - its inability to do more than extremely basic de-duplication. Now this would be fine if it were selling itself as a tool for roughly monitoring what the news media is writing about, but GDELT wants to be an
"initiative to construct a catalog of human societal-scale behavior and beliefs across all countries of the world, connecting every person, organization, location, count, theme, news source, and event across the planet into a single massive network that captures what's happening around the world, what its context is and who's involved, and how the world is feeling about it, every single day."
This simply isn't going to happen until we can reliably use media reports to extract distinct international events.

A good de-duplication algorithm is essential for using machine-coded events data for more than just monitoring media attention. Training classifiers to detect similar phrasing and word usage in the texts coded by TABARI seems like a promising means of removing duplicates, particularly because a sizable amount of work has been done on this problem in computer science and computational linguistics. However, given that so many of the texts from which GDELT is coded are unavailable for public review due to licensing restrictions, this approach is unlikely to be feasible in the short term. This makes GDELT in its current form a rather limited dataset (except perhaps for more recent time periods where references to the sources are included). More generally, access to the source texts is a must for any effective de-duplication method - there's just too much information lost going from the text to the event code.
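To give a flavor of what text-based de-duplication could look like when the source texts are available, here is a small sketch. It is not a trained classifier and not anything GDELT or TABARI does; it swaps in a much simpler baseline, Jaccard similarity over word shingles with greedy grouping. The function names, the threshold of 0.4, and the three sentences are all invented for illustration.

```python
import re

def shingles(text, k=3):
    """Return the set of k-word sequences (shingles) in a text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def group_near_duplicates(texts, threshold=0.4, k=3):
    """Greedily assign each text to the first group whose first member is similar enough."""
    groups = []  # list of (shingles of first member, [indices of members])
    for i, text in enumerate(texts):
        s = shingles(text, k)
        for representative, members in groups:
            if jaccard(s, representative) >= threshold:
                members.append(i)
                break
        else:
            groups.append((s, [i]))
    return [members for _, members in groups]

# Invented toy sentences, not real article text.
toy_texts = [
    "Gunmen kidnapped schoolgirls from a school in Borno State on Monday night",
    "Gunmen kidnapped schoolgirls from a school in Borno State late on Monday",
    "A businessman was kidnapped in Kogi State",
]

groups = group_near_duplicates(toy_texts)
print(groups)
```

On these toy sentences the two Borno reports land in one group and the Kogi report in another. Real articles are far messier: different outlets paraphrase heavily, so a fixed threshold on shingle overlap would miss many duplicates, which is part of why a learned approach seems more promising.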

One last comment:

FiveThirtyEight decided to double down on GDELT and posted another article trying to map the geographic distribution of kidnapping reports in Nigeria. This obviously still suffers from many of the problems I mentioned above - these reports are mostly talking about the same single event - but there's an additional issue introduced by using geocodes. The problem is that the geocoding algorithm works by extracting references to geographic locations, and often many locations are mentioned in a single article. One article on the kidnapping may simply say that it occurred in Nigeria while others specify the sub-unit (Borno State). Moreover, if the article mentions a statement by the government in Abuja (or even has a dateline where the reporter happens to be stationed in the capital), the event derived from that article can easily get coded to "Abuja, Abuja Federal Capital Territory, Nigeria." This is likely why the FCT is bright red on FiveThirtyEight's map - it's the near-default location for ALL events in Nigeria (particularly when the federal government is involved).
