Tuesday, May 26, 2015

Catch me if you can?

Last week, news that the data behind a groundbreaking field experiment purporting to show long-lasting persuasion effects in individuals' attitudes towards gay marriage had very likely been faked spread across the political science community and the internet at large. These revelations have prompted quite a bit of reflection among scholars about the importance of unwritten norms of trust in scientific research and how these sorts of frauds can be detected without bringing the research process to a grinding halt.

I think Jonathan Ladd makes an important point that we should not necessarily be basing research norms and policy on detecting ex-ante the sorts of extremely bad faith fabrications seen in LaCour (2014). If individuals are willing to so brazenly lie about and obfuscate their research, then it is likely they will be able to circumvent any such barriers. What shocked me the most about the LaCour fabrication was exactly how "bad faith" it was in its scope. Initially I had thought the manipulation was done to a previously collected experimental dataset - tweaking the means of the treated units in order to obtain the desired effect when one was not found upon first glance. Reading Broockman, Kalla and Aronow's note outlining the irregularities in the study showed that the manipulations were far more extreme - a la Stapel, the observations were just made up.

Not only is this sort of cheating hard to catch, it's difficult to envision a way in which the scientific community could make it more deterrable by increasing the costs of discovery. Don Green noted in his recent interview with NYMag post-revelation:
"But my puzzlement now is, if he fabricated the data, surely he must have known that when people tried to replicate his study, they would fail to do so and the truth would come out. And so why not reason backward and say, let’s do the study properly?"
The punishment being heaped upon Michael LaCour post-retraction has been swift and severe, and it is hard to believe that the size of this punishment was unanticipated. It's hard to see the decision to fabricate as a pure risk-reward trade-off - one Science publication, no matter how prestigious, isn't worth a lifetime of ostracism for data fabrication. Rather, as Stapel's reflections suggest, there is something intrinsically "thrilling" about the process of faking data itself.

So if cheating is hard to both detect and deter ex-ante, what is to be done? Ladd is right to emphasize post-publication replication and review. As we've seen in the LaCour case, it's not a question of if fraud is detected, it's when. However, the "when" can be as long as decades, particularly if the manipulation is small, and in the meantime the costs to the scientific community in terms of "false knowledge" can be sizeable.

But there's still room for some pre-publication review strategies. If we're interested in increasing the chances that a cheater will be detected, these strategies should aim to detect small violations and thereby push cheaters into more extreme fabrications. The more lies told, the more likely one will be detected and the entire scheme will fall apart. Note that in the LaCour case, much of the argument made by Broockman and Kalla questioning LaCour's data was made possible because a difficult-to-find dataset happened to be posted in an unrelated scholar's replication data. This one "happy" accident was what led scholars to tear down the whole facade of the study (including whether LaCour was truthful about received grants - something which almost no scholar would think to question absent serious suspicions).

One thought that came to mind to deter more subtle manipulations, particularly in experiments, is to create a way of verifying whether a dataset has been altered during the analysis phase. How do I know that the dataset released by researchers in the replication archive is the same as the dataset collected at the end of an experiment? The gap between study completion and publication can be many years in length. While researchers could allay concerns by posting replication data prior to publication, it's hard to imagine any scholar being willing to post their data pre-analysis and open themselves up to being "scooped."

What researchers could very easily do is post online a checksum of their dataset immediately after the dataset has been collected. They would then conduct their data cleaning and analysis as usual and subsequently post their replication data, including the original unmodified data file, after publication. Any scholar would be able to compute the checksum of the posted data file and compare it to the checksum uploaded prior to analysis to confirm that the original data file has not been tampered with during analysis. All modifications to the file (data cleaning, etc.) would be made transparently available in the analysis code. Because cryptographic hashing algorithms are designed such that small modifications of the original file yield completely different hashes, and the chances of a "hash collision" (two different files yielding the same checksum) are extremely low, it would be very difficult for a potential cheater to make manipulations while preserving the original hash.
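
To make the checksum idea concrete, here is a minimal sketch in Python. It assumes SHA-256 as the hashing algorithm (any modern cryptographic hash would serve), and the function names `sha256_of_file` and `matches_registered` are my own illustrative choices rather than part of any existing registry's tooling.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`.

    The file is read in chunks so that large datasets do not need
    to fit in memory all at once.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_registered(path, registered_hex):
    """Check a released raw data file against a previously posted checksum.

    `registered_hex` is the hex digest the researcher posted right after
    data collection; surrounding whitespace and letter case are ignored.
    """
    return sha256_of_file(path) == registered_hex.strip().lower()
```

A researcher would run `sha256_of_file` on the raw file once collection ends and post the resulting string; a later reader would call `matches_registered` on the raw file in the replication archive. Note that the check is byte-for-byte, so even re-saving the file in a different format or with different line endings would change the hash, which is why the raw file itself needs to be archived untouched.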

I could imagine such a procedure being made a component of study pre-registration as it is a very low burden on a researcher (one command line operation) and the registering organization would serve as a trusted third party in preserving the checksum. This certainly won't prevent data manipulation, but it would decrease the amount of time available to cheaters to manipulate data and possibly help increase trust in existing experimental datasets.