When private companies hold data that scholars need, what becomes of academic research?
By Henry Farrell
December 19, 2017
What exactly was the extent of Russian meddling in the 2016 election campaign? How widespread was its infiltration of social media? And how much influence did its propaganda have on public opinion and voter behavior?
Scholars are only now starting to tackle those questions. But to answer them, academics need data — and getting that data has been a problem.
Take a recent example: Jonathan Albright, a researcher at Columbia University, looked into a number of Russia-bought pages that Facebook had taken down. He concluded that they had amassed potentially hundreds of millions of views. David Karpf, an associate professor of media and public affairs at George Washington University, wasn’t convinced, arguing that most of the “people” who had liked these pages were very likely Russian bots. (Full disclosure: I commissioned and edited Karpf’s post on The Washington Post’s Monkey Cage blog.)
Usually such disagreements are resolved through the data. The problem that scholars like Albright and Karpf face is that there is little publicly available data on Facebook. For his study, Albright had to use an unconventional Facebook-owned tool called CrowdTangle to find anything at all. After he published his initial findings, Facebook quickly announced that it had “fixed a bug” in the software Albright used, making it impossible for other researchers to replicate his work. Albright and Karpf are left in a very unhappy situation: the data they need to understand what happened are simply no longer available.
That is one example of an extraordinary change in the politics and practice of social science. Businesses like Facebook hold crucial information about people’s social and political behavior. But they are extremely reluctant to provide that data to outsiders, unless those outsiders sign nondisclosure agreements (NDAs) that give Facebook the power to sue if the information is used in ways that the company finds objectionable.
This marks a significant change for researchers. It used to be that states were the most important source of data on their citizens, economy, and society. They had to collect and aggregate large amounts of information (for example, censuses of individuals and firms) for their own purposes. In addition, state agencies helped fund social-science data gathering, as with the National Science Foundation’s decades-long support of the American National Election Studies.
Consequently, the politics of data access used to be more focused on the state. Sometimes the state was reluctant to provide information, whether to protect privacy, cover up its mistakes, or keep control of sensitive information. But for the most part, it provided access, and scholars could put pressure on it when it stalled. In that world, scholars could draw on common sources and usually (although not always) had more or less equal access.
There was a downside — scholars’ questions were shaped by the data they could obtain. But the upside was that research was usually reproducible. Disagreements like that between Albright and Karpf could be conducted on equal terms.
We are now moving into a new era for social science. For many scholarly purposes, large firms like Google, Facebook, and Apple have much better data than the government — and those data are much less accessible. This new universe of private data is reshaping social-science research in ways that are badly understood.
In this brave new world, data access is a jungle. There are no universal rules: Each firm has its own policy on whether to provide social scientists with general access to its data, to grant access on an ad hoc basis, or to refuse it entirely. When these firms build relationships, it is typically with individual researchers, or small groups of researchers, whose work might be valuable to the firm. And those relationships are usually covered by NDAs or other contractual rules restricting how researchers can use the data and summarize it in published research.
That can have big consequences for academic careers. Some scholars — those with connections to the right firms — can prosper. Those without connections have to get creative to do their work at all. Sometimes it’s possible for them to get rough-and-ready access to aggregated data without strings, via tools like Google Trends. Sometimes they can repurpose tools that Facebook and other companies make available to advertisers or other commercial clients (CrowdTangle is one example). However, data gathered in those ways may not be suited to specific research purposes.
That is not to say that the data that come directly from the firms are perfect, or anything like it. Behind every great data set there lies a great crime. Pretty much all social-science data are biased by the assumptions and (sometimes problematic) methodologies used to gather them. State-constructed data sets were flawed in many ways in their heyday and continue to be. However, as professional standards improved, the flaws became better understood and more transparent.
New forms of data from private companies are more problematic. They are collected primarily for commercial purposes rather than research. They are often collected by machine-learning techniques, which produce classifications that are obscure even to their creators. The findings based on these data are fed back to reshape algorithms with an eye toward changing human behavior — for instance, making individuals more likely to click on ads — so that data are often not comparable over time.
In combination, those factors can mean that it is really hard to interpret the data. For example, to what extent might changes in behavior on Facebook be driven by underlying changes in society, and to what extent by changes to Facebook’s algorithms? Except under certain circumstances — say, when Facebook runs controlled experiments — it can be hard to say.
Access restrictions pose further challenges. NDAs and other agreements may prevent researchers not only from sharing data with their colleagues but also from disclosing valuable information about how the data were gathered and processed.
Together, those factors mean that we may be about to witness a collision between the reproducibility movement, which is gaining ground in the social sciences, and the new world of proprietary data, which undermines reproducibility because the information is inaccessible to others and liable to be destroyed if it does not retain commercial value.
Even more worrisome, corporate control of data can lead to two kinds of selection bias. More obviously, unflattering findings will probably not be published if corporations have any say. For example, Uber funded social scientists to carry out research on whether its service was cheaper and faster than standard taxis. The research suggested that Uber was indeed cheaper and faster, but Uber insisted on retaining control over whether the results were published. It doesn’t take an especially suspicious mind to guess that Uber would have withheld permission if the results had suggested that its service was worse than taxis. When businesses use proprietary access to data and legal agreements to retain control over publication, they have strong incentives to allow the publication of only material that is flattering to them. Over time, this will skew publicly available research.
More insidiously, if scholars start relying on private businesses for data, the contours of entire academic fields may become subject to pervasive forms of selection bias. Certain research topics and methods will be favored, while others fall by the wayside. Facebook is highly sensitive about the suggestion that its service can have any but the most innocuous political consequences. Its researchers and political scientists collaborated on a major experiment showing that Facebook prompts could make people more likely to vote — but it was notably sensitive to further inquiry about how Facebook news placement influences political behavior, deleting a YouTube video in which a Facebook researcher had described what they had done in a little too much detail. Facebook may well have big effects on politics, not only in U.S. elections but in other contexts (such as the Arab Spring). But it has no incentive to allow scholars to use its data to carry out research on most of those effects, so entire lines of inquiry may end up stillborn.
Then there’s the ethical aspect of conducting research using private-company data. Companies like Facebook, not bound by academic norms, can be tempted to make dubious ethical decisions, when, for example, they treat the media ecosystems of entire countries like mice in a laboratory experiment. Yet academics may deal no better with temptation. The Simpsons character Dr. Marvin Monroe harbors the ambition to build a “Monroe box,” in which he will keep an infant until the age of 30, subjecting it at random moments to electric shocks and showers of icy water to test the hypothesis that it will resent its captor. All social scientists have a little Marvin Monroe in their hearts, and many might be tempted, if only they had the means, to send multitudes of human beings scurrying like rats through social-media mazes of subtly skewed information to see which paths they take. In a world dominated by private-company data, it becomes easier for scholars to carry out work outside the usual ethical restrictions. The authors of a 2014 study on social networks and “emotional contagion” did not have to seek institutional-review-board approval for their work, since the experiment had already been carried out by Facebook. Expect this trend to continue as the use of private data grows.
As the Albright-Karpf story shows, these issues are no longer merely academic. Facebook is undergoing intense political scrutiny because of its apparent blindness to Russian influence operations. Congressional investigators are more likely than outside scholars to succeed in insisting on access to data. The politics of data are changing, perhaps significantly. Many members of Congress no longer find it appropriate that so much of the national conversation takes place inside a black box. Twitter, which is also coming under increased scrutiny, has been more open, although it, too, has been capricious in its willingness to let others gain access to its data.
This will probably end in mutual frustration and confrontation. Members of Congress are not notably technically adept, and over the decades they have stripped away many of the institutions (such as the Office of Technology Assessment) that could have provided them with authoritative guidance.
Yet there is another possible path forward. Facebook and the other big players in the world of social data might relieve some of the political pressure on them by remaking their relationship with academe. It is going to be hard for these businesses to maintain the hands-off posture toward their data that they have taken in the past. If they are going to have to be more publicly accountable, they are probably better off building relations with scholars, who have technical understanding, than with political appointees, who typically do not.
Facebook, Google, and Twitter might agree to provide data to an independent academic observatory. This arrangement would operate under broadly agreed and explicit ethical rules. The observatory would carry out and publish research on problems arising from the abuse of social-media services by third parties (as plausibly happened with Russia), accredit trustworthy researchers who could have access to data for both original research and replication purposes, and coordinate with government and other parties with a clear and legitimate interest in combating abusive behavior.
More broadly, this observatory could provide a factual anchor for debate about the actual consequences of social media for society and politics. While technology companies might sometimes not like its findings, they would be better off if political debates were grounded in facts and data rather than in ill-informed, sometimes alarmist second-hand speculation.
Such an arrangement could provide oversight without requiring the companies to completely sacrifice their business models. It could also help resolve cross-cutting security problems better than any single company could. Commercial enterprises have little incentive to share data with their competitors, since such data are usually at the heart of their business models. This leads to a general fragmentation of knowledge, in which competing firms have different kinds of data that could illustrate a problem from multiple perspectives. Russian influence operations have involved combined actions on Facebook, YouTube, Twitter, and Google. An independent center could trace those relationships across different services without compromising individual companies’ commercial needs.
All of this would involve considerable creativity and ingenuity on the part of the businesses themselves as well as the researchers whom they might work with. They would have to craft a new kind of arrangement for such an observatory, similar in some respects to existing organizations like the Computer Emergency Response Teams, or CERTs, which already play a key role in cybersecurity. Such an organization would require substantial independent funding, probably channeled through a foundation or other nonprofit arrangement. That would not only solve some of the most vexing problems of the relationship between scholarship and e-commerce but would also integrate scholarly research and big-data capacities in pursuit of important social and political objectives.
It’s not clear that this outcome is politically feasible right now. Facebook, Twitter, and Google still very likely think about their situation as a short-run public-relations problem rather than the existential crisis that it threatens to become. That is shortsighted. Crises and scandals have a tendency to escalate, especially when a lack of data means that even sophisticated researchers are forced to guess at what is actually going on. If social-media companies don’t wake up to the problems of the world they are building — one in which the most crucial information about how politics and society work is hidden behind proprietary walls and nondisclosure agreements — they are likely to find their basic business models under attack after the next major scandal, or the one after that.
Henry Farrell is a professor of political science and international affairs at George Washington University and co-chair of the advisory board of the Social Science Research Council’s Digital Culture Initiative. An earlier version of this article was published in Parameters.