The Chronicle Review

Big Data. Big Obstacles.

February 02, 2015

After decades of fretting over declining response rates to traditional surveys (the mainstay of 20th-century social research), an exciting new era would appear to be dawning thanks to the rise of big data. Social contagion can be studied by scraping Twitter feeds; peer effects are tested on Facebook; long-term trends in inequality and mobility can be assessed by linking tax records across years and generations; social-psychology experiments can be run on Amazon’s Mechanical Turk service; and cultural change can be mapped by studying the rise and fall of specific Google search terms. In many ways there has been no better time to be a scholar in sociology, political science, economics, or related fields.

However, what should be an opportunity for social science is now threatened by a three-headed monster of privatization, amateurization, and Balkanization. A coordinated public effort is needed to overcome all of these obstacles.

While the availability of social-media data may obviate the problem of declining response rates, it introduces new problems concerning the level of access that researchers enjoy. Although some data can be culled from the web—Twitter feeds and Google searches—other data sit behind proprietary firewalls. And as individual users tune up their privacy settings, the typical university or independent researcher is increasingly locked out. Unlike federally funded studies, which are required to make their data publicly available, companies like Yahoo or Alibaba face no such mandate. The result, we fear, is a two-tiered system of research. Scientists working for or with big Internet companies will feast on humongous data sets—and even conduct experiments—while scholars who do not work in Silicon Valley (or Alley) will be left with proverbial scraps.

While there have historically been glorious bastions of private, for-profit research institutions—think of the Nobel prizes won at Bell Labs or the innovations of Xerox PARC (including the computer mouse and desktop icons)—that kind of private research for the public good was conducted in a very different context. Today, public investment in science is waning as federal budgets are cut and states do not fill in the funding gaps for their flagship research universities. Meanwhile, the average corporation has been transformed by the shareholder-value revolution to be much more concerned with short-term profits and thus increasingly oriented away from basic research. Social science conducted at Foursquare or Yahoo typically must serve the bottom line.

Hand in hand with the privatization of data is the amateurization of their analysis. Does time on Facebook really make us more depressed, as one recent study has claimed? Well, maybe. But perhaps it is just that depressed people spend more time alone, logging on in their dark rooms. Or to take a more consequential example, founts of municipal data have largely been credited with improving policing by redirecting cops to high-crime areas. But the analysts driving such policy have failed to consider the endogeneity problem: Since crime is, by definition, recorded when someone reports it or when police find it, perhaps the presence of police in certain high-crime neighborhoods is part of the reason they are high crime in the first place. That is, initial conditions may have locked policing into a vicious cycle of increased concentration in certain communities. Trained social scientists are needed to deal with such big-data pitfalls as reverse causality, unobserved heterogeneity, sample-selection issues, aggregation bias, and spatial or temporal autocorrelation. There has been incredible progress in the last few decades in tackling these issues in observational social science; let's not lose that progress by thinking that big data is only the province of computer science and applied math.

Finally, there is the issue of Balkanization. While administrative records, consumer data, and other electronic contrails are ubiquitous, on their own they are often insufficient. It is great to know that if researchers tweak the Facebook posts we see in our feed, they can detect changes in our mood thanks to what we, in turn, post, but what is society supposed to do with that information? If we know where gunshots are fired, thanks to geo-located, distributed sound sensors, what does that tell us about social inequality? Not much. But when these sorts of data are analyzed alongside, for example, traditional measures of depression or cognitive functioning, the results can be jaw-dropping. Patrick Sharkey took spatio-temporal data on Chicago homicides and linked it to an old-school assessment of cognitive performance among poor children in the city. He found that among children who had experienced a homicide in their neighborhood within the three days prior to testing (testing dates were effectively random), scores dropped. This study—by linking purely descriptive administrative data to costly psychometric assessments—powerfully demonstrated that violent neighborhoods actually affect child development rather than just reflect it.

To address this triple threat of privatization, amateurization, and Balkanization, public social science needs to be bolstered for the 21st century. In the current political and economic climate, social scientists are not waiting for huge government investment like that of the Cold War era. Instead, researchers have started to knit together disparate data sources by scraping, harmonizing, and geocoding any and all information they can get their hands on.

Currently, many firms employ well-trained social and behavioral scientists who are free to pursue their own research; likewise, some companies have programs through which scholars can apply to be in residence or to work with their data extramurally. However, as Facebook states, its program is "by invitation only and requires an internal Facebook champion." And while Google provides services like the Ngram Viewer to the public, such limited efforts at data sharing are not enough for truly transparent and replicable science.

While many folks are legitimately concerned about privacy in an era of Internet giants, we think that these private firms and public-sector agencies should be made to share their data more, not less—but with the National Science Foundation, not the National Security Agency. Professional social scientists have a long, hard-won tradition of responsible conduct and few privacy breaches. Institutional procedures are in place to ensure that these data would be used for the public good.

We need to create a mechanism for all researchers to access data for scientific research. As a group funded by the NSF, we have been conducting a listening tour of social-science disciplines, collecting ideas for just such an effort.

Proposals range from a distributed approach that relies on volunteers to passively contribute information through a mobile-device app to a nationally coordinated network of 20 regional data centers, as we, ourselves, have proposed. Our vision is for an integrated but distributed framework or platform that is both scalable (from local to national) and flexible (having core data but also data that are site specific and locally relevant). Regional data centers would undertake the task of linking a broad array of information—administrative data, media and social media, Census and other surveys, ethnographic data, and data from experiments such as randomized controlled trials. Such data-sharing solutions could also facilitate novel cross-linkages. Rather than just apply to Yahoo to work with its data in a silo, researchers would be able to link such proprietary data to other, diverse sources of information including those of other firms and government agencies, or even to newly collected information.

To be clear, we are not advocating the abandonment of nationally representative, long-running scientific treasures like the Panel Study of Income Dynamics, the National Election Study, or the General Social Survey; we think connecting such studies to other, novel forms of data only serves to strengthen them. We are not naïve about other perils of social science in the era of big data, including privacy breaches, but we are certain that such disasters (and others) are more likely to befall us if social scientists are not active participants in the big-data revolution.

Dalton Conley, New York U.
J. Lawrence Aber, New York U.
Henry Brady, the U. of California at Berkeley
Susan Cutter, the U. of South Carolina
Catherine Eckel, Texas A&M U.
Barbara Entwisle, the U. of North Carolina at Chapel Hill
Darrick Hamilton, the New School
Sandra Hofferth, the U. of Maryland at College Park
Klaus Hubacek, the U. of Maryland at College Park
Emilio Moran, Michigan State U.
John Scholz, Florida State U.