
Recent Big-Data Struggles Are ‘Birthing Pains,’ Researchers Say

By  Marc Parry
March 28, 2014
A recent paper dissecting the failures of Google Flu Trends, a flu-monitoring system that became a Big Data poster child, suggested to some critics that the field had been overhyped. But researchers say it still holds great promise, if done right.
Justin Sullivan, Getty Images

In 2009, David Lazer sounded the call for a fresh approach to social science. By analyzing large-scale data about human behavior—from social-network profiles to transit-card swipes—researchers could “transform our understanding of our lives, organizations, and societies,” Mr. Lazer, a professor of political science and computer science at Northeastern University, wrote in Science. The professor, joined by 14 co-authors, dubbed this field “computational social science.”

This month Mr. Lazer published a new Science article that seemed to dump a bucket of cold water on such data-mining excitement. The paper dissected the failures of Google Flu Trends, a flu-monitoring system that became a Big Data poster child. The technology, which mines people’s flu-related search queries to detect outbreaks, had been “persistently overestimating” flu prevalence, Mr. Lazer and three colleagues wrote. Its creators suffered from “Big Data hubris.” An onslaught of headlines and tweets followed. The reaction, from some, boiled down to this: Aha! Big Data has been overhyped. It’s bunk.


Not so, says Mr. Lazer, who remains “hugely” bullish on Big Data. “I would be quite distressed if this resulted in less resources being invested in Big Data,” he says in an interview. Mr. Lazer calls the episode “a good moment for Big Data, because it reflects the fact that there’s some degree of maturing. Saying ‘Big Data’ isn’t enough. You gotta be about doing Big Data right.”

Among the academics reading and sharing it, Mr. Lazer’s article has fed a conversation about what it means, exactly, to “do Big Data right.” How do you study data gathered by companies that constantly tweak their services for business reasons? How do you make such data transparent? How do you train people who can bridge the divide between social science and computer science?

The conversation comes as other scholars, too, are puncturing inflated claims for Big Data. In a recent Pacific Standard article, two political-science professors, John Sides and Lynn Vavreck, debunked a media meme from President Obama’s 2012 campaign: that he won re-election thanks to his operatives’ data “wizardry.” And two other researchers have exposed flaws in studies that mine social-media behavior to discover people’s demographic traits.

Unstable Algorithm

Google Flu Trends made its debut in 2008. By analyzing the flu-related searches of Google users, the system estimates the prevalence of flu outbreaks in almost real time—information that, in theory, could be used to direct resources and save lives. By contrast, there is a lag of roughly two weeks in flu estimates produced by the Centers for Disease Control and Prevention, which are based on reports from labs across the country. “We can accurately estimate the current level of weekly influenza activity in each region of the United States, with a reporting lag of about one day,” Google scientists wrote in a widely cited 2009 Nature paper.
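The underlying idea can be sketched as a simple "nowcast": fit historical CDC flu rates against flu-related query volume, then plug in the current week's volume, which is available immediately rather than two weeks later. The sketch below is a toy ordinary-least-squares version with invented numbers; the real system relied on a far more elaborate query-selection and modeling pipeline.

```python
# Toy nowcasting sketch: regress historical CDC flu rates on
# flu-related search-query volume, then estimate the current week.
# All numbers here are invented for illustration.

def fit_linear(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    return a, my - a * mx

# Historical weeks: (query-volume index, CDC %ILI reported ~2 weeks later)
history_volume = [10, 14, 22, 35, 50, 61]
history_ili    = [0.8, 1.0, 1.5, 2.3, 3.1, 3.9]

a, b = fit_linear(history_volume, history_ili)

# Nowcast: this week's search volume is known today, long before
# the CDC's lab-based estimate for the same week arrives.
current_volume = 44
print(f"estimated %ILI this week: {a * current_volume + b:.2f}")
```

The fragility Mr. Lazer describes enters exactly here: if Google changes how search works and query volume shifts for reasons unrelated to flu, the fitted relationship silently stops holding.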

Last year, though, the Google project experienced what Mr. Lazer describes as a “Dewey Beats Truman” moment. Nature reported that Google Flu Trends was estimating more than double the proportion of flu-related physician visits that the CDC was reporting. In reality, according to Mr. Lazer’s new research, Google Flu Trends has been “systematically overshooting by wide margins for three years.”

So how did it go off the rails? One likely explanation is the unstable nature of Google’s search algorithm, Mr. Lazer says, which was adjusted in various ways that probably increased the number of flu-related searches. (Google has tried to make it easier for its users to search for health-related information.) Google Flu Trends should have been modified accordingly. It wasn’t. And that reflects a core problem facing researchers who hope to use such data: “Most big data that have received popular attention,” Mr. Lazer wrote in Science, “are not the output of instruments designed to produce valid and reliable data amenable for scientific analysis.”

Related problems have emerged with efforts to use Twitter as a source of research data. Sociologists and market researchers, among others, are interested in developing tools to infer the demographic attributes—often not explicitly stated—of various online populations. Can you train a machine to glean a Twitter user’s political beliefs, for example, by his use of terms like “Obamacare” or “#p2,” a progressive hashtag? Early on, computer scientists reported that their automated tools could infer political orientation with upward of 95 percent accuracy.

Derek Ruths, an assistant professor of computer science at McGill University, wrote one of those optimistic papers. But then it dawned on him that these computer models, purportedly so accurate, were based on analysis of the most partisan, politically active Twitter users—a minority of the population. He and a master’s student, Raviv Cohen, conceived a study that corrected for that. What would happen when their models were tested on “normal” Twitter users, the kind of people who don’t tweet about politics so much and don’t use such partisan language?

The results were sobering, as Mr. Ruths and Mr. Cohen reported last year in a paper titled, “Classifying Political Orientation on Twitter: It’s Not Easy!” Past researchers had been “systematically overoptimistic” in the claims they made about machines’ ability to infer political orientation. When standard techniques were tested on the “normal” population of Twitter users, methods that had reported greater than 90 percent accuracy achieved barely 65 percent.
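The sampling-bias effect Mr. Ruths and Mr. Cohen describe can be simulated in a few lines: a keyword classifier looks nearly perfect on heavily partisan users, whose tweets are full of telltale hashtags, and falls toward chance on users who rarely emit any political signal. The vocabulary and users below are entirely invented; this is an illustration of the evaluation pitfall, not the authors' actual method.

```python
import random

random.seed(0)

# Invented partisan hashtags and neutral filler vocabulary.
LEFT_TAGS, RIGHT_TAGS = {"#p2", "#uniteblue"}, {"#tcot", "#teaparty"}
NEUTRAL = ["dinner", "football", "weather", "music", "#cats"]

def make_user(label, partisanship):
    """Simulate 20 tokens; `partisanship` = chance a token is a partisan tag."""
    tags = sorted(LEFT_TAGS if label == "left" else RIGHT_TAGS)
    tokens = [random.choice(tags) if random.random() < partisanship
              else random.choice(NEUTRAL) for _ in range(20)]
    return tokens, label

def classify(tokens):
    """Keyword classifier: count partisan tags; guess when there is no signal."""
    left = sum(t in LEFT_TAGS for t in tokens)
    right = sum(t in RIGHT_TAGS for t in tokens)
    if left == right:
        return random.choice(["left", "right"])
    return "left" if left > right else "right"

def accuracy(users):
    return sum(classify(toks) == lab for toks, lab in users) / len(users)

partisan = [make_user(random.choice(["left", "right"]), 0.5) for _ in range(500)]
normal   = [make_user(random.choice(["left", "right"]), 0.02) for _ in range(500)]

print(f"partisan users: {accuracy(partisan):.0%}")  # high: tags carry the signal
print(f"normal users:   {accuracy(normal):.0%}")    # much nearer to chance
```

Evaluating only on the partisan sample, as the early papers effectively did, makes the method look far stronger than it is for the population at large.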

The emerging problems highlight another challenge: bridging the “Grand Canyon,” as Mr. Lazer calls it, between “social scientists who aren’t computationally talented and computer scientists who aren’t social-scientifically talented.” As universities are set up now, he says, “it would be very weird” for a computer scientist to teach courses to social-science doctoral students, or for a social scientist to teach research methods to information-science students. Both, he says, should be happening.

Mr. Lazer and others see the potential for enormous rewards if they do. Nicholas Christakis, a social scientist and physician who directs the Human Nature Lab at Yale University, ticks off a list of questions that Big Data can address: What are the origins of tastes and norms? Where do people’s desires come from? How do collective phenomena emerge from individual actions? “We’re witnessing the birth of a new kind of social science,” he says. Current struggles are “the birthing pains of that process.”

Erez Lieberman Aiden, a biologist and computer scientist who has mined Google Books to study linguistic and cultural evolution, points out what’s on the horizon. People are beginning to use Google Glass to record everything they see. The “Human Speechome Project”—an MIT Media Lab effort conceived to study language development by recording nearly everything a single child hears and sees from birth to age 3—provides a glimpse of how scientists might profit from such media records. Mr. Aiden speculates that someone will be born in the next 20 years whose life generates both a biography and a complete visual transcript.

In other words, Big Data techniques are here to stay. And they’re going to get bigger.

Marc Parry
Marc Parry wrote for The Chronicle about scholars and the work they do. Follow him on Twitter @marcparry.
© 2023 The Chronicle of Higher Education