Scholars Elicit a ‘Cultural Genome’ From 5.2 Million Google-Digitized Books

By Marc Parry
December 16, 2010

The English language is going through a time of huge growth. Humanity is forgetting its history more rapidly each year. And celebrities are losing their fame faster than in the past.

Those are some of the findings in a paper published on Thursday in the journal Science by a Harvard-led team of researchers. The scholars quantified cultural trends by investigating the frequency with which words appeared over time in a database of about 5.2 million books, roughly 4 percent of all volumes ever published, according to Harvard’s announcement.

The research team, headed by Jean-Baptiste Michel and Erez Lieberman Aiden, culled that digital “fossil record” from more than 15 million books digitized by Google and its university partners. Google is giving the public a glimpse of the researchers’ data through an online interface that lets users key in words or phrases and plot how their usage has evolved. The paper’s authors bill this as “the largest data release in the history of the humanities.”

Scholars have explored quantitative approaches to the humanities for years. What’s novel here is the volume of material. According to a Google spokeswoman, the data set of 5.2 million books includes both in- and out-of-copyright titles in several languages from 1500 to 2008. Its more than 500 billion words amount to a sequence of letters 1,000 times as long as the human genome. This “cultural genome” would stretch to the moon and back 10 times over if arranged in a straight line.

A Harvard-led team used books digitized by Google to analyze the occurrence of words since 1500. A graph shows the appearance of four words from 1965 to 2008: “fry” (in red), “bake” (blue), “grill” (green), and “roast” (yellow). Photograph by Google Books

“It radically transforms what you can look at,” says Mr. Aiden, a junior fellow in Harvard’s Society of Fellows and principal investigator of the Laboratory-at-Large, part of Harvard’s School of Engineering and Applied Sciences. Mr. Aiden and Mr. Michel, a postdoctoral researcher in Harvard’s psychology department and its Program for Evolutionary Dynamics, call their approach “culturomics.”

The method’s cross-disciplinary potential is demonstrated in the Science paper’s findings:

  • The English lexicon grew by 70 percent from 1950 to 2000, with roughly 8,500 new words entering the language each year. Dictionaries don’t reflect a lot of those words. “We estimated that 52 percent of the English lexicon—the majority of the words used in English books—consists of lexical ‘dark matter’ undocumented in standard references,” the authors write.
  • Researchers tracked references to individual years to demonstrate how humanity is forgetting its past more quickly. Take “1880”: It took 32 years, until 1912, for references to that year to fall by half. But references to “1973” fell by half within 10 years.
  • Compared with their 19th-century counterparts, modern celebrities are younger and better known, but their time in the limelight is shorter. Celebrities born in 1800 initially achieved fame at an average age of 43, compared with 29 for celebrities born in 1950.
  • Mining the data set can yield insights into the effects of censorship and propaganda. The authors give the example of the Jewish artist Marc Chagall. His name comes up only once in the German corpus during the Nazi era, even as he became increasingly prominent in English-language books.
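
The "fall by half" measurement in the second finding can be illustrated with a short sketch. It assumes the decline is measured from the year of peak frequency, which may not match the paper's exact procedure. The function name `half_life` is hypothetical, and the frequency values are invented for illustration, not taken from the paper.

```python
def half_life(freq_by_year):
    """Years from the peak until frequency first falls to half the peak.

    freq_by_year maps a publication year to how often a term (for example
    the string "1880") appears in books from that year. Returns None if the
    series never drops to half of its peak.
    """
    peak_year = max(freq_by_year, key=freq_by_year.get)
    peak = freq_by_year[peak_year]
    for year in sorted(y for y in freq_by_year if y > peak_year):
        if freq_by_year[year] <= peak / 2:
            return year - peak_year
    return None


# Invented numbers, chosen only so the decline mirrors the "1880" example.
toy_series = {1880: 100.0, 1890: 80.0, 1900: 65.0, 1912: 50.0, 1920: 40.0}
print(half_life(toy_series))
```

A shorter half-life for a later year, computed the same way, is what the researchers read as faster forgetting.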

The paper and the public data-mining tool come as Google’s broader book-digitization effort remains in legal limbo. Authors and publishers have besieged that project, calling it copyright infringement, but a legal settlement has yet to be approved.

Asked how Google was protecting the copyright of the books in its new tool, a spokeswoman, Jeannie Hornung, said the publicly available data sets “cannot be reassembled into books.”

Instead, the data sets “contain phrases of up to five words with counts of how often they occurred in each year,” according to a Google blog post. They include Chinese, English, French, German, Russian, and Spanish books.
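
To make that description concrete, here is a minimal sketch of how phrase counts per year could be tallied from a collection of texts. It is not Google's pipeline: the function name `ngram_counts_by_year`, the whitespace tokenization, and the sample texts are all assumptions made for illustration.

```python
from collections import Counter


def ngram_counts_by_year(books, max_n=5):
    """Count every phrase of 1 to max_n words, keyed by (phrase, year).

    books is an iterable of (year, text) pairs. Words are split on
    whitespace and lowercased, a simplification assumed for this sketch.
    """
    counts = Counter()
    for year, text in books:
        words = text.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                phrase = " ".join(words[i:i + n])
                counts[(phrase, year)] += 1
    return counts


# Invented sample texts, not drawn from the real corpus.
toy_books = [
    (1999, "fry the fish and fry the chips"),
    (2000, "bake the bread"),
]
counts = ngram_counts_by_year(toy_books)
print(counts[("fry the", 1999)])
```

Because only short phrases and their yearly totals are kept, the original word order of a full book is not stored, which is consistent with the statement that the data sets cannot be reassembled into books.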

Some scholars, meanwhile, have criticized the value of reading huge quantities of books with computers. In a Chronicle article this year, they warned that cranking words from deeply specific texts like grist through a mill is a recipe for lousy research. Still others have attacked the quality of Google’s data.

Mr. Aiden acknowledged that “people should be really skeptical about this,” but he urged scholars to give the tool a try for themselves.

Marc Parry wrote for The Chronicle about scholars and the work they do. Follow him on Twitter @marcparry.
© 2023 The Chronicle of Higher Education