The Chronicle Review

Counting on Google Books

Michael Morgenstern for The Chronicle Review

December 16, 2010

Humanities scholars may someday count as a watershed the paper that appeared on Wednesday in Science, titled "Quantitative Analysis of Culture Using Millions of Digitized Books." But they'll have certain things to get past before they can appreciate that.

The paper describes some examples of quantitative analysis performed on what is by far the largest corpus ever assembled for humanities and social-science research. Culled from Google Books, it contains more than five million books published between 1800 and 2000—at a rough estimate, 4 percent of all books ever published—of which two-thirds are in English and the others distributed among Chinese, French, German, Hebrew, Russian, and Spanish. The English corpus alone contains some 360 billion words, a size that permits analyses on a scale not possible with collections like the Corpus of Historical American English, at Brigham Young University, which tops out at a mere 410 million words.

Not everyone will find these statistics bracing. A lot of scholars have reservations about studying literature en bloc, mindful of Seneca's admonition that distrahit animum librorum multitudo, or loosely, "Too many books spoil the prof." And they're apprehensive about the prospect of turning literary scholarship into an engineering problem.

The framing of the Science paper will aggravate those qualms. The authors of the paper claim that the quantitative data gathered from the corpus are the bones that can be assembled into "the skeleton of a new science." They call the new field "culturomics," defining it as "the application of high-throughput data collection and analysis to the study of human culture," which "extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities."

That's culturomics with a long o, with the implication that the object of study is the "culturome," presumably the mass of structured information that characterizes a culture. The point of comparison might be biological models of evolution or simply the idea that culture, like the genome, can be "cracked" via massive distributed (that is, "high-throughput") processing.

The inspiration for the Science paper came from two young Harvard researchers, Jean-Baptiste Michel and Erez Lieberman-Aiden, with backgrounds in genomics and mathematics. And almost all of the paper's dozen listed authors (11 individuals plus "the Google Books team") are mathematicians, scientists, or engineers—some at Google, the rest mostly at Harvard or the Massachusetts Institute of Technology. The very fact that the paper was submitted to Science suggests that the authors are more interested in winning the ear of their scientific colleagues than in reaching the scholars who will be the primary beneficiaries of this new approach. Having glimpsed a new domain from a peak in Darien, the authors' first thought was to call home.

It's hard to imagine anything likelier to raise the hackles of humanists or cultural historians, who aren't disposed to think of their fields on the model, say, of pre-Mendelian biology. But there's nothing in the research that compels this understanding of "culturomics." Indeed, a close reading of the paper clarifies the limits of quantitative corpus investigations as well as their power. Where we're going we'll still need readers.

Humanists and social scientists have been doing quantitative corpus research for a long time, in fields like linguistics, political science, and intellectual history. But the Google project does initiate a new phase. Of course there is the jump in scale, not just in the size of the corpus but also in the staggering processing power that the researchers can throw at it. And what it takes Google's server farms to do right now, anyone with a home computer will be able to do tomorrow. Like most everything else, a terabyte isn't what it used to be. You can already fit everything that's ever been written in the glove compartment of your Hyundai; within a few years it will fit in your eyeglass frames.

Scholars can't download the entire corpus right now, but the impediments are legal and commercial rather than technological. (Google could make available a corpus of all the public-domain works published through 1922 without raising any copyright issues, but it has decided not to do that.) In the meantime, scholars have access to the corpus via the Web sites. At this point, they are confined to examining the "trajectories" of individual words or strings up to five words long ("we don't need no badges"), in the form of a graph that shows the relative frequency of a word over some period between 1800 and 2000, or that compares the frequencies of several words. (Scholars can also download a visualization tool and the full set of trajectories, but not the texts they're drawn from.)
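In corpus terms, a "trajectory" is simply an n-gram's count in each year divided by the total number of n-grams published that year. The toy sketch below illustrates the computation; the counts and yearly totals are invented for illustration, though the real downloadable data do come as (ngram, year, count) records.

```python
# Toy sketch of an n-gram "trajectory": per-year relative frequency.
# All numbers below are invented for illustration.
from collections import defaultdict

records = [
    ("propaganda", 1900, 12),
    ("propaganda", 1920, 480),
    ("slide rule", 1900, 30),
]
yearly_totals = {1900: 1_000_000, 1920: 1_200_000}  # invented corpus sizes

def trajectory(ngram, records, yearly_totals):
    """Return {year: relative frequency} for a single n-gram."""
    counts = defaultdict(int)
    for gram, year, count in records:
        if gram == ngram:
            counts[year] += count
    return {year: counts[year] / yearly_totals[year] for year in sorted(counts)}
```

Plotting these per-year frequencies for one or more words side by side is, in essence, all the public graphing interface does.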

That leaves out a lot, compared with what you can do with other corpora. As of now, for example, you can't ask for a list of the words that follow the adjective "traditional" for each decade from 1900 to 2000 in order of descending frequency, or restrict a search for "bronzino" to paragraphs that contain "fish" and don't contain "painting." Some of those capabilities will probably be available soon, though users won't be able to replicate many of the computationally heavy-duty exercises that the researchers report in the paper, and linguists won't really be happy until they can download the whole corpus and have their way with it.

And while the Harvard researchers have purged the research corpus of a large proportion of the metadata errors that have plagued Google Books, there are still a fair number of misdated works, and there's no way to restrict a query by genre or topic. You can ask the system to plot the trajectory of "dear reader" in books published in Britain during the 19th century, but you can't limit the search to novels.

But in the end, the most important consequence of the Science paper, and of allowing public access to the data, is that it puts "culturomics" into conversational play. Whatever misgivings scholars may have about the larger enterprise, the data will be a lot of fun to play around with. And for some—especially students, I imagine—it will be a kind of gateway drug that leads to more-serious involvement in quantitative research.

Short of reassigning humanities and social-science departments to the engineering school, how might all of this change disciplines? The exercises in the Science paper are meant to suggest the range of possibilities. A couple of these fit neatly into continuing scholarship. In one exercise, the researchers computed the rates at which irregular English verbs became regular over the past two centuries. The patterns that emerged will be grist for theories of language evolution. But quantitative methods are already widely accepted in the field, and in any case, morphological change is a "cultural" phenomenon only by courtesy of the dean of humanities. Another ingenious study uses quantitative methods to detect the suppression of the names of artists and intellectuals in books published in Nazi Germany, the Stalinist Soviet Union, and contemporary China. Those results could be published tomorrow in a history journal, but precisely because they're consistent with other kinds of data that historians are already using, they won't shift any disciplinary paradigms.
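The verb exercise reduces to a simple ratio: for a verb with competing past-tense forms, the regularization rate in a period is the regular form's share of the two forms' combined frequency. A minimal sketch, with invented counts standing in for the per-decade corpus frequencies the paper actually used:

```python
# Minimal sketch of a regularization rate; counts are invented.
def regularization_rate(regular_count, irregular_count):
    """Fraction of past-tense uses taking the regular form."""
    total = regular_count + irregular_count
    return regular_count / total if total else 0.0

early = regularization_rate(300, 700)  # e.g. "burned" vs. "burnt", invented earlier decade
late = regularization_rate(900, 100)   # invented later decade
```

Tracking that ratio decade by decade for each verb gives the regularization curves the researchers report.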

The more interesting exercises are also, in a way, the most problematic. In one exercise, the authors investigate the evolution of fame, as measured by the relative frequency of mentions of people's names. They began with the 740,000 people with entries in Wikipedia and sorted them by birth date, picking the 50 most frequently mentioned names from each birth year (so that the 1882 cohort contained Felix Frankfurter and Virginia Woolf, and so on). Next they plotted the median frequency of mention for each cohort over time and looked for historical tendencies. It turns out that people become famous more quickly and reach a greater maximum fame now than they did 100 years ago, but that their fame dies out more rapidly. You can take that result as a quantitative demonstration of the rise of what Leo Braudy called "disposable fame" in his book The Frenzy of Renown, which the authors cite. And the technique could be a powerful source of data for the burgeoning field of celebrity studies, as it's designated in the title of a new journal from Routledge.
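The cohort procedure itself is straightforward to sketch: group names by birth year, keep the k most-mentioned names in the cohort, and take the cohort's median frequency in each year. The names and numbers below are invented; in the paper, the names came from Wikipedia and the frequencies from the corpus.

```python
# Hedged sketch of the fame-cohort computation; all data invented.
from statistics import median

def cohort_trajectory(people, birth_year, k, years):
    """people maps name -> (birth_year, {year: frequency of mention})."""
    cohort = [traj for _, (by, traj) in people.items() if by == birth_year]
    # keep the k names with the greatest total frequency of mention
    top = sorted(cohort, key=lambda t: sum(t.values()), reverse=True)[:k]
    return {y: median(t.get(y, 0.0) for t in top) for y in years}

people = {
    "A": (1882, {1900: 0.1, 1950: 0.4}),  # invented figures
    "B": (1882, {1900: 0.3, 1950: 0.2}),
    "C": (1871, {1900: 0.9, 1950: 0.9}),
}
```

Comparing the resulting median curves across birth cohorts is what lets the authors say that fame now arrives faster, peaks higher, and fades sooner.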

But the method isn't up to distinguishing among the varieties of fame and eminence that Braudy and others have carved out. And there are obvious limits to equating fame with mere frequency of mention. At one point, for example, the authors observe that "'Galileo', 'Darwin', and 'Einstein' may be well-known scientists, but 'Freud' is more deeply ingrained in our collective subconscious." But it defies belief that Freud is vastly better known than Darwin among the authors of books in a corpus that was drawn from the collections of research libraries. We simply mention Freud more often. Maybe that's because we refer to Darwin only when we're talking about evolution, while we're apt to bring up Freud when we're talking about ourselves. Or maybe there's some other explanation. But the data don't wear their cultural significance on their sleeves; they need cultural historians to speak for them.

I have a friend, a gifted amateur musician and computer scientist, who was involved in electronic music in its early days. Inevitably, within a few years, the field was taken over by composers. That happened partly because new interfaces made the technology more accessible, but also because a command of the subject matter always trumps mere technical expertise. As my friend put it, "It's a lot easier to turn an artist into a geek than to turn a geek into an artist."

In the same way, we'll know that the program of quantitative corpus research is successful when the engineers have stepped back as the techniques are absorbed into the academy, sometimes as a method, sometimes just as a background of operating assumptions. That was the fate of 19th-century philology—the study of "La Vie des Mots" (The Life of Words) in the title of a book of the period by Arsène Darmesteter. Quantitative corpus studies are destined to play the same role, though they imply a different understanding of what the life of words is all about. We really don't even need a name like "culturomics," or any new name at all: this is just e-philology. (Or "the newer philology," since "the new philology" is taken.)

One salutary effect of looking at word trajectories is that they dispel some of the unreflective philological assumptions that color the way humanists and social scientists tend to think about words. Take the obsession with origins, in particular the genealogical model of vocabulary change that's implicit in the structure of major dictionaries. Scholars speak of new words or word senses "entering the language" at a specific date, with the implication that they bring new concepts along with them. But decades or even centuries can pass before a "new" word gains a purchase in the language. "Propaganda" had something like its modern sense by Carlyle's time, in the 19th century, but it was a recondite item; only with World War I did it enter "into the vocabulary of peasants and ditch diggers," as one contemporary put it. Between 1914 and 1950, its frequency in the print news media increased tenfold, only to fall back significantly by 2000. It isn't that people have lost interest in the thing the word denotes, as you might conclude from the falling frequency of "slide rule" or "Dinah Shore." But we think of political discourse differently now (the decline of "propaganda" coincides with the rise of "Orwellian," as it happens).

Then, too, comparing word trajectories enables you to pin down the emergence of new vocabularies that are the harbingers of cultural regime change—the signs, as Quentin Skinner put it, that "society has entered into the self-conscious possession of a new concept." The Oxford English Dictionary documents the first appearance of "lifestyle" in 1915, but it wasn't until the late 1960s that the word became commonplace (in 1967 it appeared in the Chicago Tribune just 29 times; by 1972 the figure was 1,571). That coincided with a sharp increase in the use of "demographic," which first appeared in 1882 but became 50 times as frequent from the 1950s to the 1970s, spinning off the noun "demographics" in the process—all part of an emerging vocabulary (with the appearance of terms like "upscale" and "trendy," and of new senses for "blue collar" and "preppie") that reflected the consumerization of class. In the age before corpora, there was no way to get a handle on this phenomenon. (It's a fair bet that Raymond Williams's influential 1976 book, Keywords: A Vocabulary of Culture and Society, would have looked very different if he had had access to the Google Books corpus and not just to the OED.)
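The comparison this paragraph makes can be operationalized as a growth factor: the ratio of a word's average frequency in a later period to its average frequency in an earlier one, a crude flag for an emerging vocabulary. The trajectory below is invented for illustration, not drawn from the corpus.

```python
# Sketch of a frequency growth factor between two periods; data invented.
def growth_factor(traj, early_years, late_years):
    """Ratio of mean frequency in late_years to mean frequency in early_years."""
    mean = lambda ys: sum(traj.get(y, 0.0) for y in ys) / len(ys)
    early, late = mean(early_years), mean(late_years)
    return late / early if early else float("inf")

# an invented "lifestyle"-like trajectory: rare in the 1950s, common by 1970
lifestyle = {1955: 1e-7, 1956: 1e-7, 1970: 5e-6, 1971: 5e-6}
factor = growth_factor(lifestyle, [1955, 1956], [1970, 1971])
```

Scanning a whole vocabulary for words whose growth factors spike in the same window is one way to surface clusters like "lifestyle," "demographic," and "upscale" automatically.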

The most obvious—though not the only—application of these techniques is in analyzing broad swaths of cultural and literary production, what Franco Moretti, of Stanford, calls "distant reading," which examines hundreds or even thousands of texts at a swoop. But there's nothing in the Science paper that threatens the importance of close reading, New Historicist anecdotalism, or any of the other more ruminative forms of scholarship. On the contrary, there needn't even be a sharp division between the two approaches. These new results are very often just intriguing quantitative nuggets that call out for narrative explication. Scientists like to say that "data" is not the plural of "anecdote," but sometimes "anecdotes" can be the plural of "data." And, like other anecdotes, they don't compel any single interpretation, and sometimes even bring us back to the texts they were abstracted from.

Consider an interesting study of the titles of 19th-century books by the historians Dan Cohen and Fred Gibbs, of George Mason University, who also worked with the Google Books corpus. What does it signify that the words "hope" and "happiness" became less frequent in book titles in the second half of that century? To Cohen and Gibbs, it suggests that there was an undercurrent of depression during that period. But a reader of Schopenhauer might conclude that all those earlier mentions of happiness were the unmistakable signs of misery and abjection. To prove the case one way or the other, one might be driven to, well, read some of the books.

Some people worry that the effect of these quantitative studies will be to trivialize scholarship. In a news article that appeared in The Chronicle last spring about Moretti's research, Katie Trumpener, a professor of comparative literature at Yale, voiced her concerns about the quantitative turn in literary studies. It's all well and good when it's done by an original thinker like Moretti, she said, but what happens when it's taken up by his "dullard" descendants? "If the whole field did that, that would be a disaster," with everyone producing insignificant numbers and "jumped-up claims about what they mean."

It's unlikely that "the whole field" of literary studies—or any other field—will take up these methods, though the data will probably figure in the literature the way observations about origins and etymology do now. But I think Trumpener is quite right to predict that second-rate scholars will use the Google Books corpus to churn out gigabytes of uninformative graphs and insignificant conclusions. But it isn't as if those scholars would be doing more valuable work if they were approaching literature from some other point of view.

This should reassure humanists about the immutably nonscientific status of their fields. Theories of what makes science science come and go, but one constant is that it proceeds by the aggregation of increments great and small, so that even the dullards have something to contribute. As William Whewell, who coined the word "scientist," put it, "Nothing which was done was useless or unessential." Humanists produce reams of work that is precisely that: useless because it's merely adequate. And the humanities resist the standardizations of method that make possible the structured collaborations of science, with the inevitable loss of individual voice. Whatever precedents yesterday's article in Science may establish for the humanities, the 12-author paper won't be one of them.

Geoffrey Nunberg, a linguist, is an adjunct full professor in the School of Information at the University of California at Berkeley.