Erez Lieberman Aiden, a biologist and computer scientist who has made important forays into the humanities, doesn't use language in quite the same way you or I do. He has just published a book, Uncharted: Big Data as a Lens on Human Culture (Riverhead), written with Jean-Baptiste Michel, whom he met when they were graduate students at Harvard. As Aiden and I stroll across the well-tended grounds of Rice University, one of two campuses where he has new jobs, I ask if his publisher has a book tour planned.
He's not sure, but he's intrigued. "'Book tour' is a two-gram I've heard often," he muses. "I've wondered what it involves." Where others might say "that word" or "that phrase," he says "one-gram," "two-gram," or "three-gram." He is energetic and engaging in conversation, but this tic hints that he's operating on a slightly different plane.
You may be more familiar with the term "ngram," which Aiden and Michel helped to make semifamous in 2010, with the publication in Science of "Quantitative Analysis of Culture Using Millions of Digitized Books." They and a dozen other authors showed how trends in word usage could be tracked over time, particularly from 1800 to 2000, by mining the data scanned by the Google Books project. At that point, the company says, it had scanned some 12 percent of documented books, or 15 million (the number has since grown). The ngrams database, which made use of about a third of the Google Books texts, offered a glimpse of "a new science," they wrote excitedly: "culturomics," a play on "genomics."
The paper showed how ngrams could chart the "regularization" of verbs (in England "each year, a population the size of Cambridge adopts 'burned' in lieu of 'burnt'"); the number of words in the language (some one million, more than twice the figure recorded by dictionaries); and the shifting dynamics of fame (the average celebrity born in the middle of the 20th century reached a level of "initial fame" at a younger age, 29, than her early-19th-century peer, 43, and faded more quickly). An "Ngram Viewer," a public website, was launched simultaneously with the paper, allowing users to track the popularity over time of words of their choice ("best of times" versus "worst of times," "zombie" versus "werewolf").
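The counting behind such a chart can be sketched in a few lines of Python. What follows is a minimal illustration, not the code behind the Ngram Viewer: the function names (ngrams, yearly_frequency) and the three-sentence toy corpus are invented for the example, and it assumes text that splits cleanly on spaces.

```python
from collections import Counter

def ngrams(text, n):
    """Return the n-word sequences (n-grams) in a text, lowercased."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def yearly_frequency(corpus, phrase):
    """For each year, the share of same-length n-grams that match the phrase."""
    n = len(phrase.split())
    hits, totals = Counter(), Counter()
    for year, text in corpus:
        grams = ngrams(text, n)
        totals[year] += len(grams)
        hits[year] += grams.count(phrase.lower())
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}

# Hypothetical toy corpus of (year, text) pairs; not real Google Books data.
toy_corpus = [
    (1850, "the candle burnt low and the toast was burnt"),
    (1950, "the candle burned low and the toast was burnt"),
    (2000, "the candle burned low and the toast was burned"),
]

burnt = yearly_frequency(toy_corpus, "burnt")
burned = yearly_frequency(toy_corpus, "burned")
```

On this toy corpus, "burnt" falls from two of nine words in 1850 to none in 2000, while "burned" does the reverse. The real database does the same kind of tallying across millions of books.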
The paper was widely seen as a landmark, even if some humanists grumbled that it was insufficiently attentive to the limitations of the source material. Its importance continues to resonate: In November, in the federal-court decision concluding that Google Books does not infringe on the copyright of authors, Judge Denny Chin cited ngrams as an example of important scholarly work that could not be done without the scanning service. (Crucially, the ngrams database does not allow the full text of scanned books to be reconstructed.) Uncharted, which summarizes the findings that ngrams have made possible and provides a tick-tock account of the research project, makes even bolder claims than the paper did for the future of this kind of research: "Big data is going to change the humanities, transform the social sciences, and renegotiate the relationship between the world of commerce and the ivory tower."
As that language might suggest, Aiden is not someone afraid to tackle grand themes and topics. Our campus walk sends us by a meandering route toward Rice's Center for Theoretical Biological Physics. Its members are fine-tuning a grant proposal for National Science Foundation money to continue work on, among other things, another key Aiden interest: the physical structure of the human genome.
The genome is a long string that's essentially stuffed inside the nucleus of every human cell. If it were stuffed there like a string inside a pocket, it would get hopelessly tangled, and processes like transcription and replication would be impossible. In a 2009 paper in Science, Aiden and co-authors introduced a breakthrough technique for identifying the regions of the convoluted genome that are near one another in three-dimensional space, allowing researchers to infer the string's topography. The authors presented suggestive evidence that the genome formed a "fractal globule," a mathematical shape that's extraordinarily dense yet never tangles. ("Fractal" means it exhibits the same shape at every magnification.)
The jury is out on the fractal-globule thesis, but the genome-analyzing technique, called Hi-C, has spread through the field. A 3D-printed model of a fractal globule will soon go on display in the Smithsonian's National Museum of Natural History.
Hi-C and ngrams are just two peaks in a remarkable series of publications and innovations. At 33, Aiden has had seven papers in Nature and Science. In a precursor to the ngrams paper, he (and Michel and others) explicated the patterns underpinning the regularization of verbs, from Old English to modern English: A verb that is 100 times rarer than another verb regularizes 10 times as fast, which explains why "to have" is likely to remain irregular for centuries, while "burnt" is on its way to becoming toast. He has co-written a paper for Nature that explains mathematically the conditions under which altruism is a sounder strategy for survival than competition is. He also has roughly 40 patents in various stages of approval, including for technology making possible the iShoe, which is designed to sense when its wearer is unsteady and thereby to prevent falls by the elderly.
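The arithmetic in the verb rule is consistent with a square-root relationship: 100 times rarer, 10 times faster. A back-of-the-envelope sketch in Python, assuming that relationship holds at other ratios too (the function name is made up for illustration):

```python
import math

def relative_regularization_speed(rarity_ratio):
    """How many times faster the rarer verb regularizes.

    Assumes speed scales with the square root of how many times
    rarer the verb is, which matches the 100-to-10 example.
    """
    return math.sqrt(rarity_ratio)

speedup = relative_regularization_speed(100)  # the article's example ratio
```

Under that assumption, a verb 10,000 times rarer than another would regularize 100 times as fast.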
His prolificacy inspires envy and frank wonderment among his peers. Cal Newport, an assistant professor of computer science at Georgetown University who writes a blog called Study Hacks, devoted to academic productivity, wrote in 2012 that Aiden had an off-the-charts "impact instinct."
Aiden is less calculating than that might imply, according to Michel, co-author of Uncharted, who now runs a start-up company in Brooklyn, N.Y., called Quantified Labs. "He has virtually no boundaries in his curiosities," Michel says. "He's not scared of asking weird questions, and that's pretty powerful."
Job Dekker, co-director of the program in systems biology at the University of Massachusetts Medical School and a co-author of the fractal-globule paper, describes Aiden's skills this way: "He can really recognize a good problem, quickly learn what kinds of methods are out there that might be useful in solving it, somehow combine those into a new concept, and identify experts in those fields to work with."
The leap that Aiden made in the fractal-globule paper was to move from hypothesizing, one location at a time, which parts of the genome were abutting, to harnessing new gene-sequencing technology, in tandem with creative experimental techniques, to test all the regions at once.
"Culturomics," Aiden says, "is a new type of evidence in the humanities. That's all. It's not the only type, it's not the most important type."
While researchers as a class are infamous for failing to see beyond their narrow subspecialties, Aiden is often described as a Renaissance man. His undergraduate degree, from Princeton University, was in mathematics, but he completed the graduation requirements for philosophy and physics, too. He has a master's in history and has collaborated with avant-garde artists he met at Princeton. After earning a Ph.D. in applied math and health-sciences technology at Harvard, he spent three years in its interdisciplinary Society of Fellows.
But in an environment in which scientists are eyed skeptically by some humanists as would-be scholarly imperialists, can even someone with such catholic interests successfully bridge the two academic cultures?
It's good to be a rising star at a university that still has the money to recruit stars. As of December, Aiden was still overseeing the construction of a 3,100-square-foot lab at the Baylor College of Medicine, where he is an assistant professor of molecular and human genetics. He's decided that one wall, which had a handful of narrow apertures, will now have windows stretching nearly wall to wall, open to sunlight and the canyons and courtyards of Houston's vast Texas Medical Center, a city within a city. "Weren't there supposed to be two more windows on either side?" he asks the workmen as he gives me a tour. For now he's got a temporary lab on one floor of this building, a small room for his researchers, and a third space for an office. That office is spare, with empty Diet Snapples and Diet Cokes on the desk, and a few colorful 3D-printed genome models on the windowsills.
I ask him where his distinctive approach to scholarship, that "impact instinct," comes from.
"I benefit a lot from refusing to work on things I don't find captivating," he says. "It's incredibly time-saving." Earlier we'd touched on the surprising fact that he'd failed a course or two at Yeshivah of Flatbush, his Brooklyn high school. "This is the same reason I didn't love doing my homework in elementary school or high school—I felt, 'This isn't interesting. It isn't a good use of my time.' Somehow I think that trait, which was maladaptive growing up, has in some sense become adaptive."
Jumping from field to field is more or less the opposite of what young scholars are counseled to do, but Aiden says he'd feel "loose and hurtling in the world" if he didn't pursue his idiosyncratic passions: "I don't want to say that what I do is what everybody should do. We work in an ecosystem. No one scientist can do much of anything alone. It's important to have a variety of species. I have this ecological niche—it seems to work well."
Michel says Aiden could publish even more if he didn't drop projects when he got bored. He recalls Aiden's doing some work in 2007 on "bubble computers," devices that use water as circuits and bubbles rather than electrons to communicate binary information. Toward the end of the year, Michel asked him why he didn't write it up. "He told me, 'I'm more interested in what Erez will be doing in 2008 than what Erez did in 2007.'"
For all his accomplishments, Aiden jokes that he is the "uneducated one" in his family. His wife, Aviva Presser Aiden, has a Ph.D. and an M.D. from Harvard and also has appointments at both Rice and Baylor. When they married, rather than hyphenate, they chose a new last name: Aiden means Eden in Hebrew. They have a 3-year-old son, Gabriel Galileo, a 1-year-old daughter, Maayan Amara, and a third child on the way. The children seem to be off to a fast start, intellectually speaking: On a recent Sabbath, Aiden finished reading them Genesis, in biblical Hebrew—Maayan Amara's first time hearing the whole text, Gabriel Galileo's second.
If Aiden is wide-ranging and cosmopolitan in his work, culturally he is much more rooted, which may be one source of the confidence he clearly brims with. His background is Hasidic, and his reverence for his father is evident. During the Nazi period, his father's family lived in Transylvania (then part of Hungary, now part of Romania). Each morning they went into hiding, yet his grandfather emerged every day to pray, wearing a pair of tefillin containing passages from Scriptures. These were handed down to Erez when his father died, during the writing of Uncharted.
One chapter of the book documents, using German-language books (among others), how ngrams can trace the suppression of artists and writers during periods of totalitarian ascendancy. The frequency of their names in the texts—as authors or references—plummets. It's a modest contribution to Holocaust studies. Still, "one of my great regrets was that my dad didn't have a chance to read that chapter," says Aiden.
His father spent his formative years as a refugee and later began rabbinical training in Israel, before joining its air force and attending some college. After moving to New York, in 1962, he founded a company that used high-tech grinding techniques to fabricate surgical tools and other ultraprecision instruments. That an autodidact could make a living in the world by coming up with new ideas clearly left a mark on the son.
At Princeton, Aiden started off taking one more course in his first semester than the typical student, then kept adding to the number each term, until he reached 10 or so. He also started popping in to see the mathematical biologist Martin Nowak, who had an office down the road at the Institute for Advanced Study and who has said the visits were "almost annoying"—except that "any problem I was considering, he could give me good advice about how to solve it." (Nowak later moved to Harvard and advised Aiden there.)
"I was on this jag of thinking you could understand everything from first principles," building only on mathematical and philosophical premises, Aiden recalls of his time at Princeton. Alas, he left with the dawning recognition that "most things in the world did not have the quality of a logic puzzle."
To remedy the gaps in his knowledge, he took a detour, to Yeshiva University, where he earned that master's in history, writing a thesis about the 17th-century rabbi Leone de Modena, who had criticized certain rabbinic practices. Aiden argued that this criticism derived from a desire to reconcile Jews with Christians—to find a universal church. As his father had, Aiden also began rabbinical training. I ask if that was really a possible life course for him.
"Certainly," he replies, although he also insists, "I have no conclusions on religious matters." In trying to clarify his religious views, and thinking of the many prominent scientists who have become well-known atheists, I make the mistake of asking about his "stance" vis-à-vis religion. That causes him to bristle. "Having a 'stance' on it would be like having a stance on your sibling," he says. "You have a relationship, I would put it that way. It's a relationship rather than something I can have a 'stance' about. It's a big part of my life."
The origins of the Ngram Viewer lay in Aiden's and Michel's curiosity about what they describe as a "childlike question": Why do we say "drove" and not "drived"? And why are the 10 most common English verbs irregular, though fewer than 3 percent of all verbs are? The two scholars scented a computational challenge. They recruited undergraduates to help them identify verbs in Old English grammar books, and to trace those verbs down the centuries, a slow, laborious task. By the end of the project, they noticed that those hyper-obscure grammar books were turning up online, scanned into Google Books.
Aviva Aiden had recently won a prize from Google, for her work in computational biology. Because Aiden was joining her for the ceremony, he emailed Google's head of research, Peter Norvig, to talk about the potential of mining the book data. The company was already fending off lawsuits from the Authors Guild and other groups over its scanning of texts. But Aiden and Michel convinced Google that a "shadow" data set, one that tracked word frequency but could not be reverse-engineered to recreate texts, would not pose an undue risk.
With Uncharted, Aiden says, he wanted to get across "the nitty-gritty experience" of doing a project like ngrams. Messy metadata, the information used to date and identify texts, was one problem. The authors learned that Google Books was dating all issues of many periodicals according to the date of the first issue; an algorithm called Serial Killer rooted those out. Another problem arose when a research assistant's Google internship neared its end, potentially denying the team access to the data. Although backing the project, Google was uneasy about letting the data out of the Googleplex—that is, letting Aiden and Michel have it outright. So they recruited Nowak and the Harvard psychologist Steven Pinker (both of whom were to be co-authors of the culturomics paper) to go to Mountain View, Calif., and lend gravitas to their appeal to the company.
"If we couldn't get them to release it to us, we knew we would not be in a good position to get them to release it to the world on a website," Aiden says.
Aiden and Michel offered Harvard first shot at hosting the Ngram Viewer, but Harvard said no, for legal reasons. Google agreed to do so, although Aiden thought there was a 50-50 chance that a court would issue an injunction against the project, given the unsettled state of copyright law.
Uncharted begins with a finding that contradicts a claim repeated by many historians: that "the United States is" became more common than "the United States are" as a direct result of the Civil War. In fact, the usage shift occurred gradually, and the singular did not supplant the plural until 1880. The book includes the greatest hits of the culturomics paper (the shrinking half-life of fame over time, the vast number of words that elude dictionaries, the growth in word totals since 1950) and also features a selection of striking ngram word-frequency comparisons of the sort that swept the web in the days after the paper was published. Santa versus Satan? The jolly pagan surpassed the fallen angel in the ngram corpus in 1882. "Science" overtook "religion" in 1934. "Coffee" overtook "tea" in 1968.
Yet culturomics continues to be controversial. "The reason I don't trust these guys is that there's a crassness to the attitude that technologists, or 'scientistic' types, take to this material," says Geoffrey Nunberg, a linguist who teaches in the University of California at Berkeley's School of Information. "Pinker's a perfect example of this—the suggestion that we're not making progress in the humanities, that we need to put humanities on the same footing as the sciences, we need to create testable hypotheses."
In their eagerness to embrace the new tools, humanists are misusing them, Nunberg thinks. In an article in the summer-2013 issue of Social Science History, Marc Egnal, a historian at York University, in Toronto, used ngrams to confirm that (as many scholars have argued) novelists grew less sentimental and genteel over the course of the 19th century. He pointed to the declining frequency of words such as "faithful," "church," "sinful," and "pious." But that approach lacks sophistication, Nunberg argues. It fails to account sufficiently for, among other things, the fact that the composition of the ngram library changes over time, with collections of sermons constituting a greater part of the library earlier than later in the century. The decline is potentially just a measurement of the changing collection habits of research libraries.
Nunberg hardly rejects the ngram approach wholesale. He has made use of ngrams in his own book Ascent of the A-Word: Assholism, the First Sixty Years, which charts the rise of the derogatory noun, alongside clinical-sounding put-downs like "narcissist," as it replaced more morally freighted language.
Some of the frustration about ngrams is directed at Google. Mark Davies, a linguist at Brigham Young University, says he has encountered few scholars in his field who use the Ngram Viewer. He thinks that is partly because there are so few ways to slice up the data using the Ngram Viewer: no way to find words near other words, for example, or to search for synonyms. And while Google added part-of-speech tagging in 2012, a much-demanded feature, there's no way for scholars to check that the tags are correct. "They took this incredible data and put it behind such a poor, poor interface," he says, "that it's really difficult, if not impossible, to get meaningful data on meaningful questions."
Google's failure to cater to the desires of linguists "is no slam on Erez or Jean-Baptiste," Davies says. "They did a great job." And Google has made its ngram database downloadable; Davies is among those working to provide a more scholar-friendly interface.
Some scholars think that ngrams and other data-mining approaches will win acceptance when scholars make use, in a single paper, of both big data and traditional textual analysis—which Aiden and Michel do not do. Ryan Cordell, an assistant professor of English at Northeastern University, calls this "zoomable reading." In a recent project for Digital Humanities Quarterly, he used text-mining of various databases to identify the 19th-century newspapers that had reprinted a story by Nathaniel Hawthorne, "The Celestial Railroad," once considered canonical but now all but forgotten. He then showed why the story, involving notions of piety more conventional than the themes usually associated with Hawthorne, would have appealed to readers, especially religious ones.
Aiden is working hardest on biology now, but there's still an active culturomics wing of his lab at Baylor, and its proudest recent offering is Bookworm. Bookworm makes it easy to turn any collection of texts into a richly searchable database; you can visualize trends, but with many more ways to slice the data than the Ngram Viewer, which is limited by copyright constraints, allows. Examples the lab has created make use of Open Library, a collection of public-domain books; arXiv, an open-access site for scientists; and a Library of Congress database of historical newspapers called Chronicling America. It's an open-source project, so anyone with even minimal programming ability can create a Bookworm out of any collection of texts.
It drives Aiden up a wall that he and his co-author have been caricatured as suggesting that ngrams can somehow supplant traditional literary scholarship. "That's nuts. We've never said that. It's transparently ridiculous. I'm sure I've said this over a thousand times on the record, but there still remains this meme that the culturomics guys think this replaces reading books."
"Culturomics is a new type of evidence in the humanities," he says. "That's all. It's not the only type, it's not the most important type, it's not the 15th-most important type—it's a type of evidence."
Yet that humble framing does seem to be in tension with some of the more sweeping passages in Uncharted. Toward the end of the book, Aiden and Michel mention the dream, articulated by people like the 19th-century philosopher Auguste Comte and the science-fiction writer Isaac Asimov, that fundamental laws of human behavior might be uncovered, allowing researchers to extrapolate into the future. Thanks to innovations like ngrams, the two authors write, "maybe, just maybe, a predictive science of history is possible." Hubristic? Perhaps. But if this were just one more small, useful tool, could it have held the interest of Erez Aiden?