Project Aims to Build Online Hub for Archival Materials

Andrew Shurtleff for The Chronicle

The goal of the project is to develop methods that will help researchers find additional materials relevant to a subject, such as the papers of people who were important.
May 13, 2012

In death, as in life, people don't always leave their papers in order. Letters, manuscripts, and other pieces of evidence wind up scattered among different archives, leading researchers on a paper chase as they try to hunt down what they need for their work.

"It can be hugely frustrating—especially when you make a journey cross-country to an archive, and then discover the piece you really wanted must be somewhere else (or, God forbid, rotting away in a landfill)," says Robert Townsend, deputy director of the American Historical Association, in an e-mail interview. Chasing after distributed historical records is so common that "any historian who has not suffered from that problem can't be working very hard," he wrote.

The Internet has made the hunt easier, as more archives post finding aids for their collections online. "Scholars have at least gotten to the point where they can search over the Internet for these materials," says Daniel V. Pitti, the associate director of the Institute for Advanced Technology in the Humanities, or IATH, at the University of Virginia. But what he calls "hunting and gathering" persists for document-seekers, who "a priori have to have some idea, some hunch, of where to go, because the access systems are distinct and not integrated any way."

Now imagine a central clearinghouse for those records, an online hub researchers could consult to find archival materials.

That vision drives a project of Mr. Pitti's called the Social Networks and Archival Context Project, or SNAC. It's a collaboration between researchers and developers at IATH, the University of California at Berkeley's School of Information, and the California Digital Library. The project recently finished its pilot stage with the help of a grant from the National Endowment for the Humanities. Another grant, from the Andrew W. Mellon Foundation, will support the project through another two years as it adds millions more records and begins beta testing with researchers.

Some people have already found the prototype, which is up and running although not yet widely promoted. The site allows visitors to search for the names of individuals, corporate entities, or families to find "archival context records" for them.

"So if I'm interested in a particular person," Mr. Pitti says, "I can find where all the records are that would be required to understand them." For instance, a search for Robert Oppenheimer turns up a link to a collection of the physicist's papers housed at the Library of Congress, plus links to other collections in which he is referenced, a biographical timeline, and a list of occupations and subjects related to his life and work.

A researcher can explore a person's social and cultural environment with SNAC's radial-graph feature. It creates a web, which can be manipulated, of a subject's connections as revealed in archival records. The radial graph of Oppenheimer's network, for instance, includes George Kennan, Linus Pauling, Bertrand Russell, and Albert Schweitzer, among many other names represented as nodes on the graph.

Not yet fully developed, the radial-graph feature supports one of the project's main goals: to visualize the social networks within which archival records were created. "What you're trying to do is put together the puzzle, the fabric of someone's life, the people that influenced them and the people they influenced," Mr. Pitti says. "One could certainly, in an analog context, piece this together, but it would take years and years of work. What we're demonstrating is that we can go out there and gather all that information and present it to you, which would liberate scholars." Connecting archival data can reveal patterns of association hidden in disparate collections.

Data Quality Important

To work well, SNAC requires good data. Its first phase drew on thousands of finding aids—encoded with a standard known as Encoded Archival Description, or EAD—from the Library of Congress, the Northwest Digital Archives, the Online Archive of California, and Virginia Heritage. A newer standard for encoding archival information, referred to as EAC-CPF, for Encoded Archival Context-Corporate Bodies, Persons, and Families, was then applied to those records, making them easier to find and connect.

Archives are idiosyncratic, and it's not always easy to tell whether a name refers to a particular individual or to different people with identical or similar names. One of Mr. Pitti's main collaborators is Ray R. Larson, a professor in the School of Information at the University of California at Berkeley. He concentrates on what Mr. Pitti calls the "matching and merging" required to winnow out duplicate names, find variants of the same name, and so on. To do that Mr. Larson has tested several approaches, including machine learning, in which a computer is programmed to recognize, for example, common variations in spelling.
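The article does not describe the actual matching techniques Mr. Larson tested. As a rough illustration of the general idea only, the sketch below groups likely variants of a name by normalizing each one and comparing the results with a simple string-similarity score from Python's standard library. The function names, the sample names, and the 0.85 threshold are all assumptions made for this example, not part of SNAC.

```python
import re
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and sort the name's tokens so that
    'Oppenheimer, J. Robert' and 'J. Robert Oppenheimer' compare equal."""
    tokens = re.findall(r"[a-z]+", name.lower())
    return " ".join(sorted(tokens))


def similarity(a: str, b: str) -> float:
    """Similarity score between 0 and 1 for two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def group_variants(names, threshold=0.85):
    """Greedily place each name in the first group whose first member
    is similar enough; otherwise start a new group.
    The threshold is an illustrative assumption, not a tuned value."""
    groups = []
    for name in names:
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Hypothetical index entries as they might appear in different finding aids.
names = [
    "Oppenheimer, J. Robert",
    "J. Robert Oppenheimer",
    "Robert Oppenheimer",
    "Pauling, Linus",
    "Linus Pauling",
    "Kennan, George",
]
groups = group_variants(names)
```

A score like this cannot tell apart two different people who share a name, which is the harder half of the problem described above. A real system would presumably also weigh dates, occupations, and other context from the records.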

The job is about to get much tougher, though, because SNAC is about to get much bigger. As part of the second phase of the project, supported by the Mellon grant, 13 state and regional archival consortia and more than 35 university and national repositories in the United States, Britain, and France will contribute records. The British Library "is giving me 300,000 names associated with their manuscript collections," going back to before the Christian era, says Mr. Pitti.

The project will also ingest as many as 2 million standardized bibliographic records, in the widely used MARC format, from the online OCLC collaboration in which libraries exchange research and cataloging information. OCLC has its own centralized archival search function, called ArchiveGrid; Mr. Pitti describes it as complementary to SNAC. Unlike SNAC, though, "ArchiveGrid does not foreground the biographical-historical data, nor does it reveal the social networks that interrelate the archival resources," he says.

Researchers want to be able to make those connections, according to Rachael Hu, a user-experience design manager at the California Digital Library. Ms. Hu is part of the team building the SNAC prototype, drawing in part on the library's work on the Online Archive of California. "One of the things we'd been hearing from users was the need to browse and to find related collections," Ms. Hu says.

The team is trying to meet that need with SNAC. One thing the new EAC-CPF standard "does really well is provide connections to this wealth of material that's out there," she says. If SNAC can demonstrate on a large scale that the approach works well, the standard might be adopted widely by archives.

A successful SNAC might also become a building block for a national cooperative dedicated to ensuring authoritative archival records. In late May, Mr. Pitti and his collaborators will meet at the National Archives and Records Administration in Washington to talk about that. They'll join a group of librarians, scholars, grant makers, and representatives from national agencies with a stake in archival records, including the Library of Congress, the Smithsonian Institution, the Institute of Museum and Library Services, the National Endowment for the Humanities, the National Science Foundation, and the National Park Service. The meeting will try to build consensus on the idea of establishing a cooperative "national archival authorities infrastructure."

It's even possible to imagine that the result of this work, depending on what shape it takes, might one day dovetail with the proposed Digital Public Library of America. It could be "a natural fit," says Mr. Larson of UC-Berkeley. These days, libraries and archives "are seeing the advantage of pooling and sharing information rather than doing their own little thing."