Patrick J. Curran struggles with the problem when studying alcoholism in families. Quynh C. Nguyen sees it when analyzing housing-voucher programs. And the Nobel laureate Harold E. Varmus encounters it while developing genomic databases for cancer patients.
Their trouble isn’t with sharing their data — all three professors are eager participants in the open-data revolution.
Instead, the problem is confidently sharing and interpreting data — huge amounts of it — with relevance and accuracy.
As they and other scientists embrace sharing, they’re finding that computer systems are quite good at storing and easing access to the enormous quantities of information they generate. But comparing and synthesizing all that data, in differing formats and styles and methods, requires human skill and judgment. And even the best aren’t sure how to do it, raising questions of whether the nationwide rush toward open data will really mean a momentous revolution in scientific progress or just a whole new level of gnarly reproducibility issues.
Mr. Curran is a professor of psychology at the University of North Carolina at Chapel Hill who studies the effects of alcoholic parents on their children. He combines findings from multiple studies and sees a challenge lurking in the varied scientific meanings and assessments that professional colleagues apply to terms such as “anxiety” and “depression.”
“The thing that keeps me up at night,” he said, “is, Am I making a substantive theoretical conclusion that is based on some artifact of how we scored the scale?”
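A toy illustration of that worry, using invented numbers rather than any of Mr. Curran’s actual procedures: two studies that measure depression on instruments with different score ranges will produce a spurious group difference if their raw scores are simply pooled. Rescaling to a common metric, such as percent of maximum possible score, is one standard workaround.

```python
# Invented example -- not Mr. Curran's actual method. Study A scores
# depression on a hypothetical 0-63 instrument; Study B uses a 0-27 one.
study_a = [30, 42, 25, 38]   # raw scores, 0-63 scale
study_b = [15, 22, 11, 19]   # raw scores, 0-27 scale

# Naive pooling: Study B looks healthier purely because its scale
# tops out lower -- an artifact of scoring, not a real difference.
naive_mean = sum(study_a + study_b) / (len(study_a) + len(study_b))
print(f"naive pooled mean: {naive_mean:.1f}")

# Rescaling each score to percent of maximum possible (POMP) puts
# both instruments on a common 0-100 metric before pooling.
pomp = [100 * s / 63 for s in study_a] + [100 * s / 27 for s in study_b]
print(f"POMP pooled mean: {sum(pomp) / len(pomp):.1f}")
```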
Such questions represent a much-less-discussed aspect of the push for open science: Some researchers believe that the intricacies of translating and synchronizing data accurately are getting too little attention, even though reproducibility is already a major struggle. And those details will only grow more important as scientists begin writing the code for a future in which computers routinely extract answers from data piles far too big for any human to handle.
Dr. Varmus, a university professor of medicine at Weill Cornell Medical College and former director of the National Institutes of Health, is one of the world’s best-known cancer experts. His concerns over data synchronization include how to describe, in terms a computer can process, the multitude of ways that patients develop tumors and respond to drugs, radiation, or other therapies.
“It becomes a very complicated business that has to be settled, in my view, at a very early stage of having accepted terms for what we do to patients and how they respond,” Dr. Varmus told an NIH conference on open data this month.
Accuracy and Precision
Overall, the benefits of computerizing data are expected to vastly outweigh the worries, in virtually all academic fields. In medicine, it could mean defeating devastating diseases by gleaning crucial insights from the thousands of costly clinical trials that have already been conducted, rather than spending many years and millions of dollars on new ones.
Before those benefits can be realized, however, scientists face the daunting task of meshing the many differences in the questions, procedures, notational styles, and measurement units seen in vast data collections.
Ms. Nguyen, an assistant professor of health promotion and education at the University of Utah, uses social-media data to assess the health of urban neighborhoods and the success of government policies such as housing vouchers. The accuracy-related challenges, she said, include handling the differences in neighborhood boundaries used by various public agencies. Some researchers accept broad overlaps in residential identifications for the sake of convenience, but that comes at a cost: “You definitely lose accuracy and precision that way,” she said.
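One common way to handle mismatched boundaries is an area-weighted crosswalk, which splits each source unit’s counts across the target units it overlaps. The sketch below uses invented units and overlap fractions; note that area weighting assumes residents are spread evenly within each unit, which is precisely the loss of precision Ms. Nguyen describes.

```python
# Hypothetical sketch: reallocating counts from one agency's
# neighborhood units ("tracts") to another's ("zips") using an
# area-weighted crosswalk. Units and fractions are invented.
crosswalk = {
    ("tract_1", "zip_A"): 0.7,   # 70% of tract_1's area lies in zip_A
    ("tract_1", "zip_B"): 0.3,
    ("tract_2", "zip_B"): 1.0,
}
voucher_counts = {"tract_1": 120, "tract_2": 80}

target_counts = {}
for (src, dst), frac in crosswalk.items():
    target_counts[dst] = target_counts.get(dst, 0) + voucher_counts[src] * frac

print(target_counts)  # {'zip_A': 84.0, 'zip_B': 116.0}
```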
Such case-by-case assessments are one issue. Potentially bigger decisions await as notational styles and conversion formulas for each academic field develop and harden. Then, some scientists fear, they may be in the position of using statistical shortcuts that become automated and amplified during the process of combining databases.
For his work on alcoholism in families, Mr. Curran has been using three different data sets housed at Arizona State University, the University of Michigan, and the University of Missouri. All three are aimed at studying children with and without an alcoholic parent, but involve different developmental periods, different measures of core behaviors, and widely differing community types — suburban Phoenix, rural areas around Columbia, Mo., and urban Detroit and Lansing.
It’s a major job just to consider the immediate challenges involved in synchronizing such data, Mr. Curran said. “I hadn’t really thought about the impact, especially the longer-term impact, of how laying down some of these common definitions and terminologies sets up expectations and directions for research” in the future, he said.
That same process is taking place across many academic fields, said Brian A. Nosek, a co-founder and director of the nonprofit Center for Open Science. A single set of field-specific data-conversion standards probably won’t emerge for many academic disciplines until there’s enough shared data in open formats to make such standards absolutely necessary, Mr. Nosek said.
Major funding agencies such as the NIH and the National Science Foundation have been financing work to create the standards and the conversion systems for databases that already exist. That process is fairly advanced in some fields, such as genomics, where the variables are relatively discrete. But a lot more progress is needed in the social sciences, where terms of reference tend to be highly subjective.
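At its simplest, such a conversion standard is a shared codebook that maps each study’s local variable names onto agreed common terms. A minimal sketch, with invented names; as Mr. Curran’s scale problem shows, renaming columns is only the first step, since the values themselves must also be put on a common footing.

```python
# Minimal sketch of a shared codebook (all names invented): each
# study's local column names map onto one agreed-upon common term.
codebook = {
    "cesd_total": "depression_score",   # Study 1's naming
    "bdi_sum": "depression_score",      # Study 2's naming
    "dep_scale": "depression_score",    # Study 3's naming
}

record = {"bdi_sum": 22, "age": 14}     # one row from Study 2
harmonized = {codebook.get(col, col): val for col, val in record.items()}
print(harmonized)  # {'depression_score': 22, 'age': 14}
```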
And in many cases, Mr. Curran said, the amount of grant support for data integration is unrealistically small. “It turns out this is vastly more complicated than you anticipate,” he said.
Neither the NIH nor the NSF could provide figures on how much money they spend on such work. It’s difficult to count because the job of synchronizing terminologies is often considered part of individual grant support, said William T. Riley, director of the NIH’s Office of Behavioral and Social Sciences Research.
One approach being evaluated by the Defense Advanced Research Projects Agency, which specializes in novel solutions, involves artificial-intelligence strategies. As an experiment, the agency, known as Darpa, gave a team from Rensselaer Polytechnic Institute some 300 terabytes of largely unlabeled data from tests of how composite metal samples, made in various ways, performed under conditions related to flight worthiness.
Using data-analysis strategies akin to those a law firm might use to extract information from a large collection of emails, the RPI scientists not only figured out what the data represented but used it to predict the outcomes of future tests of such metals. The idea, said William C. Regli, deputy director of Darpa’s Defense Sciences Office, is to let researchers share their data without the burden of also trying to ensure that some future user can interpret it.
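The approach described above amounts to clustering unlabeled records by statistical similarity, so that a handful of coherent groups, rather than raw terabytes, get human attention. A rough sketch with scikit-learn, on invented records; Darpa’s actual pipeline is not detailed here.

```python
# Rough sketch (invented records): group unlabeled test descriptions
# by textual similarity so a human can inspect clusters, not rows.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

records = [
    "tensile test coupon layup A cycle 10",
    "tensile test coupon layup B cycle 12",
    "fatigue crack growth specimen batch 3",
    "fatigue crack growth specimen batch 4",
]

features = TfidfVectorizer().fit_transform(records)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
for label, record in zip(labels, records):
    print(label, record)
```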
“It’s clear that to address this problem in kind of the existing way, we’re going to drown,” Mr. Regli said. He acknowledged, however, that standardization regimes may remain essential in fields such as the social sciences that rely on data that reflects largely subjective measures.
The Standardization Problem
Scientific standardization has long suffered from insufficient attention, said Kai R. Larsen, an associate professor of information management at the University of Colorado at Boulder. It’s a big part of the reason why scientific journals are flooded with studies that often repeat the same basic findings, over and over again, just using different terms, he said.
As one recent example, Mr. Larsen was asked to review a paper on the “internet of things” — the growing network connectivity of everyday objects. Just looking through the paper, he said, he quickly recognized it as essentially a repetition of past analyses of patterns of new-technology acceptance. As such, the paper represented to him another argument for creating a coherent database of known behavioral patterns. “I asked to be relieved of the job of reviewing this paper,” he said, “because I knew I would be very negative toward it.”
Computers clearly can do better than humans in recognizing such patterns, said Mr. Larsen, who works with NSF support to develop automated text-mining technologies for behavioral studies. But first humans need to create the systems for doing that, he said. “There’s tens of thousands of behavioral and social-science researchers out there, producing papers as fast as they can, and there’s literally a handful of projects out there subsisting on minimal grants trying to organize what they’re doing,” Mr. Larsen said. “We can’t keep up with that.”
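A crude illustration of the kind of automated check Mr. Larsen’s work points toward: compare a submitted construct’s definition against those already cataloged and flag near duplicates. The definitions below are paraphrased for illustration, and real text-mining systems use far richer models than simple string similarity.

```python
# Crude sketch (paraphrased definitions): score how closely a "new"
# construct matches one already in a behavioral-science database.
from difflib import SequenceMatcher

known_constructs = {
    "perceived usefulness": "degree to which a person believes a "
                            "technology would enhance job performance",
}
candidate = ("degree to which a person believes an internet-of-things "
             "device would enhance job performance")

for name, definition in known_constructs.items():
    score = SequenceMatcher(None, candidate, definition).ratio()
    print(f"{name}: similarity {score:.2f}")  # high scores flag likely duplicates
```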
Ultimately, such experts said, the move toward greater data sharing — with accurate standards for combining databases — will depend on whether researchers can begin to show progress in using such techniques and whether universities reward them for it.
But with relatively little support from funders such as NIH and NSF, the creation of data standards is tedious and expensive, said Mark A. Musen, a professor of medicine at Stanford University and director of the Stanford Center for Biomedical Informatics Research. “Right now we’re in a situation where people do this out of the goodness of their heart,” Dr. Musen said.
That’s a recipe for reproducibility problems, said Douglas A. Mata, a clinical fellow at Harvard Medical School who uses meta-analyses to study depression in medical-school students. Meta-analyses, the more traditional method of summing up existing research, combine summary-level data from previously published studies. A future of robust data sharing could instead allow deeper findings based on analyses of the individual patient-level data underlying each published study.
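The traditional summary-level approach typically reduces to an inverse-variance weighted average of each study’s reported effect. A minimal fixed-effect sketch with invented numbers; with shared patient-level data, analysts could instead re-model every individual record rather than the few summary figures each paper reports.

```python
# Fixed-effect meta-analysis on invented summary data: each study's
# effect estimate is weighted by 1 / standard_error**2.
effects = [0.30, 0.45, 0.20]       # per-study effect estimates
std_errors = [0.10, 0.15, 0.08]    # per-study standard errors

weights = [1 / se ** 2 for se in std_errors]
pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
pooled_se = (1 / sum(weights)) ** 0.5

print(f"pooled effect: {pooled:.3f} (95% CI +/- {1.96 * pooled_se:.3f})")
```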
Dr. Mata said automated programs that routinely introduce inaccuracies into data-sharing protocols aren’t likely to be a major problem, because future researchers will probably have a variety of choices and methods for handling their analyses. That said, a tendency to take shortcuts — such as not using statisticians in comparison studies — is a major cause of the current reproducibility crisis in science and could continue to cause problems in the future, he said. Too many research groups are “already making use of prepackaged tools where they can just kind of unthinkingly click a button and accept whatever the program puts out,” Dr. Mata said.
Despite such warnings, the institutional push for open data in science still tends to focus more heavily on expanding access to data than on figuring out how to accurately handle huge amounts of data once it becomes available.
Last week, eight leading private funders of scientific research announced the creation of the Open Research Funders Group. The group, whose members include the Bill & Melinda Gates Foundation and the Alfred P. Sloan Foundation, said in its announcement that its members “are committed to using their positions to foster more open sharing of research articles and data.”
But the group’s project coordinator, Greg Tananbaum, acknowledged in an interview that ensuring accuracy needs even more attention. It’s “absolutely” the case, Mr. Tananbaum said, that the availability of data is ahead of the availability of tools that can process that information accurately and completely. “There’s no doubt about it,” he said. “I don’t think anyone could argue otherwise.”
Paul Basken covers university research and its intersection with government policy. He can be found on Twitter @pbasken, or reached by email at paul.basken@chronicle.com.