A Digital Library Guru Discusses New Rules on Sharing Scientific Data

Last week, a significant change went into effect at the National Science Foundation: The agency will now require researchers to submit data-management plans with their grant proposals.

Open government advocates hailed the move as the latest in a series of steps that are expanding public access to work done with taxpayer money. The policy will not go so far as to mandate public sharing of all data, which in this context could mean anything from glacier images to scientific papers to computer code. But it will “require people to essentially justify why they choose not to be open,” says Beth Noveck, a professor at New York Law School who until recently directed the White House Open Government Initiative.

You can find lots of detailed information about the change at the NSF and the Association of Research Libraries sites. We sat down with a leading data guru, Sayeed Choudhury, to get his take on what the move means for science. Mr. Choudhury, associate dean of university libraries at Johns Hopkins University, heads a project called the Data Conservancy. That effort has an NSF grant to help develop part of the foundation’s ambitious DataNet project, which seeks to build an international, large-scale data-curation network.

Q. What’s your opinion about the NSF’s change?

A. Generally speaking, there is quite a bit to be said for allowing not only other scientists, but the general public to have access to the results of federally funded research. We’ve seen some of that with the NIH PubMed Central (a free archive of life-sciences journals). There have been a couple of cases when we’ve opened up data to what one of the professors here at Hopkins likes to call “Internet scientists.” If you look at Galaxy Zoo, basically what astronomers did is open up access to images from NASA’s Hubble Space Telescope archive. And what they found is that people are much better, quite frankly, at classifying galaxies by looking at images than machines are right now. This may seem like a cute little thing, but it’s not. This is really helpful to professional astronomers for their research. It’s really taken a life of its own, in that the framework people are using, they’re now using for other kinds of science projects. So it really is not only “it’s good for taxpayers.” It actually gets much broader participation in science activities than I think you’d otherwise get. (For more on the rise of crowd science, see this story The Chronicle published last May.)

Q. How big a deal is this?

A. The way a lot of sharing happens now is like sending e-mail to each other. It’s point to point. I may read a paper, I may discover that somebody’s doing this kind of research, or I may know people and contact them. And I think there’s a lot of, “OK, here you go. Here are my files. If you have questions, I can explain it to you.” That’s fine. But I think what we are starting to see is much more distributed. It’s a little bit more like peer-to-peer networks. To me, the ultimate value of preservation of data is that I don’t need to go back to the original producer to figure out how to use it. It becomes much more systematic, rather than idiosyncratic. If that’s the case, then you build this network, and it becomes part of the social fabric, rather than this point-to-point e-mail and telephone kind of exchange. I think that’s what’s potentially significant about this. But the devil’s in the details, right? We’ve got a lot of work to do in order to make it work like that.

Q. What is motivating these changes?

A. I do think the taxpayer issue is an important one. That’s probably the most explicit reason. There are some implicit ones as well, including the idea that if you can actually share data, preserve it, use it in responsible and meaningful ways, then you can get better science out of it. … Some publishers have a policy right now for providing free access to a lot of their journals in least-developed countries. And there’s at least some noise that they were about to change this, or some of them may have changed this. A lot of the counterarguments that have come up are that this is a really bad idea. These countries don’t have a lot of resources. And by getting access to publications, they’re able to get better science, they’re able to deal with public health issues, and so on. And I don’t think it’s any different with data.

The other aspect of this is the possibility of spurring reuse outside of the academic or the scientific world. There could be companies that produce services around data, things of that nature that they may not be doing right now. Think about the weather data, for example, that the National Weather Service produces. Other people use it and repackage it, the Weather Channel and people like that. So there are, in fact, for-profit uses that could come up if you release data to the public. People may be very interested in having visualization tools, for example.

Q. Practically, do you have any sense of what will change? What will we start to see—public repositories of this stuff?

A. We are thinking about where these data will reside. My impression is that there will be a combination of both centralized and decentralized approaches. What I don’t think we want is many, many data sets linked to many, many Web sites. The Web sites may go away. They may not be maintained. They may be personal Web sites rather than institutional Web sites. The data need to be curated … Documents, including even the publications within PubMed Central, are designed to be read by people. Data are born to be processed by machines. And that has very profound implications in terms of how they’re managed and accessed and preserved over time. So that’s a very practical, substantive question that has to be put out there. If we invest a lot of funding in producing new data, we have to invest some amount of funding in actually making sure that the data are preserved and can be used. So beyond that, let’s fast-forward to a world where, in fact, that is happening, and scientists know that in fact they can put their data somewhere, and it’ll be taken care of. Then people start to think about how they can do things in different ways.

We have a researcher within the Data Conservancy, Patricia Romero Lankao, who looks at climate-change research, particularly the social impacts of climate change. She’s been thinking about a whole new type of research that would be possible if you actually were able to bring together data from these different places and run different kinds of analyses. From a science perspective, you start to get people saying, “Well, OK, what if this kind of environment existed? What kinds of questions might I ask that I don’t ask today, because it’s just not practical?”

Q. How do scientists feel about the new requirement?

A. I think it varies. I think you’re going to get reactions all the way from, “I have enough to do, and I have enough documentation to produce,” all the way to, “This is good. This is, in fact, what science is about.” The most common experience that we’ve had so far is they come in the room, and there’s this sense of, “I don’t really know what this is. Can you help me?” And we go through the template we’ve gotten, we go through the interview process, and, I hope, get them to a comfort level where they realize, “OK, I get it now. I understand why this is useful. I understand why it’s important.”
