More than a decade ago, the pioneering computer scientist Jim Gray referred to data science as the “fourth paradigm” of discovery, joining experimentation, theory, and numerical simulation.
Today it doesn’t take a visionary to see that rapid advances in our ability to acquire, generate, and analyze data are influencing scholarly work in nearly every academic field. Higher education, though, has been slow to adapt to this trend. Only a few years ago, a provost who wanted to understand what sort of computing would be required by researchers in the future would probably have consulted faculty members who labeled themselves “computational scientists.” Their answer might have been: “Additional subsidized cycles to run my simulations.”
Numerical simulation continues to be of great importance. Today’s data science, however, relies far more on intellectual infrastructure than on physical infrastructure: new methods, tools, partnerships, and types of researchers, plus the institutional change required to create new career paths and reward structures. But most researchers — even the very best — are not well versed in the tools and approaches of what we call data-intensive discovery. Until recently, for example, one could be a world-class oceanographer without possessing knowledge of data science. But no more. Oceanography, like so many other disciplines, is becoming an information field, through rapid advances in chemical, physical, biological, and video sensors that stream data with unprecedented volume, velocity, and variety; remotely operated vehicles; and observatories that extend the internet to the seafloor. The sophisticated analysis of data and innovation in data-analysis methods have become integral to the field.
The tools and scholarly approaches of data science, meanwhile, are still in their infancy and evolving rapidly, but academic-reward systems do not provide adequate incentives to make them more accessible, reliable, and effective. Furthermore, essential partnerships between the fields that specialize in data-science methodology (computer science, statistics, applied mathematics) and those that need to employ such methodology to drive research (life sciences, environmental sciences, physical sciences, and social sciences, among others) are not well developed. Such partnerships are difficult to fund and sustain because they must bridge disparate academic fields and require extended effort — notorious challenges for federal research agencies, peer reviewers, and university appointment-and-promotion committees. There are significant cultural mismatches between the entrenched structures of universities and the needs of data-intensive research.
In 2013 the Gordon and Betty Moore Foundation and the Alfred P. Sloan Foundation selected New York University, the University of California at Berkeley, and my institution, the University of Washington, to join in a collaborative experiment intended to transform the process of data-intensive research and the institutional environments in which it takes place. As part of the Data Science Environments Partnership, we formed working groups across our institutions to identify institutional impediments and take actions to overcome them. Among our recommendations so far:
- Colleges need new and better ways to support the careers of data scientists. We must create and sustain long-term career trajectories for a new generation of scientists whose research depends crucially on the analysis of complex data, and whose work may require substantial curation or development. We must reward scientists who focus on building next-generation tools that others will use to advance science.
- Colleges must rapidly and dramatically improve education and training in data science. Training is required at all levels: for undergraduate and graduate students, postdocs, and members of the research staff and faculty. It must be tailored to meet the needs and mathematical or computing backgrounds of various disciplines.
- Colleges must help develop an ecosystem of new tools and software environments. Today’s tools and software environments distract from the science that should be the focus. The research community itself is best positioned to tackle this challenge. How can colleges recognize and encourage the development, sharing, and integration of software tools and environments that support data-intensive research?
- Colleges must encourage reproducible and open data science. As data-intensive research grows in importance, we have the opportunity to create software tools and practices that support the sharing and reuse of data, software, and scientific procedures, allowing us to spend more time standing on one another’s shoulders and less time standing on one another’s toes.
- Colleges must create physical and intellectual spaces to help expand the data-science community. Physical spaces for collaboration are essential for facilitating work that crosses disciplinary boundaries. We must recreate the “water cooler” where researchers from different fields interact and discover common problems and common solutions.
Four years into the Data Science Environments project, we certainly don’t have all the answers, but we have some of them, and our effort can offer guidance to educational leaders seeking to accelerate data-intensive research on their campuses. An extensive discussion of our successes and challenges may be found at msdse.org/creating_institutional_change.html.
The creation of our Data Science Environments — campuswide settings in which data-intensive discovery can flourish — was supported by multiple funding sources, particularly the Moore and Sloan foundations. The continuing cost of sustaining those activities is a few million dollars per year — a tiny amount when compared with the budgets of most major research universities.
Our Data Science Environments wield enormous leverage, providing an intellectual infrastructure that can revolutionize the process of discovery in a broad range of fields. We are optimistic that our institutions will soon consider these environments as essential to our missions as computing and libraries.
Ed Lazowska holds the Bill & Melinda Gates Chair in the Paul G. Allen School of Computer Science & Engineering at the University of Washington, and in 2007 became founding director of the University of Washington eScience Institute.