CSE 2011: Data: A Centuries-old Revolution in Science, Part I

July 15, 2011

Ed Seidel

Four hundred years ago, Galileo ushered in a true revolution in science by combining painstaking observations---data, which he collected in notebooks---with deep thinking to articulate mathematical descriptions of the observed systems. Building on this data-driven foundation, Newton developed a modern theory of gravitation, as well as calculus, which laid the groundwork for a comprehensive worldview governed by partial differential equations (that many of us have spent our careers trying to solve!).

Clearly, Galileo and Newton taught us well: The four centuries of modern science that followed have been a time of amazing discoveries. And less than one hundred years ago, Albert Einstein fueled this data revolution when he extended Newton's theory with his theory of general relativity built on a system of PDEs. Unfortunately, these PDEs were so complex that Einstein himself was not equipped to solve them! Nonetheless, observations generating just a notebook full of data confirmed that his theory was indeed true. Half a century later, Stephen Hawking's groundbreaking work on black holes resulted in output that can now be quantified as kilobytes of digital data.

Indeed, the methodologies of Galileo's and Newton's data-driven science, and the culture of science, with small groups thinking deeply about fundamental problems, have been at the center of the time-honored tradition of scientific research for centuries.

But if we fast-forward just 30 years from Hawking's work on black holes, we see that the world has changed tremendously. Advances of about nine orders of magnitude in computing capability, along with deep advances in algorithms, have made many of the most complex PDEs solvable. Suddenly, we are generating data by the petabyte, in quantities that could never have been stored in Galileo's notebooks. Dramatic, fundamental, and pervasive changes are upon us as we enter the data-intensive age of science.
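
To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. It takes the "about 9 orders of magnitude" and "30 years" from the paragraph above at face value. It also assumes smooth exponential growth and decimal units for kilobytes and petabytes; neither assumption comes from the article, and the variable names are purely illustrative.

```python
import math

# Figures quoted in the text (both approximate).
ORDERS_OF_MAGNITUDE = 9   # growth in computing capability
YEARS = 30                # elapsed time since Hawking's black hole work

# Assumption: growth was a smooth exponential over the whole period.
growth_factor = 10 ** ORDERS_OF_MAGNITUDE
doublings = math.log2(growth_factor)       # number of doublings needed
doubling_time_years = YEARS / doublings    # implied time per doubling

# Assumption: decimal units (1 kB = 10^3 bytes, 1 PB = 10^15 bytes).
KILOBYTE = 10 ** 3
PETABYTE = 10 ** 15
data_orders = math.log10(PETABYTE / KILOBYTE)  # kilobytes -> petabytes

print(f"doublings needed: {doublings:.1f}")
print(f"implied doubling time: {doubling_time_years:.2f} years")
print(f"kilobytes to petabytes: {data_orders:.0f} orders of magnitude")
```

Under these assumptions, computing capability would have doubled roughly once a year, while the step from kilobytes to petabytes spans twelve orders of magnitude in data volume.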

The New Age of Data
A profound shift is occurring across all fields of research as technological advances enable us to tackle many truly complex challenges facing science and society. For example, not only can we now solve Einstein's complex PDEs, we can also begin to integrate other parts of physics and astronomy into studies of real-world phenomena, such as gamma-ray bursts, across the universe. At the same time, we are developing the capability to observe phenomena through all channels known to science, resulting in a diversity of data sources brought to bear on a single event. Now, all this knowledge, held in different communities, and all this data must be integrated, so that new knowledge can emerge.

Indeed, two key trends have begun to emerge:

Fundamentally, data is becoming not only the output of most scientific inquiry, but also the dominant and fundamental medium of communication among researchers across all disciplines.

Implications for Science
Galileo's vision of modern science as a data-driven activity guiding mathematical description remains, but exponential growth of the data volumes, along with their ubiquity and diversity, will require completely new thinking---specifically, new mathematical and statistical methods---to describe not only the systems under study but the data themselves. Like computation, data-intensive science will drive revolutions in mathematics: How are features, let alone new laws of nature, to be found in the vast volumes of data being collected? How can disparate data, from different instruments and multiple communities, be combined to advance knowledge? These questions will drive new discoveries in mathematics and statistics, and new techniques in computer science and machine learning, just as they will be required for progress in the underlying science domains that pose them.

Furthermore, these changes in the culture and methods of science will call for a reconsideration of policies and practices as they relate to scientific research. As knowledge creation occurs rapidly at community boundaries, as data increasingly becomes the main output of science, and as scientists need to share data to collaborate, policy must be carefully developed to enable collaboration. And traditional modes of communication, namely scientific publications, will need to develop a richer set of tools and software to accelerate the flow of information and to support the reproducibility of results. Openness and sharing of data will be critical to accelerating the advancement of science.

And so, we have arrived at yet another scientific revolution---a revolution in the scope, use, and production of data. The scientific rationale and implications for policy will be explored in the second part of this article.

Ed Seidel, assistant director for mathematical and physical sciences at the National Science Foundation, discussed the ideas presented here in a panel session at CSE 2011 in Reno. At NSF, he was previously director of the Office of Cyberinfrastructure. He is on leave from Louisiana State University, where he is the Floating Point Systems Professor in the Departments of Physics and Astronomy and Computer Science.
