CSE 2011: Data: A Centuries-old Revolution in Science, Part II
September 18, 2011
Edward Seidel and Abby Deift
In Part I of this article, we discussed dramatic and rapid changes in the culture and methodology of science, from an individual, mathematical, and even computational enterprise to a collaborative, computational, and data-enabled one. Yet this change has not yet been fully realized. Consider the case of an individual astronomer: although telescopes not so different from Galileo's revolutionized our understanding of the universe centuries ago, the primary operational model in astronomy is still that of an astronomer using a telescope at a specific time to observe a specific object of interest, or a modeler using a computer for a designated amount of time to simulate an astrophysical process.
But not for long: As a harbinger of the new model of data-enabled science impacting all disciplines, the very concept of using a telescope (or other instrument) for some time is giving way to accessing data from a telescope. The proposed Large Synoptic Survey Telescope (LSST), described in Part I, will follow the concept demonstrated by the highly successful Sloan Digital Sky Survey; the relatively small SDSS telescope has resulted in hundreds of scientific publications by scientists working solely with public SDSS data sets or using them to generate questions that lead to new investigations. The plan is for the LSST to constantly survey the sky, covering it every few days and repeating the process day after day, week after week, year after year, generating a continuous movie of the Universe. Time on the instrument will not be uniquely relevant, nor will any single object; instead, perhaps a million transient objects will be captured in the data in a single night!
Hence, we find ourselves with both a new model for using a scientific instrument (e.g., its data are to be analyzed, served, and shared) and a new model for extracting the scientific content from the data (e.g., new mathematical, statistical, and computational algorithms will have to be developed and applied to search the data for discoveries, as no human will be capable of doing so). It is clear from this example that the pace of progress in science and engineering is accelerating, driven in part by advances in computing and data-intensive methods, and in part by the speed at which data, and therefore information and knowledge, can flow.
The fundamental shift illustrated above not only motivates serious philosophical changes in how we imagine scientific research and discovery, but also generates practical questions about how we perform scientific research, ensure its reproducibility and verification, and communicate its results.
For these reasons, policy regarding data and publications associated with scientific research is undergoing serious rethinking at all levels: in the science communities, in funding agencies, such as NSF, and in government, both in the US and internationally. How can we---stakeholders of the scientific community---enable an open networked environment in which data and knowledge will accelerate the discovery process? As data are shared, reused, and repurposed, new science will emerge; if these results are fully searchable, new findings may be uncovered in the data, as well as in the literature, and advanced search algorithms could make it possible even to follow scientific arguments through time and across disciplines. Further, scientific and public trust in correct results will be established more rapidly, while faulty arguments will be more quickly detected and corrected. In response, the publication environment will need to develop new norms and practices for citation and attribution so that data producers, software and tool developers, and data curators are credited for their contributions.
Scientists no longer wait to receive their paper journals on a monthly, quarterly, or annual basis to learn about developments and discoveries in their respective fields. Instead, they log on to websites---often, but not always, through their university libraries---and access scientific publications as early as the instant they are completed. And using an electronic platform enlarges the scope of the material that can be published---which can include digital media, simulations, software, and embedded charts---so that the very notion of what constitutes a publication is itself being rapidly transformed.
Advances in modern science are deeply intertwined with our approach to the production, management, curation, and sharing of modern scientific data and its increasingly close cousin, the modern publication. The traditional lines between scientific research tools, results, software used, and communication to the community through "publication" are highly blurred. Indeed, in today's world they are all expressed as digital data. Software, as the ubiquitous modern language of science, is purely digital; data is the product of all scientific research; scientists and society in general now communicate by sharing data.
Likewise, the notion of a publication is quickly moving from the printed page, to its digitized replication, to a fully modern suite of data, software, and words needed to comprehend and communicate a scientific result. All of these, in all combinations, must be considered legitimate---indeed necessary---components of publication, along with the intellectual credit they bring to their authors.
As science problems of interest become increasingly complex, researchers are addressing them in increasingly collaborative and multidisciplinary approaches. These approaches are possible only if knowledge is free to flow across traditional disciplinary boundaries. Effective sharing of data and knowledge, within and across disciplines, is critical to the advancement of science. Once the multiple products of research are open and searchable, knowledge that is typically locked away in a particular community's data and literature will be accessible and known to other communities.
Any changes to the publication processes need to recognize the importance of vetting, including merit and peer review. Maintaining high quality and editorial integrity in scholarly publication is critical, and these standards must be applied to the data behind articles, interactive displays of data, videos of simulations, and even presentations. This clearly will require modified business models and cost structures, but fundamentally we must recognize that these costs are real and represent serious considerations for the members of our community.
As we begin to work through these issues---both within the scientific community and with its supporting institutions---we must also keep in mind Galileo's insistence on reproducibility in science: Difficult enough in the 17th century, the reproducibility and verification of scientific results in today's complex environment combining theory, experiment, simulation, data analysis, and collaboration, all expressed ultimately through software, pose enormously increased challenges.
In this digital world, openness and transparency of all such products and tools of research are not only critical to accelerating scientific and engineering progress as results are communicated more rapidly and effectively across communities; these principles also help build public trust in the nation's scientific enterprise. The impact of publicly funded scientific research is enhanced when its results are made available to the widest possible audience, at minimal cost, as quickly as possible.
Commitment to supporting science goes hand in hand with a commitment to enable the best possible science in this computation- and data-enabled era. Given these incredible and rapid changes in the methods, the culture, the conduct, and the dissemination of science, we have a responsibility as a community to embrace such change and to think deeply about how to best enable science in this new millennium.
Edward Seidel is assistant director for mathematical and physical sciences at the National Science Foundation. Abby Deift is the MPS science assistant.