Questioning Big Data
April 1, 2013
At a recent meeting of the SIAM Committee on Science Policy (December 3 and 4, Washington, DC), a few themes emerged from the wide-ranging discussions generated by the agenda and slate of visitors. "Big data," the overarching theme, was closely connected to two others: programs of most of the U.S. agencies that fund science and the increasingly interdisciplinary nature of research on important problems.
With a single comment, CSP member Fred Roberts threw all those themes into relief. A lot of agencies, he said, don't understand "big data"---do not, in fact, even know what it is. Of interest to the CSP, he continued, are the "huge opportunities and roles for mathematics with data," both in the use of existing methods to analyze data and in the development of new methods.
Roberts, director emeritus of DIMACS, the NSF-supported Center for Discrete Mathematics and Theoretical Computer Science at Rutgers, is currently director of CCICADA, the Command, Control and Interoperability Center for Advanced Data Analysis (a Center of Excellence of the Department of Homeland Security). During a break in the meeting, he outlined for SIAM News the view of "big data" that he has honed while thinking about these issues in his various academic and government positions. His points, in summary, are:
■ With respect to a definition of "big data," you have a big data question if you have so much data that you don't know what to save and, in some cases, need to make that decision instantaneously. This occurs especially in certain disciplines, e.g., astrophysics.
■ It is often necessary to determine the normal state of a system in order to be able to quickly detect departures from it. An example is the smart grid: Operators now get data every 2–4 seconds; with new phasor technology, updates might come 10 times per second. At that rate, a human won't be able to detect the state of the grid without algorithms.
■ Data now comes from a variety of sources---sensors, audio, video, among many others---and a variety of media. How do you make sense of data coming from the many different sources?
■ How do you store, query, and search data when there's so much of it?
■ How can you trust the data you have? How do you define "trust"? Social media data is an example---can Twitter and Facebook data be considered accurate?
■ You would like to make inferences and hypotheses from large amounts of data. How do you do that?
■ The problem is not just the size of the data set, but also its complexity.
■ Large data problems now come from many disciplines. Examples are NEON (National Ecological Observatory Network), a project of the National Science Foundation, and GBIF (Global Biodiversity Information Facility), an international effort to digitize all information about all living species (estimated number: between 2 and 10 million). These projects are striking for both the size and the complexity of the data sets. The intelligence community has been dealing with big data questions for some time; the Department of Homeland Security tries to do so. The financial sector grapples with huge amounts of data.
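To make the point about normal states and departures concrete, here is a minimal sketch in Python of one simple way to flag departures in a stream of readings: keep a rolling window of recent values as the baseline and flag any reading that lies many standard deviations from the window mean. This is an illustration only, not a method described at the meeting or one used by grid operators; the class name `BaselineMonitor`, its parameters, and the sample readings are all hypothetical.

```python
from collections import deque
from math import sqrt


class BaselineMonitor:
    """Flag readings that depart from a rolling estimate of the normal state.

    Hypothetical illustration: the window size, threshold, and minimum
    sample count are arbitrary assumptions, not values from the article.
    """

    def __init__(self, window=50, threshold=4.0, min_samples=10):
        self.window = deque(maxlen=window)  # most recent "normal" readings
        self.threshold = threshold          # allowed distance, in standard deviations
        self.min_samples = min_samples      # readings needed before flagging starts

    def update(self, value):
        """Return True if value departs from the baseline; otherwise absorb it."""
        n = len(self.window)
        if n >= self.min_samples:
            mean = sum(self.window) / n
            var = sum((x - mean) ** 2 for x in self.window) / (n - 1)
            std = sqrt(var)
            if std > 0 and abs(value - mean) > self.threshold * std:
                return True  # departure: keep it out of the baseline
        self.window.append(value)
        return False


# Made-up readings: a steady signal with small jitter, then a sudden drop.
monitor = BaselineMonitor(window=20)
stream = [60.0 + 0.01 * ((i % 5) - 2) for i in range(30)] + [59.5, 60.0]
for t, reading in enumerate(stream):
    if monitor.update(reading):
        print(f"reading {t}: {reading} departs from the baseline")
```

Flagged readings are deliberately kept out of the window so that a single excursion does not distort the estimate of normal. A real monitoring system would have to handle slow drift, correlated signals, and missing data, which this sketch ignores.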
Just about every U.S. federal agency that funds science currently supports at least one major national initiative on data. Announcing a $200 million R&D initiative in big data in March 2012, the White House described the program as a way to enhance "our ability to extract knowledge and insights from large and complex collections of digital data." Several of the agencies most important to the interests of the SIAM community were represented at the CSP meeting by knowledgeable visitors, who described their programs and engaged in discussion with the committee members.
With articles by Barry Cipra, careers columnist Tanya Moore, and reviewer James Case, this issue of SIAM News offers a variety of perspectives on big data. Under way for future issues are reports from the unprecedentedly well-attended 2013 SIAM Conference on Computational Science and Engineering (Boston, February 25 to March 1), where big data was featured in many sessions, including the forward-looking panel discussion, "Big Data Meets Big Models."