Exploring Privacy and Validity in the Land of Plenty
In the current age of big data, analyzing and interpreting data correctly is quickly becoming a huge challenge. Data analysis not only increases the risk of spurious scientific discovery but can also compromise privacy. In her plenary talk, “Privacy and Validity in the Land of Plenty,” at the SIAM Annual Meeting in Boston, MA, Cynthia Dwork of Microsoft Research discussed the challenges and methods of preserving privacy in data analysis. She also gave an overview of mathematical methods for ensuring that conclusions drawn from big data sets are as accurate as possible.
Dwork first became interested in privacy in the modern age through the work of Helen Nissenbaum. She began her talk with a brief discussion of the general assumption that privacy-preserving data analysis means that we shouldn’t learn anything new about the subject in question. “The problem with this,” Dwork said, “is that if we’re not going to learn anything new about people, what is the point of the data set?” However, there is a workaround for this problem: it is not a privacy compromise if analysts would have learned the same thing had the subject been replaced by another random member of the population. This is the idea behind the general solution concept of differential privacy, which is robust to the networked world and reveals fundamental truths about computing stability. “We’re learning about people, but not specifically learning about the individuals in the data set,” Dwork said.
According to Dwork, the so-called English language definition of differential privacy is that “the outcome of any analysis is essentially equally likely, independent of whether any individual joins, or refrains from joining, the data.” Thus, differential privacy is a strong privacy guarantee that often permits highly accurate data analysis. Dwork described basic algorithmic techniques for achieving differential privacy, emphasizing that the stability of an algorithm is necessary to prevent overfitting. The structure of a differentially private algorithm allows researchers to minimize cumulative privacy loss. This property sets differential privacy apart from other techniques for maintaining privacy, and it helps prevent false discoveries that arise from adaptivity in data analysis.
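The article does not say which techniques Dwork presented, so the following is only a minimal sketch of one standard approach, the Laplace mechanism, applied to a counting query. The helper names `laplace_noise` and `private_count` and the parameter `epsilon` (the privacy-loss budget) are illustrative assumptions, not taken from the talk. Because one person joining or leaving the data changes a count by at most 1, adding Laplace noise of scale 1/epsilon makes each possible output roughly equally likely either way.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Noisy count of the records satisfying `predicate` (hypothetical helper)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for r in records if predicate(r))
    # A count has sensitivity 1: adding or removing one person
    # changes it by at most 1, so the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Usage: how many people in a made-up data set are older than 40?
ages = [23, 35, 41, 58, 62, 19, 47]
noisy = private_count(ages, lambda age: age > 40, epsilon=0.5)
print(f"noisy count: {noisy:.2f}")
```

A smaller `epsilon` means more noise and stronger privacy. In this sketch the privacy losses of repeated queries add up, which is why the cumulative loss across an analysis has to be tracked.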
Dwork also analyzed the problem of statistical validity in adaptive scenarios, where new questions and new studies depend on results and outcomes from previous studies. There is a disconnect between theoretical results and data analysis practice, since in practice data are shared and reused, with hypotheses and new analyses generated on the basis of data exploration and conclusions from previous analyses. Dwork described studies to guarantee the validity of statistical inference in adaptive data analysis, since most datasets are representative of populations as a whole; the word “plenty” in her talk’s title refers to one giant, collective data set used by all researchers. “Science is by nature an adaptive process,” she said. “Everyone is studying the same data set and they all influence each other. If in the process of adaptive exploration the analyst finds a query for which the dataset is not representative, then she must have learned something significant about the data.”
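To make the risk of adaptivity concrete, here is a toy illustration that is not drawn from the talk. It assumes a data set of pure noise in which features and labels are independent coin flips, so no real signal exists. An analyst who picks the features that look most correlated with the labels, and then evaluates a rule built from them on the same data, sees an apparent effect that disappears on fresh data. All function names and sizes below are hypothetical choices made for the sketch.

```python
import random


def make_data(n, d, rng):
    # Features and labels are independent fair coin flips in {-1, +1}.
    X = [[rng.choice((-1, 1)) for _ in range(d)] for _ in range(n)]
    y = [rng.choice((-1, 1)) for _ in range(n)]
    return X, y


def select_features(X, y, k):
    # Adaptive step: keep the k features most correlated with y on this data,
    # together with the sign of each correlation.
    n, d = len(X), len(X[0])
    corr = [sum(X[i][j] * y[i] for i in range(n)) / n for j in range(d)]
    top = sorted(range(d), key=lambda j: abs(corr[j]), reverse=True)[:k]
    return [(j, 1 if corr[j] >= 0 else -1) for j in top]


def accuracy(X, y, rule):
    # Predict each label by a majority vote of the sign-adjusted features.
    correct = 0
    for row, label in zip(X, y):
        vote = sum(sign * row[j] for j, sign in rule)
        correct += (1 if vote > 0 else -1) == label
    return correct / len(y)


rng = random.Random(0)
X, y = make_data(100, 500, rng)
rule = select_features(X, y, k=21)        # odd k avoids tied votes
reused = accuracy(X, y, rule)             # evaluated on the data used for selection
X_new, y_new = make_data(1000, 500, rng)
fresh = accuracy(X_new, y_new, rule)      # evaluated on data the analyst never saw
print(f"accuracy on reused data: {reused:.2f}, on fresh data: {fresh:.2f}")
```

The gap between the two numbers is the kind of false discovery that stable algorithms, including differentially private ones, are meant to limit.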
About the Author
Lina Sorg
Managing editor, SIAM News
Lina Sorg is the managing editor of SIAM News.
