Statistics and Data Mining Meet Biology at SDM07
August 7, 2007
Mathematically inclined members of the data-mining community convened in Minneapolis, April 26–28, for the Seventh SIAM International Conference on Data Mining. As in past years, the most popular features of the program were the four keynote talks, each of which is briefly described here. Whether by chance or by design, biology is a topic that came up frequently, and the keynote talks were no exception. Tom Mitchell of Carnegie Mellon University ("Machine Learning for Analyzing Brain Activity") focused on the use of brain imaging data to study human cognition, and Ajay Royyuru of IBM Research ("Deep Computing in Biology") discussed challenges that arise in the analysis of biological and biomedical data.
Presenting results from various ongoing studies at CMU, Mitchell showed how data from functional magnetic resonance imaging can be used as a proxy measure of the neural activity that occurs in the brains of subjects presented with various visual and textual stimuli. As machine learning experts, Mitchell's team is investigating the training of classification algorithms from fMRI data, asking whether a word category (or the word itself) can be predicted from an observed fMRI pattern. Some interesting early results indicate that there may indeed be a learnable pattern of brain activity, across modality and language of stimulus presentation, for a specific semantic concept. These learned representations appear to be common across people, opening the possibility that classifiers trained on fMRI data from one group of people might be used to decode the brain activity of new subjects. An additional result demonstrates that classifiers can accurately distinguish between separate but closely related individual concepts. The project is now investigating differences in learning in a subject presented with a concept as a text phrase or as an image; results could be useful in teaching people with learning disabilities.
Royyuru used examples drawn from a variety of domains, including RNA interference, proteomic datasets, and fMRI data, to motivate the data-mining community to work with domain scientists on some of the challenging problems in computational biology. In particular, he mentioned the IBM/National Geographic Genographic project (https://www3.nationalgeographic.com/genographic/), which is a collaborative effort to map humanity's journey across the globe, to understand where we came from and how we got to where we live today. By collecting genetic information from various populations, the project aims to use the DNA to trace human migration. From a data-mining point of view, a key challenge is that the data is unbalanced, with very few samples available from certain populations. Another project, Blue Brain (http://bluebrain.epfl.ch/), aims to understand brain function through computer simulations on massively parallel machines. Royyuru stressed the need for analytic techniques that can handle the scale and complexity of current-day biological data, the need to leverage high-end computing resources for this purpose, and the necessity to identify and potentially rank order the most interesting patterns for further evaluation by experimentalists.
Providing a statistical perspective on data analysis, Jerome Friedman of Stanford University presented techniques for building and using large ensembles of rules, as well as methods for identifying important variables and their interactions. One interesting approach is the use of lasso regression to choose useful rules, which is quite different from existing rule-selection methods in data mining. Such opportunities to learn about alternative approaches to a problem have always been a hallmark of the SDM conferences, which bring together participants from different disciplines, ranging from machine learning and statistics to various scientific and commercial application domains.
With search engines becoming ubiquitous in our lives, Corinna Cortes of Google reminded the audience that several algorithmic and theoretical challenges remain to be addressed. First, with the high cost of power (about half the cost of computing hardware over a four-year period) comes a need for smart hardware and software solutions that can increase power efficiency, by predicting, for example, when to power down disks that are not likely to be used for some time. Second, ranking, although it has received less attention than classification, is an important research area; in particular, better linear-complexity algorithms are needed. Third, we need to better understand the quality of results from data-mining algorithms, especially the ranking of search results and the quality of advertisements served to the individual who sees them. And finally, Google's MapReduce framework, with its ability to outperform more complex approaches by answering simple questions from very large amounts of data, presents opportunities that should be exploited.
Responsible at least in part for the very successful plenary poster session was a feature introduced at this conference: Each poster author had two minutes to whet the audience's appetite for his or her poster. For many student authors, this was an opportunity to practice their "elevator pitches." Another new feature of the conference was a panel session on the status and opportunities in data-mining research, chaired by Haym Hirsh, with panelists Christos Faloutsos, Mehran Sahami, Ajay Royyuru, and Jerome Friedman. Despite the diverse interests and backgrounds of the panel, several themes emerged: the need for better ways to measure real progress, and the success of benchmark data sets and grand challenge competitions in making progress measurable and in stimulating interest; the problems of interdisciplinary collaboration, particularly in applying data mining to systems biology and finding ways to use domain expertise effectively; and the power of simple algorithms on really, really large data sets.
The conference drew 250 participants from at least four continents. An enthusiastic program committee reviewed nearly 300 papers, selecting 36 for oral presentation and 39 as posters. In addition to the usual large number of submissions from the U.S., there was strong representation from Europe, Asia, and Canada. The Best Research Paper awards went to J. Sun, Y. Xie, H. Zhang, and C. Faloutsos, for "Less is More: Compact Matrix Decomposition for Large Sparse Graphs," and, in the Applications category, to J. Yang, Y. Liu, E.P. Xing, and A.G. Hauptmann, for "Harmonium Models for Semantic Video Representation and Classification." Expanded versions of the top-ranked papers will appear in a special issue of the journal Statistical Analysis and Data Mining.
The conference was supported by a number of sponsors: Google, KXEN, Lancet Software, the American Statistical Association, and the University of Minnesota. IBM and the U.S. National Science Foundation deserve special mention for student support: The funds they provided covered the travel costs of every student with a paper or poster accepted for presentation, and the registration costs of virtually all other students. As always, the SIAM staff made organizing the conference a pleasure. We look forward to another exciting meeting next April in Atlanta.
Chid Apte, Chandrika Kamath, Bing Liu, Srinivasan Parthasarathy, and David Skillicorn.