Statistics and Practical Applications of Data Mining: Highlights from SDM04
July 25, 2004
The Fourth SIAM International Conference on Data Mining, held in Orlando, Florida, April 22-24, 2004, continued the tradition of providing an open forum for the presentation and discussion of innovative algorithms, as well as novel applications of data mining. A record number of paper submissions this year marked not only a growing interest in the field, but also a greater acceptance of the conference among data mining researchers and practitioners. Student authors accounted for a large percentage of the accepted papers (and their papers were reviewed under the same stringent guidelines as regular papers).
Speakers in three categories received "best paper" awards: Martin Law of Michigan State University received the best student paper award for his work on manifold learning (done with Nan Zhang and Anil Jain, also of Michigan State University); a team from the University of Texas at Austin (Arindam Banerjee, Srujana Merugu, Inderjit Dhillon, and Joydeep Ghosh) received the best algorithms paper award for their work on clustering; and a team from AT&T Laboratories (Deepak Agarwal and Daryl Pregibon) received the best applications paper award for their work on enhancing communities of interest.
A running theme of the conference was the practical application of data mining, including opportunities in various problem domains and practical lessons learned by those solving real data analysis problems in these domains. This was reflected in the topics covered in the three tutorials: analysis of patients' medical data, data mining for computer security, and mistakes commonly made in data mining and ways to avoid them.
In an industry-government session, speakers discussed problems encountered in the telecommunications industry; the role of information visualization; and data mining in such diverse domains as aviation safety and security, performance of computer networks, and earth sciences. Applications of data mining were also the subject of three of the keynote talks: Sara Graves of the University of Alabama at Huntsville considered issues of data usability, David Page of the University of Wisconsin Medical School elaborated on data mining questions raised by biology data, and Ted Senator of DARPA discussed the challenges of "connecting the dots." The increasing importance of homeland security was also reflected in many of the conference workshop topics, which ranged from link analysis, counterterrorism, and privacy to data mining in resource-constrained environments. More traditional topics (bioinformatics, mining of scientific and engineering datasets, and high-performance and distributed mining) also continued to attract participants.
Conference attendees clearly welcomed the focus on applications, which led to animated discussions in the industry-government presentations. One conference speaker took John Elder's tutorial on common mistakes in data mining to heart: she edited her presentation in real time to point out the mistakes in her application domain, such as a lack of caution in sampling the data and a tendency to discount pesky cases even though they might reveal a larger problem in the data.
A new aspect of this year's conference was the increasingly important role of statistics in data mining. Keynote speaker Chris Bishop of Microsoft Research, Cambridge, discussed recent advances in Bayesian inference techniques, and several technical sessions focused on statistical techniques in data mining. This connection between statistics and data mining will be exploited further in the next conference in the series (scheduled for Newport Beach, April 21-23, 2005), which will be co-sponsored by the American Statistical Association and SIAM (http://www.siam.org/meetings/sdm05/). We encourage statisticians and data miners to submit papers and to attend the conference, helping us to narrow the gap between the two fields and bring together the best of both worlds.
The proceedings of SDM04, including the keynote talks and the presentations from the industry-government session, are available online at http://www.siam.org/meetings/sdm04/. It was a great conference, and all sessions were well attended, despite the close proximity to several local attractions. We look forward to working with our statistician colleagues to make the next conference even better!
Chandrika Kamath leads the Sapphire data mining project at Lawrence Livermore National Laboratory. She has played an active role in organizing the SIAM data mining conferences, serving as program co-chair for the 2003 conference and as conference co-chair for 2004 and 2005.