SIAM International Conference on Data Mining, April 11-13, 2002, Hyatt Regency, Crystal City at Ronald Reagan National Airport, Arlington, VA

Tutorials

Thursday, April 11        
10 am - noon Data Mining in the Face of Contaminated and Incomplete Records
 Ronald K. Pearson, ETH Zurich
3 pm - 6 pm* Enterprise Customer Data Mining for E-Business

Usama Fayyad, Co-Founder, President & CEO, digiMine, Inc.
Neal Rothleder, Director of Analytic Technology, digiMine, Inc.
Paul Bradley, Data Mining Development Lead, digiMine, Inc.

Friday, April 12          
10 am - noon Problems, Solutions and Research in Data Quality

Tamraparni Dasu and Theodore Johnson, AT&T Labs Research

3 pm - 6 pm* Text Mining for Bioinformatics

Hinrich Schütze, Novation Biosciences

* A catered break will be held from 4 - 4:30 pm.

Abstracts and Biographical Information

Title: Data Mining in the Face of Contaminated and Incomplete Records
Presented by: Ronald K. Pearson, ETH Zurich
Abstract:

This tutorial has three main objectives. The first is to provide a general overview of the sources and extent of contaminated and missing records in the large datasets for which highly automated data mining procedures are intended. The second objective is to clearly demonstrate that the consequences of simply ignoring these data anomalies are often unacceptable, either because important features in the dataset are missed altogether, or because these features are grossly misinterpreted. Finally, the third objective is to provide a broad overview of some of the techniques that have been proposed by various authors to address these problems.

Specific topics covered include the important practical distinction between noise, to which most data analysis procedures are somewhat resistant by design, and outliers, which often cause dramatic failures. In addition, distinctions are drawn between ignorable missing data, which generally increases the variability of computed results, and non-ignorable missing data, which can introduce significant biases and fundamentally change the conclusions of our analysis. Conversely, both outliers and non-ignorable missing data often correspond to what Zhong et al. have called peculiar data, representing precisely those observations we are most interested in finding in a dataset.

Examples are presented to illustrate the nature and extent of outliers and missing data in real datasets (typical concentration estimates range from a few percent to ~30%), and their influence is illustrated for a wide variety of analytical methods.  Four key ideas for dealing with these issues are then discussed in detail: data cleaning (how do we detect outliers in practice?), data imputation (how do we replace missing or anomalous data values once we have found them?), the application and/or development of outlier-resistant analysis procedures, and sensitivity analysis, which attempts to detect influential observations or sets of observations in a dataset.
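Two of the four ideas above can be given a brief flavor in code. The sketch below is our own toy illustration, not material from the tutorial: the function names, the data, and the particular outlier rule (the MAD-based Hampel identifier, one common outlier-resistant alternative to the classical 3-sigma edit rule) are all choices made here for concreteness, paired with the simplest form of imputation, median replacement.

```python
from statistics import median

def hampel_outliers(values, t=3.0):
    """Flag outliers with the MAD-based Hampel identifier: a point is
    an outlier if it lies more than t scaled MADs from the sample
    median.  Unlike the classical 3-sigma rule, the median and MAD
    are themselves outlier-resistant, so gross errors do not mask
    one another."""
    m = median(values)
    mad = median(abs(x - m) for x in values)
    s = 1.4826 * mad  # scales the MAD to estimate sigma under normality
    return [abs(x - m) > t * s for x in values]

def impute_with_median(values, flags):
    """Replace flagged (anomalous or missing) values with the median
    of the remaining clean observations -- the simplest imputation."""
    clean = [x for x, bad in zip(values, flags) if not bad]
    fill = median(clean)
    return [fill if bad else x for x, bad in zip(values, flags)]

readings = [9.8, 10.1, 10.0, 9.9, 55.0, 10.2]  # one gross error
flags = hampel_outliers(readings)
print(flags)                              # flags only the 55.0 reading
print(impute_with_median(readings, flags))
```

Because the clean readings cluster tightly around 10, the scaled MAD is small and the gross error at 55.0 stands far outside the Hampel threshold; a mean-and-standard-deviation rule applied to the same data would be pulled strongly toward the outlier it is trying to detect.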
Biographical Information:

Ronald K. Pearson received his PhD in electrical engineering from M.I.T. in 1982, after which he joined the DuPont Company, where his activities included the exploratory analysis of large sets of manufacturing process operating data. In 1997, Dr. Pearson joined the Institut für Automatik at ETH in Zürich, where he continued to work in the areas of exploratory data analysis and the development of discrete-time dynamic models for computer control. Currently, he is a visiting professor with the Tampere International Center for Signal Processing at Tampere University of Technology in Finland.

Title: Enterprise Customer Data Mining for E-Business
Presented by:

Usama Fayyad, Co-Founder, President & CEO, digiMine, Inc.
Neal Rothleder, Director of Analytic Technology, digiMine, Inc.
Paul Bradley, Data Mining Development Lead, digiMine, Inc.

Abstract:

Data mining methods have their origins in a variety of fields: Statistics, Databases, Pattern Recognition, AI, Visualization, High-Performance Computing, and Information Retrieval. Successful deployment of these technologies to e-business enterprise data requires: data warehouse construction, mechanisms to efficiently update the warehouse, integration of data mining technologies, and delivery of results in a form consumable by business end-users.

In an e-business enterprise environment, the data warehouse problem is further magnified by the critical need to integrate web-log data, user profile data, product catalog information, transaction and sales data, advertising campaign data, datasets from legacy systems, etc. Once the data warehouse is in place, the next steps involve integrating analytical and data mining technology efficiently with the warehouse. A key challenge for an e-business enterprise is delivering timely, interesting, actionable results to an end-user whose expertise is marketing, sales, business development, or merchandising rather than data mining and advanced analytics.

Biographical Information:

Usama Fayyad is a co-founder of digiMine, Inc. and has served as President and CEO since its inception in March 2000. Prior to digiMine, Usama founded and led Microsoft Research's Data Mining & Exploration (DMX) Group from 1995 to 2000. His work there included the development of data mining prediction components for Microsoft Site Server (Commerce Server 3.0 and 4.0). From 1989 to 1995, Usama was at the Jet Propulsion Laboratory (JPL), California Institute of Technology, where he founded the Machine Learning Systems Group and developed data mining systems for the analysis of large scientific databases. During that time he received the most distinguished excellence award from Caltech/JPL and a U.S. Government Medal from NASA. He remained affiliated with JPL as Distinguished Visiting Scientist after joining Microsoft. Usama holds a Ph.D. in engineering from the University of Michigan, Ann Arbor (1991). He served as Program Co-Chair of KDD-94 and KDD-95 and as General Chair of KDD-96 and KDD-99, and serves as Editor-in-Chief of the journal Data Mining and Knowledge Discovery and of SIGKDD Explorations.

Neal Rothleder is Director of Analytic Technology at digiMine, Inc. His focus is on delivering powerful, scalable data mining solutions to business users in an intuitive, actionable framework. His research interests include machine learning approaches to data mining, with recent emphasis on making academic research work on real-world problems and on incorporating domain knowledge into data mining. Prior to joining digiMine, Dr. Rothleder was a Lead Engineer with the MITRE Corporation, working on research and development in data mining technologies and applications. While there, he worked on projects in network intrusion detection, aviation safety, and a variety of fraud detection scenarios. Dr. Rothleder has held adjunct faculty appointments at the University of Michigan and George Mason University. He holds a Ph.D. and an M.S. in Computer Science and Engineering from the University of Michigan.

Paul Bradley ([email protected]) is Data Mining Development Lead at digiMine. His primary focus is on integrating data mining technology into digiMine's service offering. Prior to joining digiMine, he was a researcher in the Data Management, Exploration and Mining Group at Microsoft Research, where he worked on data mining algorithms and on data mining components in Microsoft SQL Server and Commerce Server. His research interests include classification and clustering algorithms, their underlying mathematical problem formulations, and issues related to scalability. He received his Ph.D. from the University of Wisconsin in 1998 on the topic of mathematical programming and data mining. Paul serves as Associate Editor of SIGKDD Explorations and was KDD-2001

Title: Problems, Solutions and Research in Data Quality
Presented by:

Tamraparni Dasu and Theodore Johnson
AT&T Labs Research

Abstract:

Data quality is inextricably linked with data mining. Data quality problems arise during the process of mining, and the quality of the data in turn determines the importance and value of the results that mining unearths. Data quality has many facets: management of processes and practices; statistical detection of glitches; and storage, monitoring, maintenance and profiling of data. Recent work has produced tools and algorithms for assuring data quality in datasets, but data quality has been studied piecemeal by disciplines and communities that seldom communicate, and many specific problems and solutions are cited in an ad hoc fashion. It is nevertheless possible to broadly categorize the data quality problems and propose a general class of solutions. Our aim in this tutorial is to bring together the different threads to:

  • Define and update notions of data quality.
  • Articulate data quality problems as applicable to contemporary data sets (federated, massive, streaming) and their uses (data mining, interactive decision making).
  • Outline major existing areas and algorithms (including but not limited to duplicate removal, string matching, finding keys, functional dependencies, and join paths), statistical techniques (missing values, outlier detection, departures from assumptions and models, goodness of fit), tools from the research community (Bellman, Potter's Wheel, AJAX), and commercial vendors (Vality, Trillium, Evoke, etc.).
  • Identify research questions and open problems.
We will illustrate the data quality aspects discussed above by working through case studies drawn from different contexts.
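To give a small taste of one technique named above, approximate string matching for duplicate detection can be sketched with Python's standard library alone. This is our own toy illustration, not code from the tutorial or from any of the tools and vendors listed; the records and the similarity threshold are invented for the example.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Normalized similarity in [0, 1], case-insensitive
    (difflib's matching-blocks ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_duplicates(records, threshold=0.85):
    """Report all pairs of records whose strings are near-identical --
    the core of approximate duplicate detection.  Production
    data-quality tools do this at scale with indexing and blocking
    rather than the brute-force pairwise comparison shown here."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i], records[j]) >= threshold:
                pairs.append((records[i], records[j]))
    return pairs

customers = ["AT&T Labs Research", "ATT Labs Research",
             "digiMine, Inc.", "SIAM"]
print(find_duplicates(customers))   # the two AT&T spellings match
```

The two spellings of the AT&T record differ by a single character, so their similarity is well above the threshold, while the unrelated records fall far below it; real deduplication systems combine such string measures with field-level matching rules and blocking to avoid comparing every pair.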
Biographical Information:

Tamraparni Dasu received a B.A. (Honors) in Mathematical Statistics from Delhi University in 1982, followed by a Masters in Mathematics from I.I.T. (Indian Institute of Technology), New Delhi in 1984. She finished her Ph.D. in Statistics at the University of Rochester in 1990, and joined the Statistical Modeling department at AT&T Bell Laboratories the same year. She moved to the Machine Learning and Information Retrieval Research department in 1995, and then to the Information Mining research center of AT&T Labs Research in 2000, where she currently works.

Theodore Johnson received a B.S. in Mathematics from Johns Hopkins University in 1986, and a Ph.D. in Computer Science from the Courant Institute of New York University. From 1990 through 1995, Theodore was an Assistant Professor in the CISE department of the University of Florida, and an Associate Professor in 1996. In 1996, Theodore joined the Database Research department of AT&T Labs Research, where he currently works.

Title: Text Mining for Bioinformatics
Presented by:

Hinrich Schütze, Novation Biosciences

Abstract: Our goal is to make this tutorial a practical guide to using text mining in bioinformatics, while at the same time highlighting some of the interesting research issues that arise when mining techniques are applied in this field. Participants who work in bioinformatics (e.g., in drug discovery or at pharmaceutical companies) will be able to broaden the set of tools they are comfortable with; data miners currently working on non-biological problems will learn about one of the most exciting areas of application of data discovery and analysis techniques. Previous exposure to biology will be helpful, but the tutorial will be accessible to those who have no biology background. We will assume familiarity with basic statistical and probabilistic concepts. This tutorial is joint work with Russ B. Altman, MD, Associate Professor in the Medical Informatics Group at the Stanford University Medical Center.

Biographical Information:

Hinrich Schütze received a Ph.D. in Natural Language Processing from Stanford University in 1995 and then joined the Xerox Palo Alto Research Center, where he developed a scalable approach to semantic analysis of natural language based on mining of association data. He then co-founded Outride, a search personalization company, and led the development of personalization software that learns user preferences from surfing behavior. He is the author of a best-selling textbook on data-driven natural language processing (with Chris Manning, MIT Press) and of a dozen issued and pending patents. Dr. Schütze is currently CTO of Novation Biosciences, a bioinformatics company focused on text and data mining of biological data. He is also Consulting Faculty at Stanford.


© 2001 Society for Industrial & Applied Mathematics
Last Updated: 12/04/01