Proceedings of the Fourth Workshop on Text Mining
The proliferation of digital computing devices and their use in communication continues to drive demand for systems and algorithms capable of mining textual data. The development of techniques for mining unstructured, semi-structured, and fully structured textual data has therefore become important in both academia and industry. In response, a one-day workshop on text mining was held on April 22, 2006, in conjunction with the SIAM Sixth International Conference on Data Mining, to bring together researchers from a variety of disciplines to present their current approaches and results in text mining.

The workshop surveyed the emerging field of text mining: the application of machine learning techniques in conjunction with natural language processing, information extraction, and algebraic/mathematical approaches to computational information retrieval. Many issues are being addressed in this field, ranging from the development of new document classification and clustering models to novel approaches for topic detection and tracking. The goal of the workshop was to provide a venue for researchers to share initial approaches and preliminary results of recent research in text mining.

Fifty-four authors representing industry, academia, and national research laboratories from 13 different countries submitted a total of 25 papers. After careful review, ten papers and four posters were selected for publication and presentation. Following the success of the previous three editions of the SIAM Text Mining Workshops (TM 2001, TM 2002, TM 2003), this fourth edition is intended to generate interest and provide insight into the state of the art of text mining.
Michael W. Berry
Malu Castellanos
Special thanks to Murray Browne at the University of Tennessee, Knoxville, for his assistance in preparing this volume, and to the members of the Program Committee for their diligent efforts in reviewing the 25 manuscripts submitted. The workshop cover image was designed by Jeff Romaniuk at the University of Tennessee, Knoxville. We also appreciate the support of our sponsors, PureDiscovery of Dallas, Texas, and SAS of Cary, North Carolina.
SAS is the market leader in providing a new generation of business intelligence software and services that create true enterprise intelligence. SAS solutions are used at 40,000 sites, including 96 of the top 100 companies on the Fortune Global 500®, to develop more profitable relationships with customers and suppliers; to enable better, more accurate and informed decisions; and to drive organizations forward. SAS is the only vendor that completely integrates leading data warehousing, analytics and traditional BI applications to create intelligence from massive amounts of data. For nearly three decades, SAS has been giving customers around the world The Power to Know®.
PureDiscovery Corporation, based in Dallas, Texas, is a privately held software company. PureDiscovery is the creator of EXgrid, the intelligent grid architecture that transforms existing data repositories into dynamic knowledge and innovation networks. EXgrid creates universal access to virtually any information without disrupting or replacing the client's existing network infrastructure. Organizations that emphasize research, intellectual property, intelligence gathering, collaboration, and knowledge sharing can benefit significantly from the use of EXgrid.
Michael W. Berry, University of Tennessee
Rosie Jones, Yahoo Research Labs
Abstract: An important problem facing many governmental and industrial organizations is that of discovering the description of a recurring phenomenon in text documents. In many applications, the recurring phenomenon has a low frequency of occurrence, which complicates its discovery. We call such low-frequency events that tend to co-occur “recurring anomalies.” Conventional text mining methods tend to overlook these low-frequency events. The problem of discovering recurring anomalies arises in numerous application domains, including fraud, counter-terrorism and security, analysis of complex systems, and warranty and maintenance reports. This talk describes the problem in some detail from a mathematical perspective and then discusses past and current work in the field. Using text reports on complex aerospace systems, we compare the performance of several existing methods with that of novel text mining methods that we have developed.
Session I: Topic Detection and Tracking
Session II: Text Classification
Session III: Poster Presentations
Posters will be on display throughout the workshop, and in this session presenters will briefly summarize their work.
Using Query History to Prune Query Results
OPTICS on Text Data: Experiments and Test Results
ZIP and Data Document Visualization
Session IV: Clustering Algorithms
Session V: LSA and Vector Space Models