Mining Multilingual Documents

September 24, 2008


Figure 1. Interface for the prototype system QCS.

Dianne P. O'Leary and John M. Conroy

Answering questions from information found on the Web is hard in any case, but what if the relevant information is in a variety of different languages? Suppose, for example, that we want to determine what citizens of Kenya think about their new cabinet.

We will miss essential information if we consider only English-language documents. But challenges arise when we include non-English documents: It is not realistic to expect human translations of these documents, and existing machine-translation systems are far from perfect. Keywords can be translated incorrectly (e.g., "Kenyan box" for "Kenyan cabinet"), and translations are often ungrammatical. When punctuation is missing in source documents, context or referential information almost certainly gets lost. In spite of these challenges, we must rely on machine translations.

Even if we retrieve the relevant documents, we have not necessarily answered the question. Some questions---for example, Who is Michael Jordan?---yield such overwhelming amounts of information on one possible answer (in this case, the retired basketball player) that it is difficult to find other possibilities (the jazz musician, the mathematician, and the many other people with this name). Therefore, it is important that documents be clustered. And the clusters will be most useful if each is accompanied by a summary.

To answer a question, then, users need a tool that can retrieve a set of relevant documents, cluster the documents by subtopic, and summarize each cluster (in English). Our research has been directed toward the creation of a tool with these capabilities, and we have developed a system that can function without human intervention. What might be surprising, even to the SIAM community, is that retrieval, clustering, and summarization can all be handled by purely mathematical techniques.

Document retrieval can be formulated as a problem in linear algebra called "latent semantic analysis." Each document is represented by a column in a matrix, in which each row corresponds to a term in the dictionary. The magnitude of the entry in a given row and column indicates the importance of that term in that document. We use the singular value decomposition to form a low-rank approximation to the matrix; we can then rank documents for relevance to a question by forming the inner product between the vector representing the question and the matrix. Approximations other than the SVD (the rank-revealing QR, the semi-discrete decomposition, or a non-negative decomposition) might improve retrieval.

Document clustering is an optimization problem: Find cluster centers that minimize some measure of total distance from each document vector to its cluster center.

Once we have formed a document cluster, only two mathematical decisions remain: Choose sentences from the documents to include in the summary, and order the sentences in the summary. We score each sentence by an algorithm that estimates the probability that a human would choose that sentence, based on the estimated worth of its individual words. We then frame the problem of choosing from the collection of high-scoring sentences as a subset-selection problem. We use latent semantic analysis and a non-negative variant of the QR factorization to solve this problem. As to the ordering of the sentences in the summary, one approach is to solve a Traveling Salesperson Problem that forms a "tour" of all the high-scoring sentences, with the "distances" between sentences inversely proportional to the cosine of the angle between the term–sentence vectors.

With Daniel Dunlavy (of Sandia National Laboratories, Albuquerque) and Judith Schlesinger (of the IDA Center for Computing Sciences, in Bowie, Maryland), we have developed and implemented a prototype system called QCS (Query, Cluster, Summarize). The interface, shown in Figure 1, displays a query, summaries for the top-scoring clusters, and pointers to the documents in these clusters.

Arabic, with more than 30 varieties (Egyptian, Iraqi, Gulf Arabic, . . .), is a particularly challenging language for machine translation; even standardized newswire articles typically omit vowels, punctuation, and sentence deliminators. The Multilingual Summarization Evaluation (MSE), a competition devised by computational linguists, focuses on summarizing collections of Arabic and English newswire articles that have already been retrieved and clustered. In this evaluation, we found that our mathematics-based algorithms, coupled with simple linguistic rules to trim unnecessary words from the sentences, outperformed all the other competing automatic summarizing systems, each of which was based much more deeply in linguistics. In fact, our system even outscored three of four human summarizers in a measure called ROUGE-2 that scores common two-word pairs, as well as in an evaluation done by human judges.

Although machine-generated summaries are far from perfect, mathematics---particularly linear algebra and optimization---forms the basis for highly effective algorithmic approaches.

References
[1] I.S. Dhillon, S. Mallela, and D.S. Modha, Information-theoretic co-clustering, Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, August 24–27, 2003, 89–98; http://doi.acm.org/10.1145/956750.956764.

[2] D.M. Dunlavy, D.P. O'Leary, J.M. Conroy, and J.D. Schlesinger, QCS: A system for querying, clustering, and summarizing documents, Infor. Process. Manage., 43:6 (2007), 1588–1605; DOI: 10.1016/j.ipm.2007.01.003; based on a 2006 technical report.

[3] J.D. Schlesinger, D.P. O'Leary, and J.M. Conroy, Arabic/English multi-document summarization with CLASSY---The past and the future, CICLing Conference on Intelligent Text Processing and Computational Linguistics, Haifa, Israel, February 17–23, 2008; also in Comput. Linguist. Intel. Text Process., Lecture Notes in Computer Science, Springer, Berlin, 4919 (2008), 568–581.

Diane O'Leary is a professor in the Department of Computer Science and the Institute for Advanced Computer Studies at the University of Maryland, College Park. John Conroy is a research scientist at the IDA Center for Computing Sciences, in Bowie, Maryland.

Dianne O'Leary of the University of Maryland (right), co-author of the accompanying article on the mining of multilingual documents (based on her invited talk at this year's SIAM Conference on Data Mining), is pictured here with AWM president Cathy Kessel and SIAM president Cleve Moler on the occasion of an altogether different talk: the 2008 AWM–SIAM Sonia Kovalevsky Lecture.

A joint honor of the Association for Women in Mathematics and SIAM, the Kovalevsky Lecture is intended to highlight significant contributions of women to applied and computational mathematics.

In her lecture, "A Noisy Adiabatic Theorem: Wilkinson Meets Schrödinger's Cat," given in San Diego at the 2008 SIAM Annual Meeting, O'Leary departed from her usual linear algebra focus in favor of a PDE/physics theme in honor of Sonia Kovalevsky.

Schrödinger's cat, the central character in a thought experiment regarding quantum superposition, and James Wilkinson, famous for his work on the effects of perturbations on floating-point computations, were used to set the stage for a discussion of the effects of perturbation on quantum systems, and the resulting implications for adiabatic quantum computing.

Readers who missed what Moler described as "an absolutely great lecture" will be happy to know that it (and most of the other invited talks at the meeting) was taped and will be accessible as soon as possible at www.siam.org.




Renew SIAM · Contact Us · Site Map · Join SIAM · My Account
Facebook Twitter Youtube