SIAM 2005 Data Mining Conference



	Model Based Detection of Distribution Changes in Temporal Data Sets Igor Cadez GCS, Inc. Probabilistic machine learning is inference via the correct statistical methods. In the traditional setup there are a training data set, a test data set, a set of probabilistic models and the inference goal. The model is "built" on the training data set using the likelihood objective function, validated on the test data set and the inference is "performed" using the built model. When the data source is dynamic, e.g., there is a data stream generating data continuously, the simple model validation on the test data is sufficient only if the data-generating mechanism does not change over time, which would be an exception rather than a rule. In most business applications the process responsible for data generation changes and evolves with evolution of the business itself and is additionally subject to unpredictable external influences. While in principle one can update or rebuild the model as more data becomes available, in the business world the model itself represents a parsimonious summary of the data such that rebuilding the model entails rebuilding the business rules derived from the model. For example, if the model is a mixture model of customers with each component representing a group with hand-picked characteristics and used in developing marketing campaigns, a new model would require not only another round of hand-picking the relevant characteristics of the new groups, but also a redefinition of the campaigns. In this work we define the problem of "change detection", i.e., the proper statistical method for detecting a point in time when the existing model applied to a stream of data shows a statistically-significant deviation from the earlier data, thus warranting building of a new model. In the process, we introduce a novel one-dimensional probabilistic summary of the data, the "logP" distribution, which acts as a "signature" of any complex multivariate distribution, with a rigorous proof that two multivariate distributions are equal if and only if their signatures are equal. With this newly developed tool we show how one in practice performs change detection in linear time and with minimal computational resources. We also show real-life examples of the implementation of the change detection framework and its performance. In the end, we touch on the follow-up business issue of finding out what the detected change is once it has been detected.
	Last Edited: 2/14/05 DHTML Menus by http://www.milonic.com/

Model Based Detection of Distribution Changes in Temporal Data Sets