Model Based Detection of Distribution Changes in Temporal Data Sets
Igor Cadez
GCS, Inc.
Probabilistic machine learning is inference carried out with sound
statistical methods. The traditional setup comprises a training data
set, a test data set, a set of probabilistic models, and an inference
goal. The model is "built" on the training data set using the likelihood
objective function and validated on the test data set, and inference is
then "performed" using the built model. When the data source is dynamic,
e.g., a data stream that generates data continuously, simple model
validation on the test data is sufficient only if the data-generating
mechanism does not change over time, which is the exception rather than
the rule. In most business applications the process that generates the
data evolves with the business itself and is additionally subject to
unpredictable external influences. While in principle one can update or
rebuild the
model as more data becomes available, in the business world the model
itself serves as a parsimonious summary of the data, so rebuilding the
model entails rebuilding the business rules derived from it. For
example, if the model is a mixture model of customers
with each component representing a group with hand-picked
characteristics and used in developing marketing campaigns, a new model
would require not only another round of hand-picking the relevant
characteristics of the new groups, but also a redefinition of the
campaigns.
In this work we define the problem of "change detection", i.e., a
statistically sound method for detecting the point in time at which the
existing model, applied to a stream of data, shows a statistically
significant deviation from the earlier data, thus warranting the
building of a new model. In the process, we introduce a novel
one-dimensional probabilistic summary of the data, the "logP"
distribution, which acts as a "signature" of any complex multivariate
distribution, with a rigorous proof that two multivariate distributions
are equal if and only if their signatures are equal. With this newly
developed tool we show how change detection can be performed in
practice in linear time and with minimal computational resources. We
also show real-life examples of the implementation of the change
detection framework and its performance. Finally, we touch on the
follow-up business issue of determining what the change is once it has
been detected.
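
The idea of comparing one-dimensional "logP" values instead of the full
multivariate data can be illustrated with a minimal sketch. Everything
in it is an assumption made for illustration and is not taken from the
talk: the "existing model" is a fixed diagonal Gaussian, the comparison
uses a two-sample Kolmogorov-Smirnov statistic with the standard
asymptotic 5% critical value, and the names log_p, ks_statistic, and
change_detected are hypothetical. The sketch sorts the samples, so it
does not reproduce the linear-time procedure mentioned above.

```python
import math
import random


def log_p(x, means, sigmas):
    """Log-density of point x under a diagonal Gaussian (stand-in for the existing model)."""
    return sum(
        -0.5 * math.log(2 * math.pi * s * s) - (xi - m) ** 2 / (2 * s * s)
        for xi, m, s in zip(x, means, sigmas)
    )


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        v = min(a[i], b[j])
        # Advance both samples past every value <= v, then compare the CDFs at v.
        while i < len(a) and a[i] <= v:
            i += 1
        while j < len(b) and b[j] <= v:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def change_detected(baseline_logp, window_logp, c_alpha=1.358):
    """Flag a change when the KS statistic exceeds the asymptotic threshold.

    c_alpha=1.358 corresponds to a 5% significance level (an assumed choice).
    """
    n, m = len(baseline_logp), len(window_logp)
    threshold = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(baseline_logp, window_logp) > threshold


def sample(rng, means, sigmas, count):
    """Draw `count` points from a diagonal Gaussian."""
    return [[rng.gauss(m, s) for m, s in zip(means, sigmas)] for _ in range(count)]


# Assumed toy setup: a 3-dimensional model, a baseline drawn from it,
# one window from the same source and one from a shifted source.
rng = random.Random(0)
MEANS, SIGMAS = [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]

baseline = [log_p(x, MEANS, SIGMAS) for x in sample(rng, MEANS, SIGMAS, 2000)]
same = [log_p(x, MEANS, SIGMAS) for x in sample(rng, MEANS, SIGMAS, 500)]
shifted = [log_p(x, MEANS, SIGMAS) for x in sample(rng, [1.0, 1.0, 1.0], SIGMAS, 500)]

print("KS, unchanged source:", ks_statistic(baseline, same))
print("KS, shifted source:  ", ks_statistic(baseline, shifted))
print("change flagged for shifted source:", change_detected(baseline, shifted))
```

Only the scalar logP values of each window need to be kept, which is
what makes a one-dimensional summary attractive for streaming data. The
test statistic and threshold here are placeholders, and any two-sample
test on the logP values could be substituted.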