Project Name: Detecting change points in data streams
Inventors: Dr. Ran Wolf, Dr. Morad Badama
Field Of The Invention
The invention relates to analysis of computerized data streams in general, and in particular to a computerized method for detecting change points in data streams.
Modern computing technology enables to gather and process large quantities of data in a variety of fields such as finance, commerce, operations etc. In some cases, efficient and quick analysis of such high speed data streams can be very valuable in order to detect a change in trends or condition as early as possible. Click-through stream mining in e-commerce, where the goal of the application is to predict shopping behavior or the effect of advertising, is one notable example. Additional examples of high speed data streams include computerized production environment monitoring applications whose goal is failure detection, traffic monitoring applications that give driving recommendations or on-line alerts, and power grid applications for detecting changes in load profiles and forecast. In all those scenarios analysis is best done on-line, at the speed at which the data is arriving, as a delay in analysis would often translate into a delayed response which can be costly.
In almost each of these scenarios, the data streams are affected in one way or another by human behavior, which itself changes in response to the physical world (time of day or season), fashion, fads, psychological reasons, action by trendsetters, current events, or the economy. Any data stream analysis algorithm must therefore take into account and respond to the non-stationary nature of data distribution.
Furthermore, in many application domains, the change in the underlying distribution of the data is the most interesting event of all. In e-commerce, it can be the result of a change in the competitive scenario. In computerized environment monitoring, it can signal the spread of a new type of failure—such as a new computer virus. Lastly, in stock trading it may signal the move from a bull to a bear market or vice versa. Changes in the mechanism which generates the data are denoted concept drifts. They are especially important because they evoke a need for new responses, different from those dictated by models which were learned before the change occurred.
Most data streams mining algorithms acknowledge the need to handle concept drifts. Two approaches are prevalent: One is to discard old observations. The other is to relearn the model, or parts of the model, when a concept drift becomes evident. However, most data stream mining algorithms rely on a decline in the performance of the model as an indication for concept drift detection. This method, while sometimes effective, has no statistical backing and therefore can be expected to yield inferior results comparing to statistical based change point detection algorithms.
From a statistical point of view, the change point detection problem can be solved optimally by computing the prefix of the current sequence of samples which maximizes the probability that the suffix was sampled from a different distribution. This can be done subject to a set of assumptions on the distribution of the samples (e.g., that it is Normal) and of changes (e.g., that their arrival rate is Poissonian). This approach is, however, impractical for a large number of samples. The state of the art in statistical change point detection on data streams is therefore to use the Page-Hinkley test (PHT), whose run-time is linear in the number of samples. In a streaming setup that would mean maintaining a test statistic of constant size and performing O(1) updates to it per new sample. Naturally, run-time performance like this can only be achieved at a significant cost in terms of false alarm rate, the number of samples needed to detect a change, and the accuracy at which the change point is detected.
The present invention relates to an alternative to PHT which relies on the best practice of solving the more informed problem of testing whether two sets of samples were derived from the same distribution. The algorithms of the invention make use of the unique convergence properties of two sample tests to probabilistically find the point which maximizes their value. That point closely approximates the change point. As both analysis and experiments show, the probabilistic algorithm of the invention maintains just O (1) candidate change points and their related aggregate information. Therefore, it only requires O (1) update operations per new sample, which is comparable with PHT. However, because the two sample tests used by the invention are much more powerful than PHT, and because the probabilistic algorithm of the invention does not degrade that power significantly, the algorithm of the invention is far better than PHT both in terms of false negative to false positive rate and in terms of the accuracy at which it locates to the change point. This superiority is further exemplified in a simplistic application in which the algorithm monitors the mean of a piece-wise stationary data stream at far better accuracy than the one achieved using PHT or others previous approaches.