Detecting Anomalies

In many fields, one needs to detect the occurrence of an anomalous event while monitoring a stream of data. I first came across this problem in the context of network security during my time at AT&T Labs. In network security, the goal is to quickly identify suspected malicious activity, but it is difficult to know in advance what a problematic pattern will look like, especially when dealing with high-dimensional data. On the other hand, it is typically easy to characterize normal (non-anomalous) data.

I've found that people sometimes mistakenly treat this problem as a 2-class classification problem. The 2-class approach is unsuitable because data corresponding to anomalous events may not arise from the same distribution each time an anomalous event occurs, and/or one is unlikely to have enough labelled samples of anomalous events to estimate the distribution of anomalous data properly. If one applies a traditional classifier to this problem, one is therefore likely to see an unacceptably high missed detection rate. Think of an SVM margin-maximizing hyperplane that positions itself between the nearest points from the non-anomalous class and the few available points from the anomalous class. Because points from the anomalous class are few and/or drawn from different distributions, the hyperplane resulting from the SVM optimization sits closer to the anomalous points than it should, and an anomalous point may be wrongly classified as non-anomalous. The sketch below illustrates this failure mode.
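As a toy illustration (the cluster locations and counts here are made up for the example), consider a 2-class SVM trained on plenty of normal points plus a handful of anomalies from a single past incident. It can miss a new anomaly that arrives from a different direction entirely:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Plenty of normal data around the origin, but only a handful of
# labelled anomalies, all from a single past incident near (5, 5).
X_normal = rng.normal(0.0, 1.0, size=(500, 2))
X_anom = rng.normal(5.0, 0.2, size=(5, 2))
X = np.vstack([X_normal, X_anom])
y = np.array([0] * 500 + [1] * 5)

clf = SVC(kernel="linear").fit(X, y)

# A new anomaly drawn from a *different* distribution, near (-5, -5):
# it falls on the "normal" side of the learned hyperplane.
print(clf.predict([[-5.0, -5.0]]))  # -> [0]: a missed detection
```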

One approach to dealing with this problem is to focus on the distribution of the non-anomalous data. The problem then boils down to determining whether a given new test data point is drawn from the distribution of non-anomalous data or not. Except for the simplest of problems, the distribution of non-anomalous data may well be multimodal, and we may not know the number of modes. The one-class SVM comes in handy here. The 1-class SVM basically attempts to estimate the density of the non-anomalous data non-parametrically by discovering a region in feature space that captures most (a user-supplied fraction) of the probability mass. It does this by finding a hyperplane (which is curved in the original space but linear in the kernel-induced feature space -- thanks to the standard SVM kernel trick) that forces the specified fraction of data points to lie inside closed regions formed by the hyperplane in the original space. Happily, the 1-class SVM does this non-parametrically and can discover multimodal densities. The anomaly detection problem then boils down to determining whether a given new test data point lies inside or outside the regions of high density of the non-anomalous data points.
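For the curious, the underlying optimization (Schölkopf et al.'s formulation, which is the one scikit-learn implements) looks like this, where $\Phi$ is the kernel feature map, $n$ is the number of training points, and $\nu$ is the user-supplied fraction mentioned above:

$$
\min_{w,\,\xi,\,\rho}\;\frac{1}{2}\|w\|^2 + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i - \rho
\quad\text{s.t.}\quad \langle w, \Phi(x_i)\rangle \ge \rho - \xi_i,\;\; \xi_i \ge 0,
$$

with decision function $f(x) = \operatorname{sgn}\big(\langle w, \Phi(x)\rangle - \rho\big)$. The parameter $\nu \in (0, 1]$ upper-bounds the fraction of training points that end up outside the region (treated as outliers) and lower-bounds the fraction of support vectors.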

The folks at scikit-learn have a good implementation of the 1-class SVM.  
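Here is a minimal sketch of how one might use it; the bimodal toy data and the parameter values (nu=0.05, gamma=0.5) are illustrative choices, not tuned recommendations:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Non-anomalous training data from a bimodal distribution:
# two well-separated Gaussian clusters in 2-D.
X_train = np.vstack([
    rng.normal([0.0, 0.0], 0.3, size=(500, 2)),
    rng.normal([4.0, 4.0], 0.3, size=(500, 2)),
])

# nu upper-bounds the fraction of training points treated as outliers
# (the user-supplied fraction above); gamma sets the RBF kernel width.
clf = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.5)
clf.fit(X_train)

# Two points from the normal modes and one far from both.
X_test = np.array([[0.1, -0.2], [4.2, 3.9], [2.0, 2.0]])
print(clf.predict(X_test))            # +1 = inlier, -1 = anomaly
print(clf.decision_function(X_test))  # signed distance to the boundary
```

With these settings, the point midway between the two clusters should be flagged as anomalous even though it sits "between" the normal modes -- exactly the multimodal behaviour that a single Gaussian fit to the normal data would miss.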

Constructing a good anomaly detector is still a bit of an art, because it requires proper selection of the features to monitor. In a very high-dimensional space, the curse of dimensionality is likely to make it difficult to build an anomaly detector with an acceptably small false positive rate.