Alvin's Big Data Notebook : Anomaly Detection in K-means Clustering

1. Anomaly Detection in supervised and unsupervised learning.
Anomaly detection is often used to find fraud, detect network attacks, or discover problems in servers or other sensor-equipped machinery. The goal is to find new types of anomalies that have never been seen before: new forms of fraud, new intrusions, new failure modes for servers.

If we already knew what “anomalous” meant for a data set, we could easily detect anomalies in the data with supervised learning. An algorithm would receive inputs labeled “normal” and “anomaly” and learn to distinguish the two. However, anomalies are by nature unknown unknowns: new kinds keep appearing, and no labeled examples exist for them yet.

Unsupervised learning techniques are useful in these cases because they can learn what input data normally looks like, and can therefore detect when new data is unlike the data seen so far. Such data is worth further investigation.

2. K-means Clustering
Clustering algorithms try to find natural groupings in data. Data points that are like one another, but unlike others, are likely to represent a meaningful grouping, and so clustering algorithms try to put such data into the same cluster. Anything not close to a cluster could be anomalous.
Currently, the distance measure supported in MLlib is the Euclidean distance, which is defined for data points whose features are all numeric.
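
As a small illustration of the distance measure, here is a sketch in plain Python (not MLlib code). The function name euclidean_distance is made up for this example.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two numeric feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of features")
    # Square root of the sum of squared per-feature differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```

Because every feature is subtracted and squared, non-numeric features have to be encoded as numbers before this distance can be used.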

K-means partitions the data into k clusters, each represented by a centroid. The cluster centroid is defined to be the arithmetic mean of the points in the cluster, hence the name K-means.

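The algorithm picks k data points (one per cluster) as the initial cluster centroids. MLlib's default initialization, k-means||, tries to choose well-separated points rather than picking purely at random.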
Each data point is assigned to the nearest centroid.
For each cluster, a new cluster centroid is computed as the mean of the data points just assigned to that cluster.
The assignment and update steps are repeated until the centroids no longer change.
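
The steps above can be sketched in plain Python as follows. This is a minimal single-machine illustration, not the MLlib implementation. It assumes points are tuples of numbers, and it uses plain random sampling for the initial centroids instead of MLlib's smarter initialization. The names kmeans, mean_point and squared_distance are invented for the sketch.

```python
import random

def squared_distance(a, b):
    # Squared Euclidean distance; enough for finding the nearest centroid.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    # Arithmetic mean of a non-empty list of points, feature by feature.
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    # Assumption: random initial centroids (a simplification for this sketch).
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its assigned points.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            mean_point(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # centroids stopped changing
        centroids = new_centroids
    # clusters holds the result of the last assignment step.
    return centroids, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```

On this toy data the two groups are far apart, so the centroids settle on the mean of each group within a few iterations.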

The clustering model can be embedded in a Spark Streaming job to score new data as it arrives in near real time, and perhaps trigger an alert or review.
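
A rough sketch of the scoring idea, in plain Python rather than Spark Streaming: a point's anomaly score is its distance to the nearest centroid, and points beyond a threshold are flagged. The threshold rule here (a high quantile of the training distances) is an assumption for illustration, the function names are hypothetical, and a simple generator stands in for the stream. math.dist needs Python 3.8 or later.

```python
import math

def distance_to_nearest_centroid(point, centroids):
    # Anomaly score: Euclidean distance from the point to its closest centroid.
    return min(math.dist(point, c) for c in centroids)

def pick_threshold(training_points, centroids, quantile=0.99):
    # Assumption: treat distances above a high quantile of the training data as unusual.
    distances = sorted(
        distance_to_nearest_centroid(p, centroids) for p in training_points
    )
    index = min(len(distances) - 1, int(quantile * len(distances)))
    return distances[index]

def score_stream(new_points, centroids, threshold):
    # Stand-in for a streaming job: yields (point, score, is_anomaly) as points arrive.
    for point in new_points:
        score = distance_to_nearest_centroid(point, centroids)
        yield point, score, score > threshold

centroids = [(0.0, 0.0), (10.0, 10.0)]
training = [(0, 1), (1, 0), (10, 11), (11, 10)]
threshold = pick_threshold(training, centroids)

for point, score, is_anomaly in score_stream([(0.5, 0.5), (5, 5)], centroids, threshold):
    print(point, round(score, 2), "ANOMALY" if is_anomaly else "ok")
```

In a real job the flagged points would feed an alert or a review queue. The right threshold depends on the data and on how many false alarms are acceptable.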


Monday, 16 February 2015
