Wednesday 24 December 2014

Anomaly Detection with K-means Clustering


1. Supervised vs. unsupervised learning

In order to predict unknown values for new data, we had to know the target value for many previously seen examples.
Classifiers can only help if we, the data scientists, already know what we are looking for, and can provide plenty of examples where a given input produced a known output.
These are collectively known as supervised learning techniques, because their learning process is given the correct output value for each example in the input.

However, there are problems in which the correct output is unknown for some or all examples.
Unsupervised learning techniques can help here. These techniques do not learn to predict a target value, since none is available. They can, however, learn structure in the data: find groupings of similar inputs, or learn which types of input are likely to occur and which are not.

2. Anomaly detection

Anomaly detection is often used to find fraud, detect network attacks, or discover problems in servers or other sensor-equipped machinery. In these cases, it's important to be able to find new types of anomalies that have never been seen before: new forms of fraud, new intrusions, new failure modes for servers. An anomaly that has been observed and understood is no longer an anomaly.

3. K-means

K-means attempts to detect k clusters in a data set, where k is a value given by the data scientist; choosing a good value for k will be a central topic. A clustering can be considered good if each data point is near its closest centroid.
It is common to use simple Euclidean distance to measure the distance between data points with K-means, and this is the only distance function supported by Spark MLlib as of this writing.
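To make this concrete, here is a minimal sketch of clustering with Spark MLlib's RDD-based API. It assumes a running SparkContext named sc and a hypothetical CSV file of numeric features; the values k = 5 and the iteration cap of 20 are arbitrary placeholders, not recommendations.

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Hypothetical input: one comma-separated row of numeric features per line.
    val data = sc.textFile("hdfs:///path/to/features.csv")
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
      .cache()

    // Train a model with k = 5 clusters and at most 20 iterations.
    // MLlib measures point-to-centroid distance with Euclidean distance.
    val model = KMeans.train(data, 5, 20)

    // Inspect the learned cluster centroids.
    model.clusterCenters.foreach(println)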

Each cluster has a center, called the cluster centroid, which is defined to be the arithmetic mean of the points in the cluster; hence the name K-means. The algorithm proceeds as follows:
  1. To start, the algorithm picks some data points as the initial cluster centroids (for example at random, or with a smarter seeding scheme such as k-means||, which Spark MLlib uses by default).
  2. Then each data point is assigned to the nearest centroid. 
  3. Then for each cluster, a new cluster centroid is computed as the mean of the data points just assigned to that cluster. 
  4. This process is repeated until the cluster assignments stop changing, or the centroids move by less than some small tolerance; a minimal sketch of steps 2 and 3 follows this list.
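For illustration, here is a small, self-contained Scala sketch of steps 2 and 3, representing each point as an Array[Double] and using Euclidean distance. It is meant to show the mechanics of one iteration, not to replace MLlib's optimized implementation.

    type Point = Array[Double]

    // Euclidean distance between two points of equal dimension.
    def distance(a: Point, b: Point): Double =
      math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

    // Step 2: assign each point to the index of its nearest centroid.
    def assign(points: Seq[Point], centroids: Seq[Point]): Map[Int, Seq[Point]] =
      points.groupBy(p => centroids.indices.minBy(i => distance(p, centroids(i))))

    // Step 3: recompute each centroid as the mean of its assigned points
    // (clusters that received no points are simply dropped in this sketch).
    def recompute(clusters: Map[Int, Seq[Point]]): Seq[Point] =
      clusters.toSeq.sortBy(_._1).map { case (_, pts) =>
        val dims = pts.head.length
        Array.tabulate(dims)(d => pts.map(_(d)).sum / pts.size)
      }

Iterating assign and recompute until the assignments stabilize yields the final centroids. A data point's distance to its nearest centroid can then serve as an anomaly score, which is the link back to anomaly detection.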
