Tuesday 12 August 2014

Notes on Clustering Algorithms in Mahout

1. The centre of the circle drawn around a cluster is called the centroid, or mean (average), of the cluster.
2. Find a function that quantifies the similarity between any two data points as a number.
3. Define a similarity metric.
4. Calculate the Euclidean distance between two points.
5. Define a weighting scheme: TF-IDF (term frequency-inverse document frequency).
6. Preprocess the data, use it to create vectors, and write the vectors to a SequenceFile as input for Mahout.
7. Choosing the initial cluster centres is an important step in K-means clustering.
8. A vector is an ordered list of numbers, which is all a point or a physics vector is anyway.
Vectors have a number of dimensions and a numeric value for each dimension.
9. The K-means algorithm readjusts the centroids at the end of each iteration by computing the mean of all points in each cluster.
10. A key factor in clustering is the choice of distance measure.
11. Besides the Euclidean distance, other options include the Manhattan, cosine and weighted distance measures.
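
To make notes 4, 9 and 11 concrete, here is a minimal sketch in plain Python. It does not use the Mahout API; the function names (euclidean, manhattan, cosine_distance, kmeans) and the sample points are my own assumptions for illustration. It shows three of the distance measures and a basic K-means loop that recomputes each centroid as the mean of its cluster.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute differences along each dimension."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 minus the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def kmeans(points, initial_centroids, distance=euclidean, max_iterations=10):
    """Basic K-means: assign points to the nearest centroid, then
    move each centroid to the mean of its assigned points."""
    centroids = [tuple(c) for c in initial_centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: distance(p, centroids[i]))
            clusters[nearest].append(p)

        # Update step: the new centroid is the mean of the cluster's points.
        new_centroids = []
        for members, old in zip(clusters, centroids):
            if members:
                dims = len(members[0])
                new_centroids.append(tuple(
                    sum(m[d] for m in members) / len(members)
                    for d in range(dims)))
            else:
                # Keep the old centroid if no point was assigned to it.
                new_centroids.append(old)

        if new_centroids == centroids:
            break  # centroids stopped moving, so the clustering is stable
        centroids = new_centroids
    return centroids, clusters


# Hypothetical sample data: two visibly separate groups of 2-D points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, [(1, 1), (8, 8)])
print(centroids)
```

Passing distance=manhattan or distance=cosine_distance to kmeans swaps the measure, which is a quick way to see how the choice in note 10 changes the resulting clusters.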



