Alvin's Big Data Notebook : Notes of Clustering Algorithm in Mahout

Tuesday, 12 August 2014

Notes of Clustering Algorithm in Mahout

1. The centre of a cluster (often drawn as a circle) is called the centroid, or mean (average), of the cluster.
2. Find a function that quantifies the similarity between any two data points as a number.
3. Define a similarity metric.
4. Calculate the Euclidean distance between two points.
5. Define weighting: TF-IDF (term frequency-inverse document frequency).
6. Preprocess the data, use it to create vectors, and write them to a SequenceFile as input for Mahout.
7. Marking the initial clusters is an important step in K-means clustering.
8. A vector is an ordered list of numbers, which is all a point or a physics vector is anyway.
Vectors have a number of dimensions and a numeric value for each dimension.
9. The K-means algorithm readjusts the centroids at the end of each iteration by computing the average (centre) of all points in each cluster.
10. A key factor in clustering is the choice of distance measure.
11. Besides Euclidean distance, other options include the Manhattan distance measure, the cosine distance measure and weighted distance measures.
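
The distance measures in notes 4, 10 and 11 can be sketched in a few lines of plain Python. This is only an illustration of the formulas, not Mahout's own API; the function names and the `weights` parameter are assumptions made for this example.

```python
import math

def euclidean(a, b):
    # Straight-line distance: square root of the summed squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute differences along each dimension.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 minus the cosine of the angle between the vectors.
    # Assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def weighted_euclidean(a, b, weights):
    # Euclidean distance where each dimension is scaled by a weight
    # (hypothetical weighting scheme, chosen for illustration).
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

p, q = [1.0, 2.0], [4.0, 6.0]
print(euclidean(p, q))   # 5.0
print(manhattan(p, q))   # 7.0
```

Cosine distance ignores vector length and looks only at direction, which is one reason it is often preferred for text vectors of different document lengths.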
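
The TF-IDF weighting in note 5 can be illustrated with one common textbook variant: term count multiplied by log(N / document frequency). Mahout's own implementation may use a different formula, so treat the function `tf_idf` and the sample documents below as assumptions for this sketch.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weighted.append({t: count * math.log(n_docs / df[t])
                         for t, count in tf.items()})
    return weighted

docs = [
    ["mahout", "clusters", "data"],
    ["big", "data"],
    ["mahout", "mahout", "vectors"],
]
tfidf_weights = tf_idf(docs)
print(tfidf_weights[2])
```

A term that occurs in every document gets weight 0 under this variant, which is the point of the weighting: common words contribute little to the similarity between documents.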
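
Notes 1, 7 and 9 together describe the K-means loop: start from initial centroids, assign each point to the nearest one, then move each centroid to the mean of its points. Below is a minimal plain-Python sketch of that loop, not Mahout's implementation; the function names, the fixed iteration count and the sample points are assumptions.

```python
import math

def mean(points):
    # Centroid of a cluster: the per-dimension average of its points.
    n = len(points)
    return [sum(dim) / n for dim in zip(*points)]

def kmeans(points, centroids, iterations=10):
    # Assumes iterations >= 1 and Euclidean distance (math.dist, Python 3.8+).
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [mean(c) if c else old
                     for c, old in zip(clusters, centroids)]
    return centroids, clusters

points = [[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]]
initial = [[1, 1], [8, 8]]  # initial centroids chosen by hand for the example
final_centroids, final_clusters = kmeans(points, initial)
print(final_centroids)
```

The result depends on the initial centroids, which is why note 7 calls that step important. A fuller implementation would usually also stop early once the centroids stop moving.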
