Alvin's Big Data Notebook : Mahout K-Mean Clustering for Web Log Analysis

1. Vectorize the Input into a SequenceFile

A SequenceFile starts with a header that records the key and value class names, the version, the file format, metadata about the file, and a sync marker that denotes the end of the header. The header is followed by the records, which hold the key/value pairs and their respective lengths.

We can write a mapper to read a file in HDFS and convert each line (delimited by ",") to a vector.
To be able to trace the clusteredPoints back to the items after k-means, we wrap each vector in a NamedVector that uses the item id as its name, and we also use that name as the output key.

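import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.VectorWritable;

public class VectorizeMapper extends Mapper<LongWritable, Text, Text, VectorWritable> {

  // Columns per line: the item name followed by NUM_COLUMNS - 1 numeric features.
  public static final int NUM_COLUMNS = 158;

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    // Split the line once instead of once per column.
    String[] columns = value.toString().split(",");
    String itemName = columns[0];

    double[] features = new double[NUM_COLUMNS - 1];
    for (int i = 1; i < NUM_COLUMNS; ++i) {
      features[i - 1] = Double.parseDouble(columns[i]);
    }

    // The NamedVector carries the item name through the clustering job.
    NamedVector namedVec = new NamedVector(new DenseVector(features), itemName);

    context.write(new Text(namedVec.getName()), new VectorWritable(namedVec));
  }
}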
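
The parsing step inside the mapper can be tried without a Hadoop cluster. The following is a minimal sketch in plain Java, assuming the same "name,f1,f2,..." line layout; the class CsvLineParser, its ParsedLine holder, and the column-count check are hypothetical helpers introduced here for illustration and are not part of Mahout or of the original job.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of Mahout: parses "name,f1,f2,..." lines. */
final class CsvLineParser {

  /** Holds the item name and its numeric features. */
  static final class ParsedLine {
    final String name;
    final double[] features;

    ParsedLine(String name, double[] features) {
      this.name = name;
      this.features = features;
    }
  }

  /**
   * Splits the line once and converts every column after the first to a double.
   * Throws IllegalArgumentException when the column count differs from numColumns.
   */
  static ParsedLine parse(String line, int numColumns) {
    String[] columns = line.split(",");
    if (columns.length != numColumns) {
      throw new IllegalArgumentException(
          "expected " + numColumns + " columns but found " + columns.length);
    }
    double[] features = new double[numColumns - 1];
    for (int i = 1; i < numColumns; i++) {
      features[i - 1] = Double.parseDouble(columns[i].trim());
    }
    return new ParsedLine(columns[0].trim(), features);
  }

  public static void main(String[] args) {
    // A made-up 4-column line; the input in this post has 158 columns.
    ParsedLine parsed = parse("item_42,1.0,0.5,3", 4);
    System.out.println(parsed.name + " -> " + Arrays.toString(parsed.features));
  }
}
```

Checking the column count up front is a design choice of this sketch: a short or malformed line then fails with a clear message instead of an ArrayIndexOutOfBoundsException in the middle of the job.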

2. K-means Clustering

$ mahout kmeans \
  -i input_vector \
  -c clusters \
  -o cluster_output \
  -k 10 \
  -cd 0.1 \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -x 10 \
  -cl \
  -ow

The options are:

-i input_vector: the input folder, which contains the vectors in SequenceFile format.
-c clusters: the input centroids, as Vectors. Must be a SequenceFile of Writable, Cluster/Canopy. If -k is also specified, a random set of vectors is selected and written out to this path first.
-o cluster_output: the output folder, which will contain the clusteredPoints and clusters-*-final folders.
-k 10: a random selection of k vectors is chosen as the centroids and written to the clusters input path. The run() method calls the RandomSeedGenerator before calling the job() method to run the iterations.
-cd 0.1: the convergence delta. If in a particular iteration the cluster centers do not move by more than this threshold, no further iterations are done. Default is 0.5.
-dm org.apache.mahout.common.distance.CosineDistanceMeasure: the similarity measure, given as the class name of the DistanceMeasure. Default is SquaredEuclidean.
-x 10: the maximum number of iterations.
-cl: if present, assign the input vectors to the final clusters after the iterations have finished. This flag is needed to get clusteredPoints. All the clustering routines observe it, so the step of clustering the points is uniformly optional.
-ow: overwrite the output folder.
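
To make the roles of -k, -cd, -dm and -x more concrete, here is a minimal in-memory sketch in plain Java. It is not Mahout's implementation: the class ToyKMeans, its method names, the seed parameter, the use of the same cosine distance for the convergence check, and the convention of returning distance 1 for an all-zero vector are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Toy in-memory k-means; an illustration only, not Mahout's implementation. */
final class ToyKMeans {

  /** Cosine distance 1 - a.b / (|a| |b|); assumed to be 1 when either vector is all zeros. */
  static double cosineDistance(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    if (normA == 0 || normB == 0) {
      return 1.0;
    }
    return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  /** Returns the index of the centroid closest to the point (ties go to the lower index). */
  static int nearest(double[] point, List<double[]> centroids) {
    int best = 0;
    for (int c = 1; c < centroids.size(); c++) {
      if (cosineDistance(point, centroids.get(c)) < cosineDistance(point, centroids.get(best))) {
        best = c;
      }
    }
    return best;
  }

  /** Runs k-means on the points and returns the final centroids. Requires k <= points.size(). */
  static List<double[]> cluster(List<double[]> points, int k, double convergenceDelta,
                                int maxIterations, long seed) {
    // Like -k: pick k random input vectors as the initial centroids.
    List<double[]> shuffled = new ArrayList<>(points);
    Collections.shuffle(shuffled, new Random(seed));
    List<double[]> centroids = new ArrayList<>();
    for (double[] p : shuffled.subList(0, k)) {
      centroids.add(p.clone());
    }

    int dims = points.get(0).length;
    for (int iter = 0; iter < maxIterations; iter++) { // like -x
      double[][] sums = new double[k][dims];
      int[] counts = new int[k];
      for (double[] p : points) {
        int c = nearest(p, centroids);
        counts[c]++;
        for (int d = 0; d < dims; d++) {
          sums[c][d] += p[d];
        }
      }
      boolean converged = true;
      for (int c = 0; c < k; c++) {
        if (counts[c] == 0) {
          continue; // an empty cluster keeps its previous centroid
        }
        double[] updated = new double[dims];
        for (int d = 0; d < dims; d++) {
          updated[d] = sums[c][d] / counts[c];
        }
        // Like -cd: a centroid that moved by more than the delta forces another pass.
        if (cosineDistance(centroids.get(c), updated) > convergenceDelta) {
          converged = false;
        }
        centroids.set(c, updated);
      }
      if (converged) {
        break;
      }
    }
    return centroids;
  }
}
```

A call such as ToyKMeans.cluster(points, 10, 0.1, 10, 42L) would correspond loosely to the -k 10 -cd 0.1 -x 10 settings above, with the seed value being arbitrary.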

3. Cluster Dumping to Readable Output

The dump lists each cluster together with its points.

$ mahout clusterdump \
  -dt sequencefile \
  -i cluster_output/clusters-3-final \
  -p cluster_output/clusteredPoints \
  -of CSV \
  -o result.csv

The options are:

-dt sequencefile: the dictionary file type (text or sequencefile).
-i cluster_output/clusters-3-final: the HDFS folder that holds the final clusters (the number in the name depends on how many iterations were run).
-p cluster_output/clusteredPoints: the directory containing the points SequenceFiles that map input vectors to their cluster.
-of CSV: the output format: TEXT, CSV or GRAPH_ML. If CSV is selected, the result lists the clusters with their items, e.g. cluster id, item1, item2, ..., item_n. Otherwise it contains more detailed information.
-o result.csv: write the result to the local file system instead of HDFS.
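
Once result.csv is on the local disk, the "cluster id, item1, item2, ..." lines can be grouped with a few lines of plain Java. This is only a sketch: the class ClusterCsvReader is a hypothetical helper, and it assumes that every item is a plain name without embedded commas, which should be checked against the actual file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: groups item names by cluster id from CSV-style dump lines. */
final class ClusterCsvReader {

  /** Each line is assumed to be "clusterId,item1,item2,..."; blank lines are skipped. */
  static Map<String, List<String>> parse(List<String> lines) {
    Map<String, List<String>> clusters = new LinkedHashMap<>();
    for (String line : lines) {
      if (line.trim().isEmpty()) {
        continue;
      }
      String[] fields = line.split(",");
      List<String> items = new ArrayList<>();
      for (int i = 1; i < fields.length; i++) {
        items.add(fields[i].trim());
      }
      clusters.put(fields[0].trim(), items);
    }
    return clusters;
  }

  public static void main(String[] args) {
    // Made-up lines for illustration; real ids and item names come from result.csv.
    List<String> lines = Arrays.asList("0,item_a,item_b", "1,item_c");
    parse(lines).forEach((id, items) ->
        System.out.println("cluster " + id + ": " + items.size() + " items"));
  }
}
```

To read the real file, the list of lines could come from java.nio.file.Files.readAllLines(Paths.get("result.csv")).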

Read the results:

In the detailed output, CL-0 identifies Cluster 0, n=116 is the number of points observed by this cluster, c=[29.922 ...] is the center of the cluster as a vector, and r=[3.463 ...] is the radius of the cluster as a vector.
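
As a rough illustration of how n, c and r relate to the points of a cluster, the sketch below computes them for a small in-memory set of points. It assumes that the center is the per-dimension mean and that the radius is the per-dimension population standard deviation; the class ClusterSummary is hypothetical and this is not Mahout's code.

```java
/** Toy summary of one cluster; mirrors the n, c and r fields but is not Mahout's code. */
final class ClusterSummary {
  final int n;      // number of points in the cluster
  final double[] c; // center: per-dimension mean (assumption)
  final double[] r; // radius: per-dimension population standard deviation (assumption)

  /** Expects at least one point, with all points having the same number of dimensions. */
  ClusterSummary(double[][] points) {
    n = points.length;
    int dims = points[0].length;
    c = new double[dims];
    r = new double[dims];

    // First pass: sum every dimension, then divide by n to get the mean.
    for (double[] p : points) {
      for (int d = 0; d < dims; d++) {
        c[d] += p[d];
      }
    }
    for (int d = 0; d < dims; d++) {
      c[d] /= n;
    }

    // Second pass: average squared deviation from the mean, then take the square root.
    for (double[] p : points) {
      for (int d = 0; d < dims; d++) {
        double diff = p[d] - c[d];
        r[d] += diff * diff;
      }
    }
    for (int d = 0; d < dims; d++) {
      r[d] = Math.sqrt(r[d] / n);
    }
  }

  public static void main(String[] args) {
    // Two made-up points: the first dimension spreads, the second does not.
    ClusterSummary s = new ClusterSummary(new double[][] {{1, 2}, {3, 2}});
    System.out.println("n=" + s.n + " c=" + java.util.Arrays.toString(s.c)
        + " r=" + java.util.Arrays.toString(s.r));
  }
}
```

Under these assumptions, a large value in r for some dimension means the cluster's points vary widely in that feature, while a value near zero means they nearly agree on it.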


Wednesday, 5 November 2014
