Tuesday, 4 November 2014

Classification Algorithms in Mahout

Mahout classification algorithms include naive Bayes, complementary naive Bayes, stochastic gradient descent (SGD), and random forests.
Of these, SGD and random forests can make good use of continuous variables.

The SGD algorithm for logistic regression is a sequential (nonparallel) algorithm. Most importantly for working with large data, it uses a constant amount of memory regardless of the size of the input.

1. Logistic regression
Logistic regression is a classification model in which the predictor variables are combined with linear weights and then passed through a soft-limit function that restricts the output to the range 0 to 1: the logistic function 1 / (1 + e^-x). The output of a logistic regression model can therefore often be interpreted as a probability estimate.


The features given to the logistic regression must be in numerical form, so text, words, and categorical values must be encoded in vector form.
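
As a minimal sketch of the model itself (plain Java with no Mahout dependencies; the weights and feature values below are made-up numbers for illustration, not a trained model):

// Illustrative logistic regression prediction: combine numeric features
// with linear weights, then soft-limit the sum to (0, 1).
public class LogisticSketch {
  static double logistic(double x) {
    return 1.0 / (1.0 + Math.exp(-x));      // 1 / (1 + e^-x)
  }

  public static void main(String[] args) {
    double[] weights  = {0.8, -1.2, 0.3};   // assumed learned weights
    double[] features = {1.0,  0.5, 2.0};   // assumed encoded features

    double sum = 0.0;                       // linear combination of weights and features
    for (int i = 0; i < weights.length; i++) {
      sum += weights[i] * features[i];
    }
    System.out.println("P(class = 1) = " + logistic(sum));
  }
}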


2. Stochastic gradient descent (SGD)
SGD is a widely used learning algorithm in which each training example is used to tweak the model slightly so that it gives a more correct answer for that one example.
Although SGD algorithms are difficult to parallelize effectively, they’re often so fast that for a wide variety of applications, parallel execution isn’t necessary.
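
As a concrete sketch, Mahout's SGD-based logistic regression lives in org.apache.mahout.classifier.sgd as OnlineLogisticRegression. The hyperparameter values and the TrainingExample type below are assumptions for illustration, not part of the Mahout API:

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;

static OnlineLogisticRegression trainSketch(Iterable<TrainingExample> examples) {
  // 2 categories, 10,000 hashed features, L1 prior;
  // the learning rate and lambda values are illustrative, not tuned
  OnlineLogisticRegression learner =
      new OnlineLogisticRegression(2, 10000, new L1())
          .learningRate(0.1)
          .lambda(1e-5);
  for (TrainingExample ex : examples) {
    // each example nudges the model slightly toward the right answer
    learner.train(ex.getTarget(), ex.getVector());
  }
  return learner;
}

After training, learner.classifyScalar(v) returns the estimated probability of category 1 for a new encoded Vector v.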

3. Naive Bayes
The naive Bayes and complementary naive Bayes algorithms in Mahout are parallelized algorithms that can be applied to larger data sets. Naive Bayes is restricted to classification based on a single text-like variable.
If you have more than ten million training examples and the predictor variable is a single, text-like value, naive Bayes or complementary naive Bayes may be your best choice of algorithm. For other types of variables, or if you have less training data, try SGD.
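
For completeness, a hedged sketch of classifying with an already-trained Mahout naive Bayes model (the org.apache.mahout.classifier.naivebayes API; the model path is an assumption, and training itself is normally run as a parallel Hadoop job):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
import org.apache.mahout.math.Vector;

// Returns the index of the highest-scoring class for one encoded document
static int classify(Vector instance) throws Exception {
  // "/path/to/model" is an assumed location of a model produced by
  // Mahout's naive Bayes training job
  NaiveBayesModel model =
      NaiveBayesModel.materialize(new Path("/path/to/model"), new Configuration());
  StandardNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(model);
  Vector scores = classifier.classifyFull(instance);   // one score per class
  return scores.maxValueIndex();                       // highest-scoring class
}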


4. Random Forests

This algorithm trains an enormous number of simple classifiers and uses a voting scheme to get a single result. 
It doesn't scale as well as the other algorithms: because each small classifier is trained on some of the features of all of the training examples, the memory required on each node in the cluster scales roughly in proportion to the square root of the number of training examples.

It can, however, solve problems that are hard for the algorithms above. Typically these problems require a model to use variable interactions and discretization to handle threshold effects in continuous variables.
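
The voting idea itself is simple. A minimal sketch (plain Java; the Classifier interface stands in for the many small decision trees a real forest would train):

import java.util.List;

interface Classifier {
  int classify(double[] features);   // returns a class label
}

class VotingEnsemble {
  private final List<Classifier> members;

  VotingEnsemble(List<Classifier> members) {
    this.members = members;
  }

  // Majority vote: each member classifier votes, the most popular label wins
  int classify(double[] features, int numClasses) {
    int[] votes = new int[numClasses];
    for (Classifier c : members) {
      votes[c.classify(features)]++;
    }
    int best = 0;
    for (int label = 1; label < numClasses; label++) {
      if (votes[label] > votes[best]) {
        best = label;
      }
    }
    return best;
  }
}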


Process steps:
1. Preprocessing raw data
2. Converting data to vectors

In Mahout parlance, a Vector is a data type that stores floating-point numbers indexed by integers. 
Feature hashing: SGD-based classifiers avoid the need to build a dictionary of features in advance by simply picking a reasonable vector size and hashing (shoehorning) each feature into a vector of that size, as in the example below.


import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

// Encoder that hashes word values of the field "variable-name"
FeatureVectorEncoder encoder =
    new StaticWordValueEncoder("variable-name");

// DataRecord is a placeholder for whatever type holds your raw examples
for (DataRecord ex : trainingData) {
  Vector v = new RandomAccessSparseVector(10000);  // fixed size, picked in advance
  String word = ex.get("variable-name");
  encoder.addToVector(word, v);                    // hash the word into v
  // use vector v
}

Reference:
Sean Owen, Robin Anil, Ted Dunning, and Ellen Friedman, "Mahout in Action" (Manning, 2011)

