Alvin's Big Data Notebook : Introduction to Classification Algorithm

Classification is often used for prediction or detection. The output of a classification system is the assignment of data to one of a small predetermined set of categories.

1. If your system is unsupervised, it is not classification. If your questions are open-ended or the answers aren’t categorical, your system is not classification.

Below are basic distinctions among Classification, Recommendation and Clustering in Machine Learning.

Clustering algorithms are able to decide on their own (in machine learning parlance, they’re unsupervised learning algorithms) A supervised learning algorithm is one that’s given examples that contain the desired value of a target variable.
Classification algorithms learn to mimic examples of correct decisions (they’re supervised learning algorithms). They’re intended to make a single decision with a very limited set of possible outcomes.
Recommendation algorithms select and rank the best of many possible alternatives. Recommendation systems work well with items that are seen by many users.

2. There are two main phases involved in building a classification system:

the creation of a model produced by a learning algorithm,
and the use of that model to assign new data to categories.

3. Some basic concepts in classification.

Key idea	Description
Model	The output of the training algorithm is a model, which is a function that can then be applied to new examples in order to produce outputs that emulate the decisions
Training data	A subset of training examples labeled with the value of the target variable and used as input to the learning algorithm to produce the model.
Test data	A withheld portion of the training data with the value of the target variable hidden so that it can be used to evaluate the model.
Training	The learning process that uses training data to produce a model. That model can then compute estimates of the target variable given the predictor variables as inputs.
Feature	A known characteristic of a training or a new example; a feature is equivalent to a characteristic.
Predictor variable	A feature selected for use as input to a classification model. Not all features need be used. Some features may be algorithmic combinations of other features.
Target variable	A feature that the classification model is attempting to estimate: the target variable is categorical, and its determination is the aim of the classification system.

The process of describing a particular characteristic in a way that can be used by a classification system is known as feature extraction. If a feature is chosen to be used as input for the model, the value of that feature would then be thought of as a predictor variable. Commonly, most of the effort required to build a classifier is spent inventing and extracting useful features.
The fewer redundant or useless features you include, the more likely it is that your classifier will produce accurate results.

Often the target variable is a binary categorical variable, meaning it has only two possible values.

Things that look like integers aren’t necessarily continuous. A key test is to imagine adding two values together or taking the log or square root of a value. If this doesn’t make any sense, then you probably have a categorical value rather than a continuous one.

Reference:

"Mahout In Action"

Alvin's Big Data Notebook

Sunday, 2 November 2014

Introduction to Classification Algorithm

No comments:

Post a Comment