Alvin's Big Data Notebook: Supervised vs. Unsupervised Learning

1. Training set and testing set
Machine learning is about learning some properties of a data set and applying them to new data. This is why a common practice when evaluating an algorithm is to split the available data into two sets: a training set, on which we learn the data's properties, and a testing set, on which we test those properties.
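
To make the split concrete, here is a minimal sketch in plain Python. The function name split_train_test, the 25% test fraction, and the fixed seed are assumptions chosen for illustration; they are not part of the original tutorial.

```python
import random


def split_train_test(samples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the samples and cut it into (training, testing) lists."""
    shuffled = list(samples)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle keeps the split repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(20))  # stand-in for 20 samples
train_set, test_set = split_train_test(data)
print(len(train_set), len(test_set))
```

Shuffling before cutting matters when the data is stored in some order (for example, sorted by class); otherwise the testing set may not look like the training set.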

2. Supervised Learning
The data comes with additional attributes that we want to predict. This problem can be either:

classification: samples belong to two or more classes and we want to learn from already labeled data how to predict the class of unlabeled data. An example of a classification problem is handwritten digit recognition, in which the aim is to assign each input vector to one of a finite number of discrete categories. Another way to think of classification is as a discrete (as opposed to continuous) form of supervised learning: there is a limited number of categories, and for each of the n samples provided, the task is to label it with the correct category or class.
regression: if the desired output consists of one or more continuous variables, then the task is called regression. An example of a regression problem would be the prediction of the length of a salmon as a function of its age and weight.
With regression, we can forecast a numeric value from a set of inputs. It differs from classification in that we’re predicting a continuous value rather than a discrete category.
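
As a rough illustration of classification, the sketch below labels a new point with the class of its closest labeled sample (a nearest-neighbor rule). This is just one simple classifier picked for illustration, not a method prescribed by the tutorial; the function name nearest_neighbor_predict and the toy data are made up.

```python
def nearest_neighbor_predict(labeled, query):
    """Return the label of the labeled sample closest to the query point."""
    def sq_dist(a, b):
        # squared Euclidean distance between two feature vectors
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(labeled, key=lambda pair: sq_dist(pair[0], query))
    return label


# Hypothetical labeled data: (feature vector, class)
labeled = [
    ((1.0, 1.0), "small"),
    ((1.2, 0.8), "small"),
    ((5.0, 5.5), "large"),
    ((6.0, 5.0), "large"),
]

prediction = nearest_neighbor_predict(labeled, (5.2, 4.9))
print(prediction)  # large
```

The output is one of a fixed set of labels, which is what makes this classification; a regression model would return a number instead.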

Linear regression assumes the output can be obtained by multiplying each input by a constant and adding up the results.

For example,

HorsePower = 0.0015*annualSalary - 0.99*hoursListeningToPublicRadio

The above is a regression equation. The values 0.0015 and -0.99 are regression weights. The process of finding these regression weights from data is called regression.
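
One common way to find such weights is ordinary least squares. The sketch below is an illustration under assumptions: the sample rows are invented, the targets are generated from the equation above without noise, and the helper fit_two_weights is a hypothetical name. It solves the 2x2 normal equations for a model with two inputs and no intercept.

```python
def fit_two_weights(rows, targets):
    """Least-squares weights (w1, w2) for y = w1*x1 + w2*x2, with no intercept."""
    s11 = sum(x1 * x1 for x1, _ in rows)
    s22 = sum(x2 * x2 for _, x2 in rows)
    s12 = sum(x1 * x2 for x1, x2 in rows)
    b1 = sum(x1 * y for (x1, _), y in zip(rows, targets))
    b2 = sum(x2 * y for (_, x2), y in zip(rows, targets))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("inputs are collinear; the weights are not unique")
    # Cramer's rule on the normal equations
    w1 = (b1 * s22 - b2 * s12) / det
    w2 = (s11 * b2 - s12 * b1) / det
    return w1, w2


# Made-up rows: (annualSalary, hoursListeningToPublicRadio)
rows = [(30000, 2), (45000, 10), (60000, 1), (80000, 7), (52000, 4)]
# Targets generated from the example equation, so the fit should recover its weights
targets = [0.0015 * salary - 0.99 * hours for salary, hours in rows]

w1, w2 = fit_two_weights(rows, targets)
print(w1, w2)  # close to 0.0015 and -0.99
```

With real, noisy measurements the recovered weights would only approximate the underlying relationship.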

3. Unsupervised Learning

The training data consists of a set of input vectors x without any corresponding target values. The goal in such problems may be to discover groups of similar examples within the data, which is called clustering; to determine the distribution of data within the input space, known as density estimation; or to project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization.
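
As a small illustration of clustering, the sketch below runs a basic k-means loop on unlabeled 2-D points. k-means is only one of many clustering algorithms and is not named in the text above; the function name, the toy points, and the choice of starting centroids are all assumptions.

```python
def kmeans(points, initial_centroids, iterations=10):
    """Basic k-means: alternate between assigning points and moving centroids."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    centroids = [tuple(c) for c in initial_centroids]
    labels = []
    for _ in range(iterations):
        # assignment step: each point gets the index of its nearest centroid
        labels = [
            min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        # update step: move each centroid to the mean of its assigned points
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Unlabeled toy data with two visibly separate groups
points = [(1.0, 1.1), (0.9, 0.8), (1.2, 1.0), (8.0, 8.2), (7.9, 7.7), (8.3, 8.0)]
labels, centroids = kmeans(points, initial_centroids=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

No target values are supplied anywhere: the group numbers come out of the data alone, which is the defining trait of unsupervised learning.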

Reference:

http://scikit-learn.org/stable/tutorial/basic/tutorial.html


Wednesday, 26 November 2014
