Thursday, 18 December 2014

Alternating Least Squares in MLlib

Alternating Least Squares (ALS)

ALS approximates a user-item matrix A as the product of two smaller factor matrices of latent features, one for users and one for items. The algorithm repeatedly refines one factor matrix by solving a least-squares minimization problem that holds the other one fixed; then it alternates, refining the other matrix, and repeats iteratively. This alternation is the source of its name. To begin the process, one matrix is filled with random feature vectors.
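To make the alternation concrete, here is a minimal, single-machine sketch of the idea, assuming the Breeze linear algebra library is on the classpath. The toy ratings matrix, rank, regularization value, and iteration count are made-up illustrations, and unobserved entries are simply treated as zero ratings for brevity; this is not MLlib's distributed implementation.

import breeze.linalg.DenseMatrix

object AlsSketch {
  def main(args: Array[String]): Unit = {
    val rank   = 2
    val lambda = 0.1

    // Toy ratings matrix R: 4 users x 3 items (0 stands in for "unobserved").
    val R = DenseMatrix(
      (5.0, 3.0, 0.0),
      (4.0, 0.0, 1.0),
      (1.0, 1.0, 5.0),
      (0.0, 1.0, 4.0))

    // Item factors Y start out random; user factors X are solved first.
    val Y = DenseMatrix.rand(R.cols, rank)
    val X = DenseMatrix.zeros[Double](R.rows, rank)

    for (_ <- 0 until 10) {
      // Hold Y fixed: solve a regularized least-squares problem for each user row.
      for (u <- 0 until R.rows) {
        val A = Y.t * Y + DenseMatrix.eye[Double](rank) * lambda
        val b = Y.t * R(u, ::).t
        X(u, ::) := (A \ b).t
      }
      // Alternate - hold X fixed: solve for each item column.
      for (i <- 0 until R.cols) {
        val A = X.t * X + DenseMatrix.eye[Double](rank) * lambda
        val b = X.t * R(::, i)
        Y(i, ::) := (A \ b).t
      }
    }

    // The product of the two small factor matrices approximates R.
    println(X * Y.t)
  }
}

Note that each per-user (and per-item) solve in the inner loops depends only on its own row or column of R, which is why the work decomposes so cleanly.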

Importantly, ALS computations are quite parallelizable, because each row of a factor matrix can be computed independently with straightforward linear algebra. The computation can also be optimized to exploit sparsity in the input data.

These algorithms are sometimes called matrix completion algorithms, because the original matrix A may be quite sparse, while the approximate product of the two factor matrices is completely dense.

ALS is commonly used to solve the collaborative filtering (CF) problem.

Collaborative filtering (CF) is a technique used by some recommender systems.

For example, deciding that two users may share similar tastes because they are the same age is not an example of collaborative filtering. Deciding that two users may like the same song since they play many of the same other songs is an example.

MLlib's ALS implementation requires numeric IDs for users and items, and further requires them to be nonnegative 32-bit integers, which means IDs larger than Integer.MAX_VALUE (2147483647) can't be used.
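As a rough sketch of what that looks like in practice, ratings are parsed into MLlib's Rating objects with Int IDs before training. The file path, field layout, and hyperparameters below are assumptions for illustration, not values from this post.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object AlsTraining {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ALSTraining"))

    // Assume each input line looks like "userID,itemID,rating";
    // both IDs must parse into nonnegative 32-bit Ints.
    val ratings = sc.textFile("hdfs:///data/ratings.csv").map { line =>
      val Array(user, item, rating) = line.split(',')
      Rating(user.toInt, item.toInt, rating.toDouble)
    }

    // rank = 10 latent features, 10 iterations, lambda = 0.01 regularization.
    val model = ALS.train(ratings, 10, 10, 0.01)

    // Predict how user 42 would rate item 7 (hypothetical IDs).
    println(model.predict(42, 7))
  }
}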

By default, the RDD will contain one partition for each HDFS block. Because machine learning tasks are more compute-intensive than typical data processing, it's better to use more partitions. This lets Spark put more processor cores to work on the problem at once.
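For example, in the spark-shell you can ask for more partitions when reading the data, or repartition an existing RDD before the compute-heavy ALS step. The path and the count of 100 are arbitrary illustrations.

// Request roughly 100 partitions up front instead of one per HDFS block.
val rawData = sc.textFile("hdfs:///data/ratings.csv", minPartitions = 100)

// Or reshuffle an already-loaded RDD into more partitions.
val repartitioned = rawData.repartition(100)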





