Thursday 25 December 2014

Latent Semantic Analysis in MLlib


1. Latent Semantic Analysis (LSA) is a technique in natural language processing and information retrieval that seeks to better understand a corpus of documents and the relationships between the words in those documents. 
It attempts to distill the corpus into a set of relevant concepts.
Each concept captures a thread of variation in the data and often corresponds to a topic that the corpus discusses.

2. TF-IDF:
val tf = termFrequencyInDoc.toDouble / totalTermsInDoc
val idf = math.log(totalDocs.toDouble / termFreqInCorpus)
val tfIdf = tf * idf

TF-IDF captures two intuitions about the relevance of a term to a document. First, one would expect that the more often a term occurs in a document, the more important it is to that document. Second, not all terms are equal in a global sense. It is more meaningful to encounter a word that occurs rarely in the entire corpus than a word that appears in most of the documents; thus the metric uses the inverse of the word's frequency across documents in the full corpus.
The log is used to mellow the differences between common words and rare words in document frequencies.
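To make the weighting concrete, here is a minimal sketch that packages the formula above into a function, with a worked example (the function name and the numbers are made up for illustration):

def termDocWeight(termFrequencyInDoc: Int, totalTermsInDoc: Int,
    termFreqInCorpus: Int, totalDocs: Int): Double = {
  // termFreqInCorpus is the number of documents that contain the term
  val tf = termFrequencyInDoc.toDouble / totalTermsInDoc
  val idf = math.log(totalDocs.toDouble / termFreqInCorpus)
  tf * idf
}

// A term appearing 3 times in a 100-word document, in 10 of 1,000 documents:
// tf = 3/100 = 0.03, idf = ln(1000/10) ≈ 4.61, weight ≈ 0.03 * 4.61 ≈ 0.14
println(termDocWeight(3, 100, 10, 1000))

A term that appears in every document gets idf = ln(1) = 0 and is effectively ignored, while a rare term gets a large boost.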


3. LSA discovers this lower-dimensional representation using a linear algebra technique called Singular Value Decomposition (SVD). 
  • It starts with a term-document matrix generated by counting word frequencies for each document. In the representation used here (and by the RowMatrix code below), each row corresponds to a document, each column corresponds to a term, and each element represents the importance of a word to a document.
  • SVD then factorizes this matrix into three matrices, one of which expresses concepts in regard to documents, one of which expresses concepts in regard to terms, and one of which contains the importance for each concept. 
  • The structure of these matrices is such that a low-rank approximation of the original matrix can be achieved by removing a set of their rows and columns corresponding to the least important concepts. That is, the matrices in this low-rank approximation can be multiplied to produce a matrix close to the original, with increasing loss of fidelity as each concept is removed.


The singular value decomposition takes an m x n matrix M and returns three matrices that approximately equal it when multiplied together: M ≈ U S Vᵀ.

  • U is an m x k matrix where each row corresponds to a document and each column corresponds to a concept.
  • S is a k x k diagonal matrix that holds the singular values. Each diagonal element in S corresponds to a single concept, and the values appear in descending order of importance.
  • V is an n x k matrix where each row corresponds to a term and each column corresponds to a concept.
In the LSA case, m is the number of documents and n is the number of terms. The decomposition is parameterized with a number k, less than or equal to n, that indicates how many concepts to keep around. A key insight of LSA is that only a small number of concepts are important to representing the data, as the toy sketch below illustrates.
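To make the factor-and-truncate idea concrete, here is a minimal sketch using Breeze, the linear algebra library that MLlib builds on; the matrix and its values are made up for illustration:

import breeze.linalg.{DenseMatrix, diag, svd}

// Toy 4 x 3 matrix: rows are documents, columns are terms
val m = DenseMatrix(
  (4.0, 0.0, 1.0),
  (3.0, 0.0, 0.0),
  (0.0, 5.0, 2.0),
  (0.0, 4.0, 1.0))

// Full decomposition: m ≈ u * diag(s) * vt
val svd.SVD(u, s, vt) = svd(m)

// Keep only the top k concepts for a low-rank approximation
val k = 2
val approx = u(::, 0 until k) * diag(s(0 until k)) * vt(0 until k, ::)

Dropping the smallest singular values this way yields the closest rank-k matrix to the original, which is why the truncated factors still capture most of the structure in the data.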
To find the singular value decomposition of a matrix, simply wrap an RDD of row vectors in a RowMatrix and call computeSVD:


import org.apache.spark.mllib.linalg.distributed.RowMatrix

// termDocMatrix is an RDD[Vector] with one TF-IDF row vector per document.
// Cache it: the iterative SVD computation passes over the data repeatedly.
termDocMatrix.cache()
val mat = new RowMatrix(termDocMatrix)
val k = 1000
val svd = mat.computeSVD(k, computeU = true)
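The returned SingularValueDecomposition bundles the three factors; the inspection below is just an illustrative sketch:

// svd.U: a distributed RowMatrix (m x k); rows correspond to documents
// svd.s: a local Vector holding the k singular values in descending order
// svd.V: a local Matrix (n x k); rows correspond to terms
val u = svd.U
val sigma = svd.s
val v = svd.V
println("Top singular values: " + sigma.toArray.take(5).mkString(", "))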


Reference:
Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills, "Advanced Analytics with Spark", O'Reilly Media.
