Thursday 9 July 2015

Ngrams in Hive

ngrams() and context_ngrams(): N-gram frequency estimation

N-grams are subsequences of length N drawn from a longer sequence. The purpose of the ngrams() UDAF is to find the k most frequent n-grams from one or more sequences. It can be used in conjunction with the sentences() UDF to analyze unstructured natural language text, or the collect() function to analyze more general string data.
Contextual n-grams are similar to n-grams, but allow you to specify a 'context' string around which n-grams are to be estimated. For example, you can specify that you're only interested in finding the most common two-word phrases in text that follow the context "I love". You could achieve the same result by manually stripping sentences of non-contextual content and then passing them to ngrams(), but context_ngrams() makes it much easier.

SELECT explode(ngrams(sentences(lower(tweet)), 2, 100 [, 1000])) FROM twitter;
The command above will return the top-100 bigrams (2-grams) from a hypothetical table called twitter. The tweet column is assumed to contain a string with arbitrary, possibly meaningless, text. The lower() UDF first converts the text to lowercase for standardization, and then sentences() splits up the text into arrays of words.
The optional fourth argument is the precision factor that controls the tradeoff between memory usage and accuracy in frequency estimation. Higher values will be more accurate, but could potentially crash the JVM with an OutOfMemory error. If omitted, sensible defaults are used.
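Each element returned by ngrams() is a struct with an ngram field (the array of words) and an estfrequency field (the estimated count). The query below is a minimal sketch of how those fields might be unpacked into columns, with the precision factor passed explicitly. It assumes the same hypothetical twitter table and tweet column; the aliases x and t are illustrative names only.

```sql
-- Sketch only: twitter/tweet are the hypothetical table and column from the text;
-- x and t are illustrative aliases.
SELECT x.ngram        AS ngram,         -- array<string>, e.g. the two words of a bigram
       x.estfrequency AS estfrequency   -- double, estimated number of occurrences
FROM (
  -- N = 2 (bigrams), K = 100 (top 100), precision factor = 1000
  SELECT explode(ngrams(sentences(lower(tweet)), 2, 100, 1000)) AS x
  FROM twitter
) t
ORDER BY estfrequency DESC;
```

If the job runs out of memory, lowering or omitting the fourth argument is the first thing to try.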
SELECT explode(context_ngrams(sentences(lower(tweet)), array("i","love",null), 100 [, 1000])) FROM twitter;
The command above will return a list of the top 100 words that follow the phrase "i love" in a hypothetical database of Twitter tweets. Each null specifies the position of an n-gram component to estimate; therefore, every query must contain at least one null in the context array.
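The number of nulls sets how many words are estimated at once. As a sketch of the two-word case mentioned earlier, again assuming the hypothetical twitter table and tweet column (the alias x is illustrative), two trailing nulls ask for the most common two-word phrases that follow "i love":

```sql
-- Sketch only: twitter/tweet are hypothetical; x is an illustrative alias.
-- The context is "i love" followed by two open positions,
-- so each result is a pair of words that followed "i love".
SELECT explode(
         context_ngrams(sentences(lower(tweet)),
                        array("i", "love", null, null),
                        100)
       ) AS x
FROM twitter;
```

A context made up only of nulls, such as array(null, null), has no fixed words to match, so it should behave like a plain bigram query.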
