Wednesday 7 January 2015

Implementing Java/Python UDFs in Hive

Java UDF:

The org.apache.hadoop.hive.ql.udf.generic.GenericUDF API lets you write UDFs that operate on objects that are not Writable types, for example struct, map and array types.
This API requires you to manually manage object inspectors for the function arguments, and to verify the number and types of the arguments you receive. An object inspector provides a consistent interface to the underlying object types, so that different object implementations can all be accessed in the same way from within Hive (e.g. you could implement a struct as a Map, so long as you provide a corresponding object inspector).
The API requires you to implement three methods:
// This is like the evaluate method of the simple API. It takes the actual arguments and returns the result.
abstract Object evaluate(GenericUDF.DeferredObject[] arguments) throws HiveException;

// Shown in EXPLAIN output. Any string works, but it should be a readable representation of the function call.
abstract String getDisplayString(String[] children);

// Called once, before any evaluate() calls. You receive an array of object inspectors that represent the arguments of the function.
// This is where you validate that the function is receiving the correct number of arguments, and the correct argument types.
abstract ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException;

The call pattern for a UDF is the following:
  1. The UDF is initialized using a default constructor.
  2. udf.initialize() is called with the array of object inspectors for the UDF arguments.
  3. evaluate() is called for each row in your query, with the arguments provided.
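
The call pattern above can be mimicked in a few lines of plain Python. This is only a toy model of the lifecycle, not Hive's API: the class ToyUpperUDF, the string type tags and the driver function run_query are all made up for illustration, and a real GenericUDF would be written in Java against ObjectInspector instances.

```python
class ToyUpperUDF:
    """Toy stand-in for a GenericUDF; hypothetical, not part of Hive."""

    def initialize(self, arg_types):
        # Called once: validate argument count and types, report the return type.
        if len(arg_types) != 1:
            raise ValueError("expected exactly one argument")
        if arg_types[0] != "string":
            raise TypeError("expected a string argument")
        return "string"

    def evaluate(self, args):
        # Called once per row with that row's argument values.
        value = args[0]
        return None if value is None else value.upper()


def run_query(udf_class, arg_types, rows):
    udf = udf_class()                            # 1. default constructor
    udf.initialize(arg_types)                    # 2. initialize() once
    return [udf.evaluate(row) for row in rows]   # 3. evaluate() per row


print(run_query(ToyUpperUDF, ["string"], [["a"], [None], ["hive"]]))
```

Because the validation lives in initialize(), a wrong argument count or type fails once, before any row is processed, rather than on every evaluate() call.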


Python UDF:

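Writing a Hive UDF in Python is much more convenient than writing one in Java.
Under the hood it is a streaming operation: Hive pipes each row to the script through stdin and reads the result back from stdout, so the performance is not as good as a Java UDF.
For example, a UDF script: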

#!/usr/bin/python
import sys
import datetime

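for line in sys.stdin:
    # Hive sends one row per line, with the columns separated by tabs
    tokens = line.rstrip("\n").split("\t")
    # transform the tokens here
    # emit the output columns tab-separated (add further tokens as needed)
    print("\t".join([tokens[0], tokens[1]]))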
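
A slightly fuller version of such a script might look like the sketch below. It assumes, purely for illustration, that each input row has two columns, an id and a date in yyyy-MM-dd form, and that the output should be the id plus the ISO weekday number. The names transform and run are hypothetical. It also relies on Hive's default TRANSFORM behaviour of writing NULL as the literal string \N.

```python
import sys
from datetime import datetime

NULL = "\\N"  # how Hive writes NULL in TRANSFORM input and reads it back in output


def transform(line):
    """Turn 'id<TAB>yyyy-MM-dd' into 'id<TAB>ISO weekday (1=Mon .. 7=Sun)'."""
    tokens = line.rstrip("\n").split("\t")
    row_id, day = tokens[0], tokens[1]
    try:
        weekday = str(datetime.strptime(day, "%Y-%m-%d").isoweekday())
    except ValueError:
        weekday = NULL  # NULL or malformed date: emit NULL instead of failing
    return "\t".join([row_id, weekday])


def run(lines, out=sys.stdout):
    for line in lines:
        out.write(transform(line) + "\n")


# In the script shipped to Hive the last line would be run(sys.stdin);
# sample lines are used here so the sketch can be tried without Hive.
run(["a1\t2015-01-07\n", "a2\t\\N\n"])
```

Keeping the per-row logic in its own function means it can be checked with ordinary Python tests before the script is ever added to a Hive session.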

To run the Hive UDF:

ADD FILE hdfs://ipaddress/path/udf.py;

SELECT TRANSFORM(*)
USING "python udf.py"
AS (field_1 STRING, field_2 INT)
FROM mytable;


Note:
Only the columns passed to TRANSFORM() are sent to the script. If you want to output fields from "mytable", those fields must be listed in TRANSFORM() and printed by the script.
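
To try a script locally it helps to reproduce the row format on both sides of the pipe. The sketch below is an assumption-laden stand-in for what Hive does by default (tab-separated cells, NULL written as \N). The helpers encode_row and decode_row and the sample row are hypothetical, and only simple strings and numbers are handled.

```python
def encode_row(values):
    """Serialize one row roughly the way TRANSFORM feeds it to the script."""
    return "\t".join("\\N" if v is None else str(v) for v in values)


def decode_row(line, casts):
    """Parse one output line, casting each cell as the AS clause would."""
    cells = line.rstrip("\n").split("\t")
    return [None if c == "\\N" else cast(c) for c, cast in zip(cells, casts)]


row = {"id": "u1", "age": 31, "city": None}

# Only the columns listed in TRANSFORM(...) reach the script,
# so "city" is invisible to it here.
line = encode_row([row["id"], row["age"]])
print(repr(line))
print(decode_row(line, [str, int]))
```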

References:
http://ragrawal.wordpress.com/2013/09/14/detecting-gender-bias-per-movie-genre-using-hive/
http://blog.matthewrathbone.com/2013/08/10/guide-to-writing-hive-udfs.html
