The org.apache.hadoop.hive.ql.udf.generic.GenericUDF API lets you write functions over objects that are not simple writable types, for example struct, map and array types.
This API requires you to manage object inspectors for the function arguments yourself, and to verify the number and types of the arguments you receive. An object inspector provides a uniform interface to an underlying object type, so that different object implementations can all be accessed in the same way from within Hive (e.g. you could implement a struct as a Map, so long as you provide a corresponding object inspector).
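The idea of an object inspector can be illustrated outside Hive. The sketch below is only an analogy in Python: the class names DictStructInspector and TupleStructInspector and the function describe are hypothetical and are not part of Hive's API. It shows the same "struct" stored two different ways and read through one common interface.

```python
# Analogy only: these classes are hypothetical and are not part of Hive.
class DictStructInspector:
    """Reads fields from a struct that is stored as a dict."""

    def get_field(self, data, name):
        return data[name]


class TupleStructInspector:
    """Reads fields from a struct stored as a tuple with a fixed field order."""

    def __init__(self, field_names):
        self._index = {name: i for i, name in enumerate(field_names)}

    def get_field(self, data, name):
        return data[self._index[name]]


def describe(data, inspector):
    # The caller only talks to the inspector, never to the raw object.
    return "{} is {}".format(inspector.get_field(data, "name"),
                             inspector.get_field(data, "age"))


as_dict = {"name": "ann", "age": 30}
as_tuple = ("ann", 30)
from_dict = describe(as_dict, DictStructInspector())
from_tuple = describe(as_tuple, TupleStructInspector(["name", "age"]))
```

Both calls produce the same string even though the underlying representations differ, which is the role the inspector plays for a GenericUDF.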
The API requires you to implement three methods:
// This is like the evaluate method of the simple API. It takes the actual arguments and returns the result.
public abstract Object evaluate(GenericUDF.DeferredObject[] arguments) throws HiveException;
// The exact value is not critical, but it should be a readable string representation of the function (Hive shows it in EXPLAIN output).
public abstract String getDisplayString(String[] children);
// Called once, before any evaluate() calls. You receive an array of object inspectors that represent the arguments of the function.
// This is where you validate that the function is receiving the correct number of arguments and the correct argument types.
public abstract ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException;
The call pattern for a UDF is the following:
- The UDF is instantiated using its default constructor.
- udf.initialize() is called with the array of object inspectors for the UDF arguments.
- evaluate() is called for each row in your query with the arguments provided.
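The call pattern above can be mimicked in a few lines of Python. This is a toy model rather than Hive code: ToyUpperUDF and run_query are hypothetical names, and plain type-name strings stand in for object inspectors. It is only meant to show that validation happens once, while evaluation happens once per row.

```python
# Analogy only: ToyUpperUDF and run_query are hypothetical, not Hive classes.
class ToyUpperUDF:
    def initialize(self, arg_types):
        # Called once: validate argument count and types, report the return type.
        if len(arg_types) != 1:
            raise ValueError("expected exactly one argument")
        if arg_types[0] != "string":
            raise TypeError("expected a string argument")
        return "string"

    def evaluate(self, args):
        # Called once per row with that row's argument values.
        value = args[0]
        return None if value is None else value.upper()


def run_query(udf_class, arg_types, rows):
    udf = udf_class()                            # 1. default constructor
    udf.initialize(arg_types)                    # 2. one-time setup and validation
    return [udf.evaluate(row) for row in rows]   # 3. one call per row


results = run_query(ToyUpperUDF, ["string"], [["a"], [None], ["bc"]])
```

Because a wrong argument count or type is rejected in initialize(), a bad query fails before any row is processed instead of partway through.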
Python UDF:
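Writing a Hive UDF in Python is much more convenient than writing one in Java.
Under the hood it is a streaming operation: Hive pipes each row to the script as text and reads the results back, so the performance is not as good as a Java UDF.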
For example, a UDF script:
#!/usr/bin/python
import sys
import datetime

for line in sys.stdin:
    tokens = line.strip().split("\t")
    # transform each token here
    print("\t".join([tokens[0], tokens[1],...]))
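A slightly fuller sketch of such a script is given below. It assumes a hypothetical two-column layout (a name and an age) and upper-cases the name; the helper names transform_row and run are made up for illustration. It also passes through the literal string \N, which is how TRANSFORM represents NULL values in the text stream. The sample input is fed from an in-memory buffer so the logic can be checked without Hive.

```python
import io
import sys

HIVE_NULL = "\\N"  # TRANSFORM writes NULL values as the literal string \N


def transform_row(line):
    """Turn one tab-separated input row into one tab-separated output row."""
    tokens = line.rstrip("\n").split("\t")
    name, age = tokens[0], tokens[1]  # hypothetical two-column layout
    if name != HIVE_NULL:
        name = name.upper()
    return "\t".join([name, age])


def run(stream, out):
    # Hive sends one row per line and expects one output row per line.
    for line in stream:
        out.write(transform_row(line) + "\n")


# Local check with sample rows instead of real Hive input.
sample = io.StringIO("ann\t30\n\\N\t41\n")
captured = io.StringIO()
run(sample, captured)
output_lines = captured.getvalue().splitlines()
```

When saved as the actual script, the local check at the bottom would be replaced by a call to run(sys.stdin, sys.stdout).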
To run the Hive UDF:
ADD FILE hdfs://ipaddress/path/udf.py;
SELECT TRANSFORM(*) USING "python udf.py" AS (field_1 STRING, field_2 INT) FROM mytable;
Note:
Only the columns passed to TRANSFORM() are sent to the script, so any field from "mytable" that you want in the output must be listed in TRANSFORM().
Reference:
http://ragrawal.wordpress.com/2013/09/14/detecting-gender-bias-per-movie-genre-using-hive/
http://blog.matthewrathbone.com/2013/08/10/guide-to-writing-hive-udfs.html