Sunday, 28 September 2014

Pivot a Hive Table

Unfortunately, Hive doesn't support pivoting a table.
To achieve this, we need to map 2 columns in one map entry, pivot it,
then lookup from the map.
Brickhouse provides a collect UDF.

Example: Before,
 | ID   | Category |  Metric  |    
 | 1    |    A     |   3      |   
 | 1    |    C     |   5      |  
 | 2    |    B     |   7      |  
 | 3    |    A     |   2      | 
 | 3    |    B     |   1      |  
 | 4    |    D     |   4      |  

After,
 | ID   |  A  |  B  |  C  |  D  | 
 | 1    |  3  |  0  |  5  |  3  |    
 | 2    |  0  |  7  |  0  |  0  |
 | 3    |  2  |  1  |  0  |  0  | 
 | 4    |  0  |  0  |  0  |  4  |



add jar brickhouse-0.7.0-SNAPSHOT.jar;
CREATE TEMPORARY FUNCTION collect AS 'brickhouse.udf.collect.CollectUDAF';


create table catergory_members_vector 
row format delimited fields terminated by ',' stored as textfile
select Id,
COALESCE(group_map['A'],0.0),
COALESCE(group_map['B'],0.0)
from ( select Id,
        collect(catergory, Metric/10.0) as group_map 
        from catergory_members 
        group by Id
    ) gm


For the create table as clause
  1. Target cannot be partitioned table.
  2. Target cannot be external table.
  3. It copies the structure as well as the data

Reference:
https://github.com/klout/brickhouse/wiki/Collect-UDFs
http://brickhouseconfessions.wordpress.com/2013/03/05/use-collect-to-avoid-the-self-join/
http://stackoverflow.com/questions/23025380/how-to-transpose-pivot-data-in-hive
http://www.qubole.com/hive-function-cheat-sheet/

No comments:

Post a Comment