Alvin's Big Data Notebook : Auto Increment Row Id in Hive

When we don't want to expose a column in Hive to clients for privacy reasons, we have the following options.

1. hash(column). Hashing can produce collisions: hash values and original values do not map one-to-one. Hence, the original value cannot be recovered from its hash value.
SHA1 collisions are possible in theory, but none are known for short strings. Git uses SHA1 hashes as IDs, and as of 2014 there are still no known SHA1 collisions.
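
For option 1, here is a minimal sketch of masking a column value with SHA-1 using only the JDK. The class name `ColumnHasher` and the method `sha1Hex` are made up for illustration and are not part of Hive; inside Hive the same logic would have to be wrapped in a UDF.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Hive: masks a column value with SHA-1.
final class ColumnHasher {

    // Returns the SHA-1 digest of the value as 40 lowercase hex characters.
    static String sha1Hex(String value) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(value.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 must be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints a9993e364706816aba3e25717850c26c9cd0d89d
        System.out.println(sha1Hex("abc"));
    }
}
```

The same input always gives the same digest, so the masked column can still be used for joins and grouping.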

2. Sequential numbers. Since Hive runs many mappers/reducers in parallel, there is no straightforward way to generate a globally unique, increasing row id. Plain sequential numbers do not work because of the parallel nature of Hadoop: each task counts independently of the others.
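
To see why option 2 fails, here is a small simulation. It is not Hive code, and `LocalCounterDemo` is a hypothetical name: two independent "tasks" each number their own rows from 0, as a task-local counter would.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical simulation, not Hive code: each "task" numbers its own rows from 0.
final class LocalCounterDemo {

    // Ids that one task assigns to rowCount rows using only a task-local counter.
    static List<Long> idsForTask(int rowCount) {
        List<Long> ids = new ArrayList<>();
        for (long seq = 0; seq < rowCount; seq++) {
            ids.add(seq);
        }
        return ids;
    }

    // Number of ids that appear in the output of both tasks.
    static int overlap(List<Long> a, List<Long> b) {
        Set<Long> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    public static void main(String[] args) {
        // Task A handles 3 rows, task B handles 5 rows; ids 0, 1, 2 are assigned twice.
        System.out.println(overlap(idsForTask(3), idsForTask(5))); // prints 3
    }
}
```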

3. UUID. The following query generates a unique id for each row:
select reflect("java.util.UUID", "randomUUID"), customer_name, address, unique_value from table_name
But the table can't be updated, since there is no way to guarantee that the ids of new records don't duplicate existing ones.
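
For option 3, this is roughly what the `reflect` call does per row: it invokes the static JDK method `java.util.UUID.randomUUID()`. The wrapper class `UuidRowId` below is a hypothetical name used only for this sketch.

```java
import java.util.UUID;

// Hypothetical wrapper showing the JDK call that the reflect(...) query invokes per row.
final class UuidRowId {

    // Returns a random (version 4) UUID in its 36-character text form.
    static String newRowId() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        // Every call returns a fresh random value, so the output differs on each run.
        System.out.println(newRowId());
        System.out.println(newRowId());
    }
}
```

Because each call is random, re-running the query assigns different ids to the same rows, so the ids should be stored once rather than regenerated.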

4. taskid + sequence number. The taskid is unique within one Hadoop cluster, and each task processes its own batch of records, so it is easy to give the records within one task unique sequence numbers. Combining the two gives an id that is unique across tasks. https://github.com/myui/hivemall/blob/master/src/main/java/hivemall/tools/mapred/RowIdUDF.java
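
Option 4 can be sketched in plain Java as below. This is not a copy of the linked RowIdUDF. `TaskRowIdGenerator` is a hypothetical class, and the task id is passed in by hand here, whereas a real UDF would read it from the running job.

```java
// Hypothetical sketch of the taskid + sequence idea; not the linked RowIdUDF.
final class TaskRowIdGenerator {
    private final String taskId; // assumed to be unique per task
    private long sequence = 0L;  // task-local counter

    TaskRowIdGenerator(String taskId) {
        this.taskId = taskId;
    }

    // Returns "<taskId>-<sequence>" and advances the task-local counter.
    String next() {
        return taskId + "-" + (sequence++);
    }

    public static void main(String[] args) {
        TaskRowIdGenerator task0 = new TaskRowIdGenerator("0");
        TaskRowIdGenerator task1 = new TaskRowIdGenerator("1");
        System.out.println(task0.next()); // prints 0-0
        System.out.println(task0.next()); // prints 0-1
        System.out.println(task1.next()); // prints 1-0
    }
}
```

The separator matters: without it, task 1 with sequence 11 and task 11 with sequence 1 would both produce "111".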

5. Encrypt and decrypt. See:
http://stackoverflow.com/questions/26855003/generate-unique-customer-id-insert-unique-rows-in-hive
http://thinkbigdataanalytics.com/creating-udf-in-hive-hadoop/
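
For option 5, here is a minimal sketch of reversible masking with the JDK's AES-GCM support. The class `ColumnCipher`, its method names, and the choice of AES-GCM are assumptions made for this example and are not taken from the linked pages. A real deployment would load the key from a key store instead of generating it on the spot.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Hypothetical helper: reversible masking of a column value with AES-GCM.
final class ColumnCipher {
    private static final int IV_BYTES = 12;  // IV length commonly used with GCM
    private static final int TAG_BITS = 128; // authentication tag length

    // Generates a fresh 128-bit AES key (for the demo only).
    static SecretKey newKey() {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(128);
            return generator.generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Encrypts the value and returns Base64(IV || ciphertext || tag).
    static String encrypt(SecretKey key, String plain) {
        try {
            byte[] iv = new byte[IV_BYTES];
            new SecureRandom().nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            byte[] sealed = cipher.doFinal(plain.getBytes(StandardCharsets.UTF_8));
            byte[] out = new byte[iv.length + sealed.length];
            System.arraycopy(iv, 0, out, 0, iv.length);
            System.arraycopy(sealed, 0, out, iv.length, sealed.length);
            return Base64.getEncoder().encodeToString(out);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Reverses encrypt(): splits off the IV, then decrypts and verifies the rest.
    static String decrypt(SecretKey key, String encoded) {
        try {
            byte[] in = Base64.getDecoder().decode(encoded);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, in, 0, IV_BYTES));
            byte[] plain = cipher.doFinal(in, IV_BYTES, in.length - IV_BYTES);
            return new String(plain, StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SecretKey key = newKey();
        String masked = encrypt(key, "customer-42");
        System.out.println(masked);               // differs on every run (random IV)
        System.out.println(decrypt(key, masked)); // prints customer-42
    }
}
```

Unlike a hash, the original value can be recovered by anyone holding the key. With a random IV the same value encrypts differently each time, so the encrypted column cannot be used directly for joins.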

Reference:
http://datafu.incubator.apache.org/docs/datafu/guide/hashing.html
http://stackoverflow.com/questions/2479348/is-it-possible-to-get-identical-sha1-hash
https://github.com/nexr/hive-udf
https://issues.apache.org/jira/browse/HIVE-6329
https://github.com/sharethrough/hive-udfs
https://github.com/brndnmtthws/facebook-hive-udfs

Saturday, 6 December 2014
