When we don't want to expose a column in Hive to clients for privacy reasons, we have the following options.
1. Hash
hash(column) can generate collisions, because the hash value and its original value are not a one-to-one mapping.
For the same reason, the original value cannot be recovered from its hash value alone.
SHA-1 can collide in theory, but a collision is extremely unlikely for short strings. Git uses SHA-1 hashes as IDs, and there were still no known SHA-1 collisions as of 2014.
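As a rough illustration of the SHA-1 approach, the sketch below hashes a column value with the JDK's MessageDigest. The class name ColumnHasher and the method sha1Hex are hypothetical names chosen for this example; to call it from a Hive query it would still have to be wrapped in a UDF, which is not shown here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration only; not part of Hive.
final class ColumnHasher {

    // Returns the SHA-1 digest of the value as a 40-character lowercase hex string.
    static String sha1Hex(String value) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(value.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The same input always yields the same hash, so the masked column
        // can still be used for grouping and joining.
        System.out.println(sha1Hex("abc"));
    }
}
```

Because the mapping is deterministic, equal values stay equal after masking, which is usually what you want for joins but also means identical values remain recognizable as identical.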
2. Sequential numbers
Since Hive runs many mappers/reducers in parallel, there is no way to generate a globally unique, increasing row id, so plain sequential numbers do not work.
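3. UUID
select reflect("java.util.UUID", "randomUUID"), customer_name, address, unique_value from table_name
randomUUID generates a practically unique id for each row. However, the id is random rather than derived from the column value, so the table is hard to update: there is no way to guarantee that ids generated for new records never duplicate existing ones, and the same value gets a different id on every run.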
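The reflect call above simply invokes java.util.UUID.randomUUID() once per row. A minimal sketch of that behaviour in plain Java follows; the class name RowUuid and the method newId are hypothetical names for this example.

```java
import java.util.UUID;

// Hypothetical helper for illustration only.
final class RowUuid {

    // Returns a new random (version 4) UUID as a 36-character string.
    static String newId() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        // Two calls give two different ids, even when they are meant
        // for the same customer value.
        System.out.println(newId());
        System.out.println(newId());
    }
}
```

Since nothing ties the id to the original value, rerunning the query assigns fresh ids to every row, which is the practical obstacle to incremental updates.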
4. Task id + sequence number
Since a task id is unique within one Hadoop cluster and each task processes its own batch of records, it is easy to give every record in a task a unique sequence number. Combining the task id with that sequence number yields a unique row id.
https://github.com/myui/hivemall/blob/master/src/main/java/hivemall/tools/mapred/RowIdUDF.java
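The idea can be sketched in plain Java as below. This is not the hivemall implementation linked above: the class TaskRowIdGenerator, the "-" separator, and passing the task id in through the constructor are all assumptions made for this example. In a real UDF the task id would come from the job configuration.

```java
// Hypothetical sketch of "task id + sequence number"; not the hivemall code.
final class TaskRowIdGenerator {
    private final int taskId;   // assumed to be unique per task
    private long sequence = 0L; // local counter, only ever touched by this task

    TaskRowIdGenerator(int taskId) {
        this.taskId = taskId;
    }

    // Returns ids such as "3-1", "3-2", ... for the task with id 3.
    String next() {
        sequence++;
        return taskId + "-" + sequence;
    }

    public static void main(String[] args) {
        TaskRowIdGenerator task3 = new TaskRowIdGenerator(3);
        TaskRowIdGenerator task4 = new TaskRowIdGenerator(4);
        System.out.println(task3.next()); // 3-1
        System.out.println(task3.next()); // 3-2
        System.out.println(task4.next()); // 4-1
    }
}
```

No coordination between tasks is needed, because each counter is local and the task id prefix keeps ids from different tasks apart.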
5. Encrypt and Decrypt
http://stackoverflow.com/questions/26855003/generate-unique-customer-id-insert-unique-rows-in-hive
http://thinkbigdataanalytics.com/creating-udf-in-hive-hadoop/
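For the encrypt/decrypt option, here is a minimal sketch using the JDK's javax.crypto with AES in GCM mode. The post does not name an algorithm, so AES-GCM, the 128-bit key size, the class name ColumnCipher, and the Base64 output format are all assumptions for this example. Key storage and the UDF wrapper are left out.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Hypothetical helper for illustration only; algorithm choice is an assumption.
final class ColumnCipher {
    private static final int IV_BYTES = 12;  // recommended IV size for GCM
    private static final int TAG_BITS = 128; // authentication tag length
    private static final SecureRandom RANDOM = new SecureRandom();

    static SecretKey newKey() {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(128);
            return generator.generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Encrypts the value and returns Base64(iv || ciphertext || tag).
    static String encrypt(SecretKey key, String plain) {
        try {
            byte[] iv = new byte[IV_BYTES];
            RANDOM.nextBytes(iv); // fresh IV for every value
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            byte[] ct = cipher.doFinal(plain.getBytes(StandardCharsets.UTF_8));
            byte[] out = new byte[iv.length + ct.length];
            System.arraycopy(iv, 0, out, 0, iv.length);
            System.arraycopy(ct, 0, out, iv.length, ct.length);
            return Base64.getEncoder().encodeToString(out);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Reverses encrypt(); fails if the key is wrong or the data was modified.
    static String decrypt(SecretKey key, String encoded) {
        try {
            byte[] in = Base64.getDecoder().decode(encoded);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key,
                    new GCMParameterSpec(TAG_BITS, in, 0, IV_BYTES));
            byte[] plain = cipher.doFinal(in, IV_BYTES, in.length - IV_BYTES);
            return new String(plain, StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Unlike hashing, this is reversible for whoever holds the key. Note that the random IV makes the same value encrypt to a different string each time, so the encrypted column cannot be used directly as a join or grouping key.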
Reference:
http://datafu.incubator.apache.org/docs/datafu/guide/hashing.html
http://stackoverflow.com/questions/2479348/is-it-possible-to-get-identical-sha1-hash
https://github.com/nexr/hive-udf
https://issues.apache.org/jira/browse/HIVE-6329
https://github.com/sharethrough/hive-udfs
https://github.com/brndnmtthws/facebook-hive-udfs