Saturday 11 July 2015

Table Partitioning vs Bucketing


The main difference between Hive partitioning and bucketing:
When we partition a table, we create a separate partition for each unique value of the partition column, which can lead to situations where a very large number of tiny partitions is created. With bucketing, you instead choose a fixed number of buckets and decompose the data into them. In Hive, a partition is a directory, whereas a bucket is a file.
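To make the difference concrete, here is a minimal sketch (the table and column names are illustrative, not from the original post): the partitioned table gets one HDFS directory per distinct country value, while the bucketed table always has exactly 32 files regardless of how many distinct user_id values it holds.
-- One directory is created per distinct country value
CREATE TABLE logs_partitioned(user_id BIGINT, url STRING)
PARTITIONED BY(country STRING);
-- The data is always decomposed into exactly 32 files
CREATE TABLE logs_bucketed(user_id BIGINT, url STRING, country STRING)
CLUSTERED BY(user_id) INTO 32 BUCKETS;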

Drawbacks of Partitions:
A design that creates too many partitions may optimize some queries but be detrimental to other important queries.
Having too many partitions also means a large number of Hadoop files and directories are created unnecessarily, which adds overhead to the NameNode, since it must keep all of the file system metadata in memory.
Advantages of Bucketing:
The number of buckets is fixed, so it does not fluctuate with the data. If two tables are bucketed by employee_id, Hive can create a logically correct sample for queries across them. Bucketing also aids in doing efficient map-side joins, as sketched below.
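For instance, here is a rough sketch of bucket sampling and a bucket map-side join against the user_info_bucketed table defined below; the orders table (assumed to also be bucketed by user_id) and the settings shown are illustrative assumptions, not part of the original example.
-- Read only 1 of every 16 buckets instead of scanning the full table
SELECT * FROM user_info_bucketed TABLESAMPLE(BUCKET 1 OUT OF 16 ON user_id);
-- Ask Hive to try a bucket map-side join between two tables bucketed on the join key
set hive.optimize.bucketmapjoin = true;
SELECT a.user_id, b.order_id
FROM user_info_bucketed a JOIN orders b ON (a.user_id = b.user_id);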

Bucketing Operation
Bucketed tables are fantastic in that they allow much more efficient sampling than non-bucketed tables, and they can later enable time-saving operations such as map-side joins. However, the bucketing specified at table creation is not enforced when the table is written to, so it is possible for the table's metadata to advertise properties that are not upheld by the table's actual layout. This should obviously be avoided. Here's how to do it right.
First, create the table:
CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
COMMENT 'A bucketed copy of user_info'
PARTITIONED BY(ds STRING)
CLUSTERED BY(user_id) INTO 256 BUCKETS;
Note that we specify a column (user_id) on which to base the bucketing.
Then we populate the table:
set hive.enforce.bucketing = true;
FROM user_info
INSERT OVERWRITE TABLE user_info_bucketed
PARTITION (ds='2009-02-25')
SELECT user_id, firstname, lastname WHERE ds='2009-02-25';
The command set hive.enforce.bucketing = true; allows the correct number of reducers and the CLUSTER BY column to be selected automatically based on the table definition. Otherwise, you would need to set the number of reducers yourself to match the number of buckets (e.g. set mapred.reduce.tasks = 256;) and add a CLUSTER BY ... clause to the SELECT.
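For completeness, a sketch of that manual alternative (assuming the same user_info source table as above) would look roughly like this:
set mapred.reduce.tasks = 256;
FROM user_info
INSERT OVERWRITE TABLE user_info_bucketed
PARTITION (ds='2009-02-25')
SELECT user_id, firstname, lastname WHERE ds='2009-02-25'
CLUSTER BY user_id;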
How does Hive distribute the rows across the buckets? In general, the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets. The hash_function depends on the type of the bucketing column.
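As a quick illustration: for an integer column the hash is effectively the value itself, so a row with user_id = 1001 written into the 256-bucket table above lands in bucket 1001 mod 256 = 233, and every row with that user_id always ends up in the same bucket file.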
