Thursday 9 July 2015

Hive Skew Tables

This feature can improve performance for tables in which one or more columns have skewed values. When you specify the values that appear very often (heavy skew), Hive automatically splits them out into separate files (or directories, in the case of list bucketing) and takes this into account during queries, so that it can skip or include a whole file (or directory) where possible.
Skew is specified per table, at table creation time.
Here is an example with one column that has three skewed values. The optional STORED AS DIRECTORIES clause turns on list bucketing; the square brackets only mark the clause as optional and are not part of the statement:
CREATE TABLE list_bucket_single (key STRING, value STRING)
  SKEWED BY (key) ON (1,5,6) [STORED AS DIRECTORIES];
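
For something that can be pasted as-is, here is a minimal sketch of the same statement with the optional clause written out. The table name list_bucket_single_dirs is made up for illustration, and the skewed values are quoted as string literals on the assumption that they should match the STRING type of key. The SELECT shows the kind of filter on a skewed value that the feature is aimed at; whether Hive actually skips files or directories for it depends on your version and settings.

```sql
-- Hypothetical table name; same columns and skewed values as the example above.
-- Values are quoted here as an assumption, because key is declared STRING.
CREATE TABLE list_bucket_single_dirs (key STRING, value STRING)
  SKEWED BY (key) ON ('1', '5', '6')
  STORED AS DIRECTORIES;

-- A filter on one of the declared skewed values.
SELECT value
FROM list_bucket_single_dirs
WHERE key = '5';
```

Skew can also be declared on more than one column, in which case each skewed entry is a tuple with one value per column. The table and column names below are again only illustrative:

```sql
-- Two skewed columns: each entry in ON (...) is a (col1, col2) pair.
CREATE TABLE list_bucket_multiple (col1 STRING, col2 INT, col3 STRING)
  SKEWED BY (col1, col2) ON (('s1', 1), ('s3', 3), ('s11', 1), ('s16', 6))
  STORED AS DIRECTORIES;
```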
