Sunday 9 November 2014

Predicate Pushdown in the Hive ORC Format



The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data.
Data stored in ORCFile can be read or written through HCatalog, so any Pig or Map/Reduce process can play along seamlessly.

Hive 0.12 adds predicate pushdown for ORC: WHERE predicates can be pushed down and evaluated in the storage layer itself. This is controlled by the setting hive.optimize.ppd=true, which should be true by default. For a query that filters customers by state, for example, the ORCFile reader will now return only the rows that actually match the WHERE predicate and skip customers residing in any other state.
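To make the customer-by-state case concrete, here is a minimal sketch of such a query. The table name customer and its columns are hypothetical, made up for illustration, and the hive.optimize.index.filter line is an assumption about your Hive version rather than something the setting above requires.

```sql
-- Hypothetical table: customer(id BIGINT, name STRING, state STRING), stored as ORC.

-- Predicate pushdown switch; should already be true by default.
SET hive.optimize.ppd=true;

-- Assumption: some Hive releases also need this flag before the ORC reader
-- uses its built-in indexes to skip data. Check the docs for your version.
SET hive.optimize.index.filter=true;

-- The WHERE predicate is what gets handed down to the ORC reader.
SELECT COUNT(*)
FROM customer
WHERE state = 'CA';
```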

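Pushdown works best when the data is sorted on the column you filter by, since matching rows then sit together and the reader can skip larger parts of the file. You can force Hive to sort on a column by using the SORTED BY clause (together with CLUSTERED BY) when creating the table and setting hive.enforce.sorting to true before inserting into the table.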

CREATE TABLE mytable (
...
) STORED AS orc tblproperties ("orc.compress"="SNAPPY");
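The sorting advice can be sketched roughly as follows. In Hive DDL the clause is spelled SORTED BY and goes together with CLUSTERED BY. The table customer_sorted, its columns, the source table customer, and the bucket count of 16 are all hypothetical choices for illustration, not values from the references.

```sql
-- Hypothetical sorted ORC table; column names and bucket count are made up.
CREATE TABLE customer_sorted (
  id BIGINT,
  name STRING,
  state STRING
)
CLUSTERED BY (id) SORTED BY (state) INTO 16 BUCKETS
STORED AS orc tblproperties ("orc.compress"="SNAPPY");

-- Ask Hive to honor the bucketing and sort order declared above on insert.
SET hive.enforce.bucketing=true;
SET hive.enforce.sorting=true;

-- Populate from a hypothetical unsorted source table.
INSERT OVERWRITE TABLE customer_sorted
SELECT id, name, state
FROM customer;
```

With the rows in each bucket ordered by state, a filter such as WHERE state = 'CA' should touch a smaller part of each file than it would on unsorted data.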

References:
http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/orcfile.html
