Alvin's Big Data Notebook : ORCFile Format

ORC (Optimized Row Columnar) supports ACID transactions and snapshot isolation, built-in indexes, and complex types.

Higher Compression

ORCFile was introduced in Hive 0.11 and offers excellent compression, delivered through a number of techniques including run-length encoding, dictionary encoding for strings, and bitmap encoding.

This focus on efficiency leads to some impressive compression ratios, reportedly much better than Parquet's.
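To make two of the techniques named above concrete, here is a minimal Python sketch of dictionary encoding followed by run-length encoding. It is a simplified illustration, not ORC's actual on-disk encoding; the function names and the sample `states` column are assumptions made up for this example.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def dictionary_encode(strings):
    """Replace each string with an integer id into a list of distinct values."""
    dictionary = []  # distinct strings in first-seen order
    ids = {}         # string -> position in the dictionary
    encoded = []
    for s in strings:
        if s not in ids:
            ids[s] = len(dictionary)
            dictionary.append(s)
        encoded.append(ids[s])
    return dictionary, encoded


# Hypothetical 'state' column, ordered so equal values sit next to each other.
states = ["CA", "CA", "CA", "NY", "NY", "TX"]
dictionary, encoded = dictionary_encode(states)
runs = run_length_encode(encoded)
```

Six strings shrink to a three-entry dictionary plus three (id, run length) pairs. The gain grows with the amount of repetition in the column, which is why low-cardinality string columns tend to compress well.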

Column Pruning

ORC saves IO bandwidth by only touching required columns, and requires significantly fewer seek operations because all columns within a single stripe are stored together on disk.
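The idea can be sketched in Python with a toy in-memory stripe that keeps each column's values together. The `stripe` layout and the `read_columns` helper are hypothetical and only illustrate the principle; a real ORC reader works on compressed byte streams.

```python
# A hypothetical stripe: each column's values are stored contiguously.
stripe = {
    "id": [1, 2, 3],
    "name": ["Ann", "Bob", "Cy"],
    "state": ["CA", "NY", "CA"],
}


def read_columns(stripe, wanted):
    """Build rows from the requested columns only; other columns are never touched."""
    columns = [stripe[name] for name in wanted]
    return [dict(zip(wanted, row)) for row in zip(*columns)]


# Only 'id' and 'state' are read; the 'name' column is skipped entirely.
rows = read_columns(stripe, ["id", "state"])
```

In a row-oriented file the reader would have to scan past every `name` value to reach the next `id`; here the unused column is simply never opened.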

Predicate Pushdown

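Without pushdown, a query engine reads every row from storage and only then filters out the rows that do not match. Hive 0.12 optimizes this by allowing predicates to be pushed down and evaluated in the storage layer itself. It's controlled by the setting hive.optimize.ppd=true.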

This requires a reader that is smart enough to understand the predicates. Fortunately, ORC has had the corresponding improvements to allow predicates to be pushed into it, and it takes advantage of its inline indexes to deliver performance benefits.

For example if you have a SQL query like:

SELECT COUNT(*) FROM CUSTOMER WHERE CUSTOMER.state = 'CA';

The ORCFile reader will now only return rows that actually match the WHERE predicates and skip customers residing in any other state. The more columns you read from the table, the more data marshaling you avoid and the greater the speedup.

ORC avoids this type of overhead by performing predicate pushdown with its built-in indexes: the filter is moved into the data loading phase, so only data that may contain the required rows is read.

Inline Indexes

ORCFile breaks rows into row groups and applies columnar compression and indexing within these row groups. With these built-in indexes, ORC can perform predicate pushdown and avoid reading data that cannot match. ORC provides three levels of indexes within each file: file level, stripe level, and row level. The file-level and stripe-level statistics are stored in the file footer, so they are easy to access when determining whether the rest of the file needs to be read at all. Row-level indexes include both column statistics for each row group and the position for seeking to the start of the row group. ORC uses these indexes to move the filter operation into the data loading phase, reading only the data that potentially includes the required rows.

The combination of indexed data and columnar storage reduces disk IO significantly.
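The row-group skipping described in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the ORC reader: the `RowGroup` class, `make_group`, and `count_matching` are hypothetical names, and the only statistics modelled are a minimum and maximum per row group.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    min_state: str  # smallest value in this group (the index statistic)
    max_state: str  # largest value in this group (the index statistic)
    states: list    # the actual column values


def make_group(states):
    """Build a row group and record its min/max statistics."""
    return RowGroup(min(states), max(states), states)


def count_matching(row_groups, target):
    """Count rows equal to target, skipping groups whose range excludes it."""
    matched = 0
    groups_read = 0
    for group in row_groups:
        if target < group.min_state or target > group.max_state:
            continue  # the statistics prove no row in this group can match
        groups_read += 1
        matched += sum(1 for s in group.states if s == target)
    return matched, groups_read


row_groups = [
    make_group(["AZ", "CA", "CA"]),
    make_group(["NY", "OR"]),
    make_group(["TX", "WA"]),
]
matched, groups_read = count_matching(row_groups, "CA")
```

For the `state = 'CA'` predicate, only the first group has a range that can contain 'CA', so the other two are skipped without decoding any of their rows. The statistics can only rule groups out, never confirm a match, so every group that passes the check still has to be filtered row by row.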

References:
http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/
http://hortonworks.com/blog/bringing-orc-support-into-apache-spark/
https://databricks.com/blog/2015/07/16/joint-blog-post-bringing-orc-support-into-apache-spark.html


Friday, 10 July 2015
