Wednesday, 8 July 2015

Hive Optimization Tips

  1. Divide data among different files that can be pruned out by using partitions, buckets, and skew
  2. Use the ORC file format
  3. Sort and bucket on common join keys
  4. Use map (broadcast) joins whenever possible
  5. Increase the replication factor for hot data (which reduces latency)
  6. Take advantage of Tez
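
As a rough illustration of tips 1 to 3, here is a minimal sketch of a table that is partitioned, bucketed, sorted on its join key, and stored as ORC. The table name page_views, its columns, and the bucket count of 32 are hypothetical and only there for the example; pick values that fit your own data.

```sql
-- Hypothetical table: names, columns and bucket count are assumptions.
CREATE TABLE page_views (
  user_id   BIGINT,
  url       STRING,
  view_time TIMESTAMP
)
PARTITIONED BY (view_date STRING)                            -- tip 1: one directory per date
CLUSTERED BY (user_id) SORTED BY (user_id) INTO 32 BUCKETS   -- tip 3: bucket and sort on the join key
STORED AS ORC;                                               -- tip 2: ORC file format

-- Filtering on the partition column lets Hive skip every other partition.
SELECT COUNT(*)
FROM page_views
WHERE view_date = '2015-07-08';
```

Because the filter is on view_date, only the files under that one partition should need to be read.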

Hive Query Tunings


In addition, set the important join and bucket properties to true in hive-site.xml or by using the set command.
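
The post does not list the properties it has in mind, so the following is only a guess at a typical set for Hive releases of this era (0.13 to 1.x). Check each name against your Hive version, since some of them were removed or defaulted to true in later releases.

```sql
-- Assumed set of join and bucket properties; not taken from the original post.
SET hive.auto.convert.join=true;                   -- convert joins to map joins when a table is small enough
SET hive.optimize.bucketmapjoin=true;              -- use bucket map joins on bucketed tables
SET hive.optimize.bucketmapjoin.sortedmerge=true;  -- use sort-merge-bucket joins when buckets are sorted
SET hive.enforce.bucketing=true;                   -- honor bucketing when inserting into bucketed tables
SET hive.enforce.sorting=true;                     -- honor SORTED BY when inserting into bucketed tables
```

Settings issued with the set command apply only to the current session; putting them in hive-site.xml makes them the default.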

Apache Tez

Tez (Hindi for “speed”) provides a general-purpose, highly customizable framework that simplifies data-processing tasks across both small-scale (low-latency) and large-scale (high-throughput) workloads in Hadoop. It generalizes the MapReduce paradigm into a more powerful framework by providing the ability to execute a complex DAG of tasks for a single job.

Tez models data processing as a dataflow graph with vertices in the graph representing application logic and edges representing movement of data. Tez models the user logic running in each vertex of the dataflow graph as a composition of Input, Processor and Output modules. This dataflow pipeline can be expressed as a single Tez job that will run the entire computation. Expanding this logical graph into a physical graph of tasks and executing it is taken care of by Tez.

Tez keeps a pool of containers that can be reused across tasks.
Tez avoids unneeded writes to HDFS.

Running Hive on Tez

We can run Hive on Tez and take advantage of Directed Acyclic Graph (DAG) execution, which represents the query as a single graph instead of multiple MapReduce stages that involve a lot of synchronization, barriers, and I/O overhead. Tez improves on this by not writing intermediate data sets to HDFS between stages, keeping them in memory where possible.
Use the following command to set the execution engine to Tez:
set hive.execution.engine=tez;
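
One way to check that a query is actually planned for Tez is to look at its EXPLAIN output. This is a small sketch that assumes the hvac table used later in this post already exists; the exact plan text varies by Hive version.

```sql
-- Assumes a table named hvac exists (it is the source table used later in this post).
SET hive.execution.engine=tez;
EXPLAIN SELECT COUNT(*) FROM hvac;
```

With Tez as the engine, the plan should describe a single Tez stage made up of vertices and edges rather than a chain of MapReduce stages.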

Run Pig on Tez:

pig -x tez test.pig

What is Vectorization?

When the vectorization feature is enabled, Hive processes rows in batches of about 1,000 (1,024 by default) instead of one row at a time. As a result, it can run up to 3X faster with less CPU time, which improves cluster utilization. Together with the extensive container use and reuse in Tez, it helps address the latency problem in Hive. Vectorization works only on Hive tables stored in the ORC file format.

create table hvac_orc 
stored as orc tblproperties ("orc.compress"="SNAPPY")
as select * from hvac;

set hive.execution.engine=tez;
set hive.vectorized.execution.enabled=true;
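
To see whether a query against the ORC table is being vectorized, you can inspect its plan once hive.vectorized.execution.enabled is true. This is only a sketch; the wording of the plan differs between Hive versions.

```sql
-- Uses the hvac_orc table created above.
EXPLAIN SELECT COUNT(*) FROM hvac_orc;
```

In the plan, vectorized operators are typically marked with a line such as "Execution mode: vectorized". If that line is missing, the query is probably falling back to row-at-a-time processing.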

