Alvin's Big Data Notebook : Hive Optimization Tips

Divide data among different files that can be pruned out by using partitions, buckets, and skews
Use the ORC file format
Sort and bucket on common join keys
Use map (broadcast) joins whenever possible
Increase the replication factor for hot data (which reduces latency)
Take advantage of Tez
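
As a rough illustration of the first few tips, the sketch below creates a table that is partitioned, bucketed and sorted on a join key, and stored as ORC. The table name (page_views), the column names and the bucket count are hypothetical and only serve the example; the right partition column and bucket count depend on your data.

```sql
-- Hypothetical table: names, types and the bucket count are illustrative only.
-- Partitioning by view_date lets Hive skip whole directories;
-- bucketing and sorting on user_id (an assumed join key) helps bucketed joins.
create table page_views (
  user_id bigint,
  url string,
  view_time timestamp
)
partitioned by (view_date string)
clustered by (user_id) sorted by (user_id) into 32 buckets
stored as orc;

-- Let Hive convert joins against small tables into map (broadcast) joins.
set hive.auto.convert.join=true;

-- A filter on the partition column should read only the matching partition.
select count(*) from page_views where view_date = '2015-07-08';
```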

Hive Query Tuning

mapreduce.input.fileinputformat.split.maxsize

mapreduce.input.fileinputformat.split.minsize

mapreduce.task.io.sort.mb

In addition, set the important join and bucket properties (for example hive.auto.convert.join and hive.optimize.bucketmapjoin) to true, either in hive-site.xml or per session by using the set command.
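
A minimal sketch of what these session-level settings might look like follows. The numeric values are placeholders rather than recommendations, and the list of join and bucket properties is an assumption about which ones are meant here. The sort buffer property is written as mapreduce.task.io.sort.mb, which is the name Hadoop 2 uses.

```sql
-- Split sizes are given in bytes (here 256 MB and 128 MB, placeholder values).
set mapreduce.input.fileinputformat.split.maxsize=268435456;
set mapreduce.input.fileinputformat.split.minsize=134217728;

-- Sort buffer size is given in megabytes (placeholder value).
set mapreduce.task.io.sort.mb=256;

-- Join and bucket properties; assumed to be the ones the text refers to.
set hive.auto.convert.join=true;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.auto.convert.sortmerge.join=true;

-- Only relevant before Hive 2.0, where bucketing and sorting
-- were not enforced on insert by default.
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
```

The same properties can be made permanent by adding them to hive-site.xml instead of setting them per session.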

Apache Tez

Tez (Hindi for "speed") provides a general-purpose, highly customizable framework that simplifies data-processing tasks across both small-scale (low-latency) and large-scale (high-throughput) workloads in Hadoop. It generalizes the MapReduce paradigm into a more powerful framework by providing the ability to execute a complex DAG (directed acyclic graph) of tasks for a single job.

Tez models data processing as a dataflow graph with vertices in the graph representing application logic and edges representing movement of data. Tez models the user logic running in each vertex of the dataflow graph as a composition of Input, Processor and Output modules. This dataflow pipeline can be expressed as a single Tez job that will run the entire computation. Expanding this logical graph into a physical graph of tasks and executing it is taken care of by Tez.

Tez keeps a pool of containers that it reuses across tasks.
Tez avoids unneeded writes to HDFS.

Running Hive on Tez

We can run Hive on Tez so that a query executes as a single Directed Acyclic Graph (DAG) instead of as multiple stages of MapReduce programs, which involve a lot of synchronization, barriers and I/O overhead. Tez reduces this overhead by keeping intermediate data sets in memory where possible instead of writing them to disk.

Use the following setting to switch the execution engine to Tez:

set hive.execution.engine=tez;
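
One way to see the difference, sketched here on the assumption that the hvac table used later in this post already exists, is to compare the query plan under both engines. On Tez the plan is typically a single Tez stage made of vertices and edges, while on MapReduce the same query may be split into several stages.

```sql
-- Plan under Tez.
set hive.execution.engine=tez;
explain select count(*) from hvac;

-- Plan under classic MapReduce, for comparison.
set hive.execution.engine=mr;
explain select count(*) from hvac;
```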

Run Pig on Tez:

pig -x tez test.pig

What is Vectorization?

When the vectorization feature is enabled, Hive processes rows in batches of 1,024 instead of one row at a time. Queries can run up to 3X faster with less CPU time, which results in improved cluster utilization. Together with the extensive container use and reuse in Tez, it is meant to address the latency problem in Hive. At the time of writing, vectorization works only on Hive tables stored in the ORC file format.

create table hvac_orc
stored as orc tblproperties ("orc.compress"="SNAPPY")
as select * from hvac;

set hive.execution.engine=tez;

set hive.vectorized.execution.enabled=true;
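
To check whether a query is actually vectorized, one option is to inspect its plan. The sketch below assumes the hvac_orc table created above. Note that set with no value only prints the current setting, so the property is assigned true explicitly here. When vectorization applies, the plan usually contains a line such as "Execution mode: vectorized", although the exact output depends on the Hive version.

```sql
-- Enable vectorized execution for this session.
set hive.vectorized.execution.enabled=true;

-- Look for "Execution mode: vectorized" in the plan output.
explain select count(*) from hvac_orc;
```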

Reference:
http://hortonworks.com/hadoop-tutorial/supercharging-interactive-queries-hive-tez/
https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution

Wednesday, 8 July 2015
