Friday 8 August 2014

Commonly Used Settings to Optimize Pig Scripts

set default_parallel 10;                -- default number of reducers for all jobs in this script
set job.priority HIGH;                  -- scheduling priority of the generated MapReduce jobs
set io.sort.mb 1024;                    -- buffer size (MB) used when sorting map output
set mapred.child.java.opts -Xmx4096m;   -- heap size for map/reduce child JVMs
set mapred.compress.map.output true;    -- compress intermediate map output
set mapred.map.output.compression.codec com.hadoop.compression.lzo.LzopCodec; -- codec for map output
set pig.cachedbag.memusage 0.15;        -- fraction of the heap allocated to cached bags
set io.sort.factor 100;                 -- number of streams merged at once while sorting
set opt.multiquery false;               -- disable the multi-query optimization
set mapred.task.timeout 1800000;        -- task timeout in milliseconds (30 minutes)
set pig.maxCombinedSplitSize 134217728; -- combine small inputs into splits of up to 128 MB

Combine Small Input Files

We can't directly set the number of mappers in a Pig script; there is no equivalent of set mapred.map.tasks.
Processing input (either user input or intermediate data) from many small files can be inefficient because a separate map task is created for each file. Pig can combine small files so that they are processed as a single map.
You can set the values for these properties (see the sketch after this list):
  • pig.maxCombinedSplitSize – Specifies the size, in bytes, of data to be processed by a single map. Smaller files are combined until this size is reached.
  • pig.splitCombination – Turns split-file combining on or off (set to “true” by default).
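
For example, a minimal sketch of a script header that packs small input files into roughly 128 MB splits (the input path, relation name, and schema below are hypothetical):

-- combine small files into splits of up to 128 MB each
set pig.splitCombination true;
set pig.maxCombinedSplitSize 134217728;

-- hypothetical input directory containing many small log files
logs = LOAD '/data/raw/small_logs' USING PigStorage('\t') AS (ts:chararray, msg:chararray);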

Use the Parallel Features

You can set the number of reduce tasks for the MapReduce jobs generated by Pig using two parallel features: the default_parallel setting at the script level, and the PARALLEL clause at the operator level. (The parallel features only affect the number of reduce tasks; map parallelism is determined by the input file, one map for each HDFS block.)
Set the Number of Reducers
Use the set default_parallel command to set the number of reducers at the script level, and the PARALLEL clause to override it for an individual operator.
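A short sketch of both features together, assuming a hypothetical relation of (user, amount) records:

set default_parallel 20;              -- 20 reducers for every job in this script

orders  = LOAD '/data/orders' USING PigStorage(',') AS (user:chararray, amount:double);
grouped = GROUP orders BY user PARALLEL 40;   -- PARALLEL overrides default_parallel for this operator
totals  = FOREACH grouped GENERATE group, SUM(orders.amount);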
Reference:
https://pig.apache.org/docs/r0.11.1/perf.html#memory-management

2 comments:

  1. If I want mappers = 10, then what should my pig.maxCombinedSplitSize be, in bytes?

  2. @Sagar, it depends on the input size. Set the split size so that the input divides into 10 parts (roughly the input size in bytes / 10).
