Sunday, 21 September 2014

MRv2 and Yarn Memory Configuration

The final calculation determines the amount of RAM per container:

RAM-per-Container = max(MIN_CONTAINER_SIZE, (Total Available RAM) / Containers)
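As a quick sanity check, the formula can be written in a couple of lines of Python (the 2048 MB minimum and the 30720 MB usable-RAM figure below are illustrative inputs, not prescriptions):

```python
# Minimum container size; 2048 MB is the value typically suggested for nodes
# in the 24-72 GB RAM range (an assumed example input here).
MIN_CONTAINER_SIZE = 2048  # MB

def ram_per_container(total_available_ram_mb, containers):
    """RAM-per-Container = max(MIN_CONTAINER_SIZE, Total Available RAM / Containers)."""
    return max(MIN_CONTAINER_SIZE, total_available_ram_mb // containers)

print(ram_per_container(30720, 9))   # 30720 // 9 = 3413 -> 3413
print(ram_per_container(30720, 20))  # 30720 // 20 = 1536 -> floor of 2048 applies
```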

With these calculations, the YARN and MapReduce configurations can be set:

Configuration File     Configuration Setting                   Value Calculation
yarn-site.xml          yarn.nodemanager.resource.memory-mb     = Containers * RAM-per-Container
yarn-site.xml          yarn.scheduler.minimum-allocation-mb    = RAM-per-Container
yarn-site.xml          yarn.scheduler.maximum-allocation-mb    = Containers * RAM-per-Container
mapred-site.xml        mapreduce.map.memory.mb                 = RAM-per-Container
mapred-site.xml        mapreduce.reduce.memory.mb              = 2 * RAM-per-Container
mapred-site.xml        mapreduce.map.java.opts                 = 0.8 * RAM-per-Container
mapred-site.xml        mapreduce.reduce.java.opts              = 0.8 * 2 * RAM-per-Container
mapred-site.xml        yarn.app.mapreduce.am.resource.mb       = 2 * RAM-per-Container
mapred-site.xml        yarn.app.mapreduce.am.command-opts      = 0.8 * 2 * RAM-per-Container
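The table can be condensed into a small helper that derives every setting from the two computed inputs. The property names are the real Hadoop keys; the helper function itself is just an illustrative sketch:

```python
def yarn_mr_settings(containers, ram_per_container_mb):
    """Derive the YARN/MapReduce memory settings from the table above."""
    heap = "-Xmx%dm" % int(0.8 * ram_per_container_mb)          # 80% of one container
    am_heap = "-Xmx%dm" % int(0.8 * 2 * ram_per_container_mb)   # 80% of two containers
    return {
        "yarn.nodemanager.resource.memory-mb":  containers * ram_per_container_mb,
        "yarn.scheduler.minimum-allocation-mb": ram_per_container_mb,
        "yarn.scheduler.maximum-allocation-mb": containers * ram_per_container_mb,
        "mapreduce.map.memory.mb":              ram_per_container_mb,
        "mapreduce.reduce.memory.mb":           2 * ram_per_container_mb,
        "mapreduce.map.java.opts":              heap,
        "mapreduce.reduce.java.opts":           am_heap,
        "yarn.app.mapreduce.am.resource.mb":    2 * ram_per_container_mb,
        "yarn.app.mapreduce.am.command-opts":   am_heap,
    }

for name, value in yarn_mr_settings(9, 3072).items():
    print(name, "=", value)
```

With containers=9 and 3072 MB per container this reproduces the 27648 MB node total and the -Xmx2457m heap used later in the post.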



Using the calculator script provided by Hortonworks, we get the recommended memory
settings for YARN:

 Using cores=16 memory=32GB disks=5 hbase=False
 Profile: cores=16 memory=31744MB reserved=1GB usableMem=31GB disks=5
 Num Container=9
 Container Ram=3072MB
 Used Ram=27GB
 Unused Ram=1GB
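The arithmetic behind that output can be approximated as follows. This is a sketch of the Hortonworks logic (container count bounded by cores, disks, and memory; container size rounded down to a 512 MB multiple) rather than the exact script, whose reserved-memory tables vary by version:

```python
import math

def containers_and_ram(cores, usable_mem_mb, disks, min_container_mb=2048):
    # Containers = min(2 * CORES, 1.8 * DISKS, usable RAM / MIN_CONTAINER_SIZE)
    containers = int(min(2 * cores,
                         math.ceil(1.8 * disks),
                         usable_mem_mb // min_container_mb))
    # RAM per container: at least the minimum, rounded down to a 512 MB multiple
    ram = max(min_container_mb, usable_mem_mb // containers)
    ram = (ram // 512) * 512
    return containers, ram

# 31 GB usable after the 1 GB OS reservation from the profile above
containers, ram = containers_and_ram(cores=16, usable_mem_mb=30720, disks=5)
print(containers, ram)  # 9 3072 -- matches Num Container and Container Ram above
```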

Set the following properties in /etc/hadoop/conf/yarn-site.xml and mapred-site.xml.
If you configure through Cloudera Manager instead, go to Configuration, search for each property, and set its value in the Cloudera page rather than in the XML files. Then click 'Save Changes', choose Actions -> Deploy Client Configuration, and restart the service.

//set the minimum unit of RAM to allocate for a Container
 yarn.scheduler.minimum-allocation-mb=3072
 yarn.scheduler.maximum-allocation-mb=27648

//set the maximum memory Yarn can utilize on each node
 yarn.nodemanager.resource.memory-mb=27648

//since each map or reduce task runs in a separate container, its memory should be at least as large as one container
 mapreduce.map.memory.mb=3072
 mapreduce.reduce.memory.mb=3072

//each container runs a JVM for its map or reduce task, so the JVM heap size should be 75-80% of the map/reduce memory settings above
 mapreduce.map.java.opts=-Xmx2457m
 mapreduce.reduce.java.opts=-Xmx2457m

 yarn.app.mapreduce.am.resource.mb=3072
 yarn.app.mapreduce.am.command-opts=-Xmx2457m
 mapreduce.task.io.sort.mb=1228
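If you are editing the XML files directly, each of these settings uses the standard Hadoop property form; for example, the minimum allocation in yarn-site.xml (the same pattern applies to the rest):

```xml
<!-- yarn-site.xml: the smallest unit of RAM YARN will allocate to a container -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>3072</value>
</property>
```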

The java.opts parameters set the amount of memory that will be used for the map/reduce child tasks, whereas the memory.mb values are the amount of memory that is carved out for the container. You therefore want to ensure there is enough overhead for the MR2 framework to run within the container alongside the user code. We recommend making the container size larger than the Java opts, with the heap at 75-80% of the container size.


One other thing: the "Client Java Heap Size in Bytes" is the Java client heap size, i.e. the heap defined on the node where the job is submitted. For example, if the customer runs the job from an edge node, the client Java heap size sets the amount of Java heap that node gets when it submits the job. Typically the client Java heap size doesn't need to be larger than the default, and rarely larger than 1-2 GB.


yarn.nodemanager.resource.cpu-vcores=8 (the number of CPU cores on each node available to all containers)

Reducing the maximum number of vcores helps limit the number of containers created on a node.
Containers are limited by a combination of the cpu-vcores setting and memory: if you allocate only 1 vcore, you'll get only 1 container, no matter how much RAM is available.

You are also limited by the number of vcores a given NodeManager has.  By default each task uses 1 vcore, so if your NodeManager has 24 vcores, it can never have more than 24 containers regardless of how much memory it has.
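Putting the two limits together, the effective container count per node is the smaller of the memory-based and vcore-based bounds. A minimal sketch, assuming the default of 1 vcore per task and the 3072 MB container size from above:

```python
def max_containers(node_mem_mb, node_vcores, task_mem_mb=3072, task_vcores=1):
    """A node runs only as many containers as BOTH its memory and its vcores allow."""
    return min(node_mem_mb // task_mem_mb, node_vcores // task_vcores)

print(max_containers(27648, 8))   # memory allows 9, vcores allow 8 -> 8
print(max_containers(27648, 1))   # a single vcore -> 1 container regardless of RAM
```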

Disk latency became the bottleneck at the higher vcores settings (spiking to 500 ms).

You should also consider using CM Role Groups so that you can set some of your nodes to 16 and the others to 24.  You can find out more about role groups here: http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v4-latest/Cloudera-Manager-Managing-Clusters/cmmc_role_grps.html


Reference:
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.9.1/bk_installing_manually_book/content/rpm-chap1-11.html
http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/
