Thursday, 18 September 2014

Introduction to YARN (Yet Another Resource Negotiator)


In MR1, each node was configured with a fixed number of map slots and a fixed number of reduce slots. Under YARN, there is no distinction between resources available for maps and resources available for reduces – all resources are available for both.

Virtual Cores

To better handle varying CPU requests, YARN supports virtual cores (vcores), a resource meant to express parallelism. The “virtual” in the name is somewhat misleading: on the NodeManager, vcores should simply be configured equal to the number of physical cores on the machine, and tasks should request a number of vcores equal to the number of cores they can saturate at once. Currently vcores are very coarse; tasks will rarely want to ask for more than one of them, but a complementary axis that represents processing power will likely be added in the future to enable finer-grained resource configuration.
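For a MapReduce job on YARN, per-task vcore and memory requests are expressed through job configuration rather than slots. Below is a minimal sketch of setting them programmatically; the property names are the standard MR2 ones, but the values are purely illustrative, not recommendations.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class VcoreRequestSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Per-task resource requests in MR2: memory plus vcores instead of slots.
        // A map task that can saturate one core asks for 1 vcore; sizes are illustrative.
        conf.setInt("mapreduce.map.cpu.vcores", 1);
        conf.setInt("mapreduce.map.memory.mb", 1536);
        conf.setInt("mapreduce.reduce.cpu.vcores", 1);
        conf.setInt("mapreduce.reduce.memory.mb", 3072);
        Job job = Job.getInstance(conf, "vcore-demo");
        // ... set mapper, reducer, input and output here before job.waitForCompletion(true).
    }
}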

In MR1, the mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum properties dictated how many map and reduce slots each TaskTracker had. These properties no longer exist in YARN. Instead, YARN uses yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores, which control the total amount of memory and CPU on each node, shared by both maps and reduces.
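These are per-node settings that operators normally place in yarn-site.xml. As a small illustration, here is a sketch that simply reads them back; the fallback defaults shown (8192 MB, 8 vcores) are the stock yarn-default.xml values in Hadoop 2.x.

import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class NodeResourceSketch {
    public static void main(String[] args) {
        // Picks up yarn-site.xml / yarn-default.xml from the classpath.
        YarnConfiguration conf = new YarnConfiguration();
        // The two per-node knobs that replace MR1's fixed map/reduce slot counts.
        int memMb  = conf.getInt("yarn.nodemanager.resource.memory-mb", 8192);
        int vcores = conf.getInt("yarn.nodemanager.resource.cpu-vcores", 8);
        System.out.println("NodeManager capacity: " + memMb + " MB, " + vcores + " vcores");
    }
}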

The fundamental idea of MRv2's YARN architecture is to split the two primary responsibilities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons:
  • a global ResourceManager (RM), which itself consists of a Scheduler and an ApplicationsManager, and
  • per-application ApplicationMasters (AMs).




Below are the duties of each component:


  • The Scheduler is responsible for allocating resources to the various running applications, subject to familiar constraints such as capacities and queues.
  • The ApplicationsManager is responsible for accepting job submissions, negotiating the first container for executing the application-specific ApplicationMaster, and restarting the ApplicationMaster container on failure (a client-side sketch of this submission path follows this list).
  • The NodeManager (the YARN counterpart of the TaskTracker) is the per-machine framework agent responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting it to the ResourceManager/Scheduler.
  • The per-application ApplicationMaster (itself running in a container) is responsible for negotiating appropriate resource containers from the ResourceManager at runtime, tracking their status and monitoring progress.
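
To make the submission path concrete, here is a minimal sketch of a YARN client asking the ApplicationsManager to launch an ApplicationMaster. The application name, queue, resource sizes and launch command are placeholders; a real client would also ship jars and environment settings as local resources.

import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitSketch {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the ResourceManager (ApplicationsManager) for a new application id.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("yarn-intro-demo");

        // Describe how to launch the ApplicationMaster in its first container.
        // The command is illustrative; a real AM would be a long-running program.
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList("echo launching application master"));
        appContext.setAMContainerSpec(amContainer);

        // Resources requested for the AM container itself, plus the target queue.
        appContext.setResource(Resource.newInstance(1024, 1));
        appContext.setQueue("default");

        ApplicationId appId = yarnClient.submitApplication(appContext);
        System.out.println("Submitted application " + appId);
        yarnClient.stop();
    }
}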


With YARN and MapReduce 2, there are no longer pre-configured static slots for Map and Reduce tasks. The entire cluster is available for dynamic resource allocation of Maps and Reduces as needed by the job.

YARN takes into account all the available compute resources on each machine in the cluster. Based on the available resources, YARN will negotiate resource requests from applications (such as MapReduce) running in the cluster. 
YARN then provides processing capacity to each application by allocating Containers.
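
As a rough illustration of that allocation handshake, below is a sketch of an ApplicationMaster asking the Scheduler for containers through the AMRMClient API. The resource sizes and the single allocate() call are illustrative only; a real ApplicationMaster runs inside the container YARN launched for it, heartbeats repeatedly, and eventually unregisters.

import java.util.List;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AllocateSketch {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(conf);
        rmClient.start();

        // Register this ApplicationMaster with the ResourceManager.
        rmClient.registerApplicationMaster("", 0, "");

        // Ask for a 1 GB / 1 vcore container anywhere in the cluster.
        Resource capability = Resource.newInstance(1024, 1);
        rmClient.addContainerRequest(
                new ContainerRequest(capability, null, null, Priority.newInstance(0)));

        // Heartbeat to the Scheduler; granted containers arrive asynchronously
        // over successive allocate() calls.
        List<Container> allocated = rmClient.allocate(0.0f).getAllocatedContainers();
        System.out.println("Containers granted so far: " + allocated.size());
    }
}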


YARN gives Hadoop the ability to run non-MapReduce jobs (Storm, Spark, etc.) within the Hadoop framework. YARN positions MapReduce as merely one of the application frameworks within Hadoop, and it enables Hadoop to share resources dynamically between MapReduce and other parallel processing frameworks, such as Cloudera Impala.

A Container is the basic unit of processing capacity in YARN, and is an encapsulation of resource elements (memory, CPU, etc.).
A container grants an application the right to use a specific amount of resources (e.g., memory, CPU) on a specific host. The ApplicationMaster must present the container to the NodeManager managing the host on which the container was allocated in order to use those resources for launching its tasks.
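
A hedged sketch of that last step follows, in which an ApplicationMaster hands an allocated container to the owning NodeManager via NMClient; the launch command is a placeholder for a real task command line.

import java.util.Collections;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class LaunchSketch {
    // 'container' is one of the containers granted by the ResourceManager
    // (e.g. from AMRMClient#allocate); the command below is illustrative.
    static void launch(Container container, NMClient nmClient) throws Exception {
        ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
        ctx.setCommands(Collections.singletonList(
                "echo running a task in " + container.getId()));
        // Present the allocated container to the NodeManager that owns the host.
        nmClient.startContainer(container, ctx);
    }

    public static void main(String[] args) {
        NMClient nmClient = NMClient.createNMClient();
        nmClient.init(new YarnConfiguration());
        nmClient.start();
        // launch(...) would be called for each container handed back by the Scheduler.
    }
}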

Reference:
http://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-yarn/hadoop-yarn-site/YARN.html
http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/
http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/YARN.html

http://blog.cloudera.com/blog/2013/11/migrating-to-mapreduce-2-on-yarn-for-operators/
