Thursday, 12 March 2015

Spark Submit Configs on YARN


Spark Running Modes

In standalone mode, Spark uses a Master daemon which coordinates the efforts of the Workers, which run the executors. Standalone mode is the default, but it cannot be used on secure clusters.

spark-submit \
--class org.apache.spark.examples.SparkPi \
--deploy-mode client \
--master spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT \
$SPARK_HOME/examples/lib/spark-examples_version.jar 10


In YARN mode, the YARN ResourceManager performs the functions of the Spark Master. The functions of the Workers are performed by the YARN NodeManager daemons, which run the executors. YARN mode is slightly more complex to set up, but it supports security, and provides better integration with YARN’s cluster-wide resource management policies.

Multiple Spark applications can run at once. If you decide to run Spark on YARN, you can decide on an application-by-application basis whether to run in YARN client mode or cluster mode.
When you run Spark in client mode, the driver process runs locally on the client machine; in cluster mode, it runs remotely inside a YARN ApplicationMaster.
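
For example, the same application can be submitted in either mode simply by changing the master value. A minimal sketch; com.example.MyApp and my-app.jar are placeholders, not from a real project:

# client mode: the driver runs on the machine where you run spark-submit (placeholder class/jar)
spark-submit --class com.example.MyApp --master yarn-client my-app.jar

# cluster mode: the driver runs inside a YARN ApplicationMaster on the cluster
spark-submit --class com.example.MyApp --master yarn-cluster my-app.jar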




Why Run on YARN?

Using YARN as Spark's cluster manager confers a few benefits over Spark standalone and Mesos:

  • YARN allows you to dynamically share and centrally configure the same pool of cluster resources between all frameworks that run on YARN. You can throw your entire cluster at a MapReduce job, then use some of it on an Impala query and the rest on a Spark application, without any changes in configuration.
  • You can take advantage of all the features of YARN schedulers for categorizing, isolating, and prioritizing workloads (see the queue example after this list).
  • Spark standalone mode requires each application to run an executor on every node in the cluster, whereas with YARN, you choose the number of executors to use.
  • Finally, YARN is the only cluster manager for Spark that supports security. With YARN, Spark can run against Kerberized Hadoop clusters and uses secure authentication between its processes.
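
As an example of the scheduler integration, workloads can be routed to different YARN queues at submit time. A minimal sketch, assuming queues named adhoc and production are defined in your scheduler configuration (the class and jar names are placeholders):

# exploratory job to a low-priority queue (adhoc and production are assumed queue names)
spark-submit --class com.example.MyApp --master yarn-cluster --queue adhoc my-app.jar

# production job to a queue with guaranteed capacity
spark-submit --class com.example.MyApp --master yarn-cluster --queue production my-app.jar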

Launch Spark on YARN

There are two deploy modes that can be used to launch Spark applications on YARN.
In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application.
In yarn-client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
Unlike in Spark standalone and Mesos mode, in which the master’s address is specified in the “master” parameter, in YARN mode the ResourceManager’s address is picked up from the Hadoop configuration. Thus, the master parameter is simply “yarn-client” or “yarn-cluster”.
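
In practice this means spark-submit only needs to locate your Hadoop client configuration. A minimal sketch; the configuration path is an assumption, so use wherever your cluster's yarn-site.xml and core-site.xml actually live:

# point Spark at the Hadoop/YARN client configuration (example path)
export HADOOP_CONF_DIR=/etc/hadoop/conf

# no host:port in --master; the ResourceManager address is read from yarn-site.xml
spark-submit --class com.example.MyApp --master yarn-cluster my-app.jar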

To launch a Spark application in yarn-cluster mode:

$ ./bin/spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master yarn-cluster \
    --num-executors 3 \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    --queue thequeue \
    lib/spark-examples*.jar \
    10
The above starts a YARN client program which starts the default Application Master. SparkPi then runs as a child thread of the Application Master. The client periodically polls the Application Master for status updates and displays them in the console, and exits once your application has finished running.
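
While the client is polling, you can also query YARN directly for the application state. A quick sketch, with <app ID> standing in for the ID printed when the job was submitted:

yarn application -status <app ID>
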
To launch a Spark application in yarn-client mode, do the same, but replace “yarn-cluster” with “yarn-client”. To run spark-shell:
$ ./bin/spark-shell --master yarn-client

VM Overhead

You may see YARN killing the containers in which your executors run. This is usually caused by runaway allocation of off-heap memory that is not accounted for in the JVM heap settings, so the container exceeds the memory limit that YARN enforces for it.

Spark has a setting, spark.yarn.executor.memoryOverhead, that reserves extra container memory for the executor's VM overhead (off-heap allocations and other native overheads).
A quick fix is to raise this setting in your Spark job, as in the sketch below; a common suggestion is 15-20% of the executor memory.
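
For example, with 12g executors you might reserve roughly 2g of overhead. A minimal sketch; the exact value (in megabytes) is an assumption you would tune for your job, and the class and jar names are placeholders:

# ~2g overhead for a 12g executor (value is in MB; tune for your workload)
spark-submit --class com.example.MyApp --master yarn-cluster \
    --executor-memory 12g \
    --conf spark.yarn.executor.memoryOverhead=2048 \
    my-app.jar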

Add Other Jars
In yarn-cluster mode, the driver runs on a different machine than the client, so SparkContext.addJar won’t work out of the box with files that are local to the client. To make files on the client available to SparkContext.addJar, include them with the --jars option in the launch command.
$ ./bin/spark-submit --class my.main.Class \
    --master yarn-cluster \
    --jars my-other-jar.jar,my-other-other-jar.jar \
    my-main-jar.jar \
    app_arg1 app_arg2


A complete example:


$ sudo -u username /opt/spark-1.4/bin/spark-submit \
--class com.company.className --master yarn-cluster \
--num-executors 4 --executor-memory 12g --driver-memory 4g \
--driver-java-options "-XX:MaxPermSize=2G -Dconfig.resource=/dev.conf" \
--executor-cores 32 \
--files /path/hive-site.xml \
--jars mysql-connector-java-5.1.10.jar \
/opt/path/latest/libs/app-assembly-0.1.jar --arg1 jobName

The above command supports using HiveContext in the Spark application, since hive-site.xml is shipped with the job.
The --files and --jars arguments provide the files and libraries that are distributed along with your job. These include third-party libraries and any others that you use (import/include) in your code.
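
The same idea works for an interactive session. A minimal sketch, reusing the hive-site.xml and JDBC driver paths from the example above, to test Hive access from spark-shell:

$ /opt/spark-1.4/bin/spark-shell --master yarn-client \
    --files /path/hive-site.xml \
    --jars mysql-connector-java-5.1.10.jar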

Debug your App


In YARN terminology, executors and application masters run inside “containers”. YARN has two modes for handling container logs after an application has completed. If log aggregation is turned on (with the yarn.log-aggregation-enable config), container logs are copied to HDFS and deleted on the local machine. These logs can be viewed from anywhere on the cluster with the “yarn logs” command.
yarn logs -applicationId <app ID> will print out the contents of all log files from all containers from the given application.

yarn logs -applicationId application_1430770131521_1545 | less
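
If you don't have the application ID handy, you can list applications first (depending on your Hadoop version, finished applications may need an -appStates filter):

# find the application ID, then fetch its aggregated logs as above
yarn application -list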


Two more example submissions, first against a standalone master and then in yarn-cluster mode:

sudo -u myid /opt/spark-1.5/bin/spark-submit \
    --class com.company.MyJob \
    --master spark://ipaddress:7077 \
    --total-executor-cores 12 --executor-cores 2 \
    --executor-memory 3g --driver-memory 4g \
    --driver-java-options "-XX:MaxPermSize=2048m -Dconfig.resource=/sys.conf" \
    /path/job.jar --configFile /path/myConfig.conf --env dev

sudo -u myid /opt/spark-1.5/bin/spark-submit \
    --class com.company.MyJob \
    --master yarn-cluster --queue myqueue \
    --num-executors 8 --executor-memory 12g --driver-memory 4g --executor-cores 8 \
    --driver-java-options "-XX:MaxPermSize=2048m -Dconfig.resource=/sys.conf" \
    --files /path/myConfig.conf \
    /path/my.jar --env dev --configFile myConfig.conf

Reference:

http://spark.apache.org/docs/latest/running-on-yarn.html
http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/

Comments:


  • I'm trying to submit a Spark app from my local machine's terminal to my cluster, using --master yarn-cluster. I need the driver program to run on the cluster too, not on the machine I submit from (i.e. my local machine).

    I'm using:

    bin/spark-submit \
        --class com.my.application.XApp \
        --master yarn-cluster --executor-memory 100m \
        --num-executors 50 \
        hdfs://name.node.server:8020/user/root/x-service-1.0.0-201512141101-assembly.jar \
        1000

    and getting the error:

    Diagnostics: java.io.FileNotFoundException: File file:/Users/nish1013/Dev/spark-1.4.1-bin-hadoop2.6/lib/spark-assembly-1.4.1-hadoop2.6.0.jar does not exist

    I can see in my service list that YARN + MapReduce2 2.7.1.2.3 and Spark 1.4.1.2.3 are already installed.

    My spark-env.sh on the local machine has:

    export HADOOP_CONF_DIR=/Users/nish1013/Dev/hadoop-2.7.1/etc/hadoop

    Has anyone encountered something similar before?
