Thursday, 18 June 2015

Configuring Resource Usage in Spark

The Standalone cluster manager spreads each application across the maximum number of executors by default. For example, suppose you have a 20-node cluster with 4-core machines, and you submit an application with --executor-memory 1G and --total-executor-cores 8. Spark will then launch eight executors, each with 1 GB of RAM, on different machines. Spark does this by default to give applications a chance to achieve data locality for distributed filesystems running on the same machines (e.g., HDFS), because these systems typically have data spread out across all nodes.
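For reference, a submission matching that scenario might look like the following (the master URL and the application JAR name here are placeholders, not values tied to this example cluster):

spark-submit \
  --master spark://masterhost:7077 \
  --executor-memory 1G \
  --total-executor-cores 8 \
  yourapp.jar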
If you prefer, you can instead ask Spark to consolidate executors on as few nodes as possible by setting the config property spark.deploy.spreadOut to false in conf/spark-defaults.conf. In this case, the preceding application would get only two executors, each with 1 GB of RAM and four cores.
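As a minimal sketch, the relevant line in conf/spark-defaults.conf would be:

# Consolidate executors onto as few nodes as possible
spark.deploy.spreadOut false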

Another cluster manager option is Hadoop YARN. Using YARN in Spark is straightforward: you set an environment variable that points to your Hadoop configuration directory, then submit jobs to a special master URL with spark-submit.

# Point Spark at the directory containing your Hadoop/YARN config files
export HADOOP_CONF_DIR="..."
spark-submit --master yarn yourapp

As with the Standalone cluster manager, there are two modes for connecting your application to the cluster: client mode, where the driver program runs on the machine you submitted the application from, and cluster mode, where the driver itself runs inside a YARN container, just as the executors do. You can choose the mode via the --deploy-mode argument to spark-submit, as in the sketch below.
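For example, a cluster-mode submission that also pins down resource usage on YARN might look like this (the application JAR name and the specific numbers are placeholders, not values from the text above):

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 8 \
  --executor-memory 1G \
  --executor-cores 1 \
  yourapp.jar

Note that on YARN you request a fixed number of executors with --num-executors and cores per executor with --executor-cores, rather than a total core count as in Standalone mode.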
