Thursday 12 March 2015

Install Spark 1.2 on HDP 2.2 and Run Spark on YARN

1. Use wget to download the Spark tarball:

$wget http://public-repo-1.hortonworks.com/HDP-LABS/Projects/spark/1.2.0/spark-1.2.0.2.2.0.0-82-bin-2.6.0.2.2.0.0-2041.tgz


2. Copy the downloaded Spark tarball to your Hadoop cluster.
$scp spark-1.2.0.2.2.0.0-82-bin-2.6.0.2.2.0.0-2041.tgz root@127.0.0.1:/root

3. Set up the environment

1. Set the environment variable: export YARN_CONF_DIR=/etc/hadoop/conf
2. Create the file SPARK_HOME/conf/spark-defaults.conf and add the following settings (a quick way to verify them from the Scala shell is shown right after this list):
  • spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0-2041
  • spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041
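
These two options pass the HDP version into the driver and Application Master JVMs so that YARN can resolve HDP's classpath placeholders. Once a spark-shell is running (see step 5), one quick way to confirm that the values were picked up from spark-defaults.conf is to read them back from the Spark configuration:

// Sanity check from the Scala REPL: print the options loaded from spark-defaults.conf
println(sc.getConf.get("spark.driver.extraJavaOptions"))
println(sc.getConf.get("spark.yarn.am.extraJavaOptions"))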
4. Run the Spark Pi Example

[root@sandbox spark-1.2.0.2.2.0.0-82-bin-2.6.0.2.2.0.0-2041]# ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10
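
For context, the bundled SparkPi example estimates Pi by Monte Carlo sampling: it scatters random points over the unit square and uses the fraction that lands inside the unit circle. The sketch below is an illustrative reconstruction of that idea, not the exact example source; the trailing 10 in the command above is the number of slices the sampling is parallelized over.

import scala.math.random
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative sketch of what the SparkPi example computes.
object SparkPiSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SparkPiSketch"))
    val slices = if (args.length > 0) args(0).toInt else 2   // the "10" passed above
    val n = 100000 * slices
    // Sample n random points in [-1, 1] x [-1, 1] and count those inside the unit circle.
    val count = sc.parallelize(1 to n, slices).map { _ =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x * x + y * y < 1) 1 else 0
    }.reduce(_ + _)
    // The circle-to-square area ratio is Pi/4, so Pi is roughly 4 * hits / n.
    println("Pi is roughly " + 4.0 * count / n)
    sc.stop()
  }
}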

5. Run Spark WordCount

Copy the input data to HDFS, then start the shell:
$hadoop fs -copyFromLocal /etc/hadoop/conf/log4j.properties /tmp/data
$./bin/spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m

At the Scala REPL, run:

// Read the file copied to HDFS above (replace hdfsip with the NameNode host)
val file = sc.textFile("hdfs://hdfsip:8020/tmp/data")
// Classic word count: split each line on spaces, emit (word, 1), then sum per word
val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
// Save the counts back to HDFS and print them to the console
counts.saveAsTextFile("hdfs://hdfsip:8020/tmp/wordcount")
counts.collect().foreach(println)
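
To double-check the result, the saved output can be read back in the same shell (this simply reuses the hdfsip placeholder from above):

// Print a few of the saved (word, count) pairs
sc.textFile("hdfs://hdfsip:8020/tmp/wordcount").take(10).foreach(println)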


6. Use Spark Job History Server

1.  Add the history service settings to SPARK_HOME/conf/spark-defaults.conf:
spark.yarn.services                org.apache.spark.deploy.yarn.history.YarnHistoryService
spark.history.provider             org.apache.spark.deploy.yarn.history.YarnHistoryProvider
spark.yarn.historyServer.address   localhost:18080

2.  Start or stop the Spark History Server:
$./sbin/start-history-server.sh
$./sbin/stop-history-server.sh

7. Install gfortran for MLlib
$ sudo yum install gcc-gfortran

Otherwise, MLlib routines that depend on native BLAS through jblas fail with:
java.lang.UnsatisfiedLinkError:  
org.jblas.NativeBlas.dposv(CII[DII[DII)I  
    at org.jblas.NativeBlas.dposv(Native Method)  
    at org.jblas.SimpleBlas.posv(SimpleBlas.java:369)  
    at org.jblas.Solve.solvePositive(Solve.java:68)
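
With gfortran installed, a small MLlib job that exercises the jblas native-solver path is a convenient way to verify the fix. The snippet below is only a sketch with made-up ratings, meant to be pasted into the spark-shell started earlier; without the native library, the ALS.train call is where an UnsatisfiedLinkError like the one above would typically surface.

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// A tiny, made-up ratings set just to exercise the native BLAS code path.
val ratings = sc.parallelize(Seq(
  Rating(1, 1, 5.0), Rating(1, 2, 1.0),
  Rating(2, 1, 4.0), Rating(2, 3, 2.0),
  Rating(3, 2, 3.0), Rating(3, 3, 5.0)))

// rank = 5 latent factors, 10 iterations, regularization lambda = 0.01
val model = ALS.train(ratings, 5, 10, 0.01)

// Predict one (user, product) rating to make sure the model is usable.
println(model.predict(1, 3))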

References:
http://hortonworks.com/hadoop-tutorial/using-apache-spark-hdp/
http://spark.apache.org/docs/latest/running-on-yarn.html
