Installing and Running Spark on YARN

There is a lot of talk about Spark these days, and I really wanted to try it on a real cluster with a large data set, not in a VM. Unfortunately, we do not have Spark installed, since we use Hadoop 2.4 from the Hortonworks HDP 2.1.7 distribution.

I thought it might be impossible for me to install Spark for a proof of concept, since I do not have permission to deploy any packages to the cluster nodes. But thanks to YARN, nothing needs to be pre-deployed to the nodes, and as it turned out it was very easy to install and run Spark on YARN.

Here are the steps I followed to install and run Spark on my cluster.

Download Scala (Optional)

Later I realized that spark-shell does not need Scala, but maybe I will need it to compile code for spark-submit, we will see :) I downloaded the scala-2.10.4.tar archive and unpacked it to my home directory (a minimal sketch of such an application follows the commands below):

tar xvf scala-2.10.4.tar
mv scala-2.10.4 scala
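
For reference, here is a minimal sketch of what a standalone word count application for spark-submit could look like. This is just my illustration: the object name, app name, and command-line arguments are my own assumptions, not anything Spark requires:

import org.apache.spark.{SparkConf, SparkContext}

// A minimal standalone Spark application (sketch); the app name and
// command-line arguments (input path, output path) are just examples
object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)
    sc.textFile(args(0))          // read the input file from HDFS
      .flatMap(_.split(" "))      // split each line into words
      .map((_, 1))                // pair each word with a count of 1
      .reduceByKey(_ + _)         // sum the counts per word
      .saveAsTextFile(args(1))    // write the results back to HDFS
    sc.stop()
  }
}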

Note that you can also use the Spark Python shell, PySpark.
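
It can be launched on YARN the same way as spark-shell (using the --master option described below); this assumes the pyspark script that ships with the Spark binaries:

spark/bin/pyspark --master yarn-client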

Download Spark

Then I downloaded the Spark binaries pre-built for Hadoop 2.4, spark-1.2.0-bin-hadoop2.4.tgz, and also unpacked them to my home directory:

tar zxvf spark-1.2.0-bin-hadoop2.4.tgz
mv spark-1.2.0-bin-hadoop2.4 spark

Configure Spark

Create spark-env.sh from the provided template:

cp spark/conf/spark-env.sh.template spark/conf/spark-env.sh

Now edit spark/conf/spark-env.sh and specify the location of the Hadoop configuration directory and a YARN queue where you have permission to submit jobs:

export HADOOP_CONF_DIR=/etc/hadoop/conf
export SPARK_YARN_QUEUE=dev
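
If needed, you can also size the Spark executors in the same file. The values below are only an example, not a recommendation for any particular cluster:

export SPARK_EXECUTOR_INSTANCES=3   # number of YARN containers to request
export SPARK_EXECUTOR_MEMORY=2G     # memory per executor
export SPARK_EXECUTOR_CORES=2       # cores per executor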

That’s it. Now let’s run the Spark shell and do some simple data analysis.

Run Spark Shell

Now you can run spark-shell, a great tool for typing and running queries interactively. To run it on YARN, specify --master yarn-client:

spark/bin/spark-shell --master yarn-client
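
Alternatively, you can request executor resources right on the command line instead of spark-env.sh (the values here are again just an example):

spark/bin/spark-shell --master yarn-client --num-executors 3 --executor-memory 2g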

Note that spark-shell started a YARN ApplicationMaster immediately (you can see the application ID and tracking URL in its startup logs):

15/02/05 08:20:59 INFO client.RMProxy: Connecting to ResourceManager 
...
15/02/05 08:21:04 INFO impl.YarnClientImpl: Submitted application application_1418792282603_1186852
15/02/05 08:21:05 INFO yarn.Client: Application report for application_1418792282603_1186852
... tracking URL: http://hdm:8088/proxy/application_1418792282603_1186852/

If you open the tracking URL, you will see the Spark UI:

[Screenshot: the Spark UI for the running application]

Browse Data

I have a sample file in HDFS with the following content:

[dtolpeko ~]$ hadoop fs -cat hello.txt
hello world hello

In the Scala shell, create an RDD:

val file = sc.textFile("hdfs://hdm:8020/user/dmtolpeko/hello.txt")
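
You can quickly check that the RDD sees the file, for example:

file.count()    // number of lines in the file
file.first()    // the first line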

Now let’s count the number of occurrences of each word in the file:

file.flatMap(_.split(" ")).map((_,1)).reduceByKey(_ + _).collect()
res1: Array[(String, Int)] = Array((hello,2), (world,1))
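
Instead of collecting the results to the driver, you can also save them back to HDFS; the output path here is my own example, and the target directory must not already exist:

file.flatMap(_.split(" ")).map((_,1)).reduceByKey(_ + _).saveAsTextFile("hdfs://hdm:8020/user/dmtolpeko/wordcount_out")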

Do you still remember your first Word Count program in Java?
