There is a lot of talk about Spark these days, and I really wanted to try it on a real cluster with a large data set, not in a VM. Unfortunately, we do not have Spark installed, since we use Hadoop 2.4 from the Hortonworks HDP 2.1.7 distribution.
I thought it might be impossible for me to install Spark for a proof of concept, since I do not have permission to deploy packages to the cluster nodes. But thanks to YARN I do not need to pre-deploy anything to the nodes, and as it turned out, it was very easy to install and run Spark on YARN.
Here are the steps I followed to install and run Spark on my cluster.
Download Scala (Optional)
Later I realized that spark-shell does not need Scala, but maybe I will need it to compile code for spark-submit; we will see. I downloaded the scala-2.10.4.tar file and unpacked it to my home directory:
tar xvf scala-2.10.4.tar
mv scala-2.10.4 scala
Note that you can also use Spark Python shell called PySpark.
Download Spark
Then I downloaded the Spark binaries pre-built for Hadoop 2.4, spark-1.2.0-bin-hadoop2.4.tgz, and also unpacked them to my home directory:
tar zxvf spark-1.2.0-bin-hadoop2.4.tgz
mv spark-1.2.0-bin-hadoop2.4 spark
Then create spark-env.sh from the template:
cp spark/conf/spark-env.sh.template spark/conf/spark-env.sh
Edit spark/conf/spark-env.sh and specify the location of the Hadoop configuration directory and a YARN job queue where you have permission to submit jobs:
export HADOOP_CONF_DIR=/etc/hadoop/conf
export SPARK_YARN_QUEUE=dev
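If you want more control over the resources the shell requests from YARN, spark-env.sh can also set executor-related variables. The variables below are the YARN-mode environment variables from the Spark 1.x documentation; the values themselves are just illustrative examples, not recommendations for your cluster:

```shell
# Optional resource settings for YARN mode (values are illustrative)
export SPARK_EXECUTOR_INSTANCES=4    # number of executors to request
export SPARK_EXECUTOR_MEMORY=2G     # memory per executor
export SPARK_EXECUTOR_CORES=2       # cores per executor
```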
That’s it. Now let’s run Spark shell and do a simple data analysis.
Run Spark Shell
Now you can run spark-shell. This is a great tool to type and run queries interactively. To run it on YARN, specify the yarn-client master:
spark/bin/spark-shell --master yarn-client
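The same resource settings can be passed on the command line instead of spark-env.sh. These are standard spark-shell/spark-submit options in Spark 1.x; again, the values are only illustrative:

```shell
# Request 4 executors with 2 GB of memory each (illustrative values)
spark/bin/spark-shell --master yarn-client \
  --num-executors 4 \
  --executor-memory 2g
```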
spark-shell started the ApplicationMaster immediately (you can see the application ID and tracking URL in its startup logs):
15/02/05 08:20:59 INFO client.RMProxy: Connecting to ResourceManager ...
15/02/05 08:21:04 INFO impl.YarnClientImpl: Submitted application application_1418792282603_1186852
15/02/05 08:21:05 INFO yarn.Client: Application report for application_1418792282603_1186852
...
tracking URL: http://hdm:8088/proxy/application_1418792282603_1186852/
If you open the tracking URL, you will see the Spark UI:
I have a sample file in HDFS with the following content:
[dtolpeko ~]$ hadoop fs -cat hello.txt
hello world
hello
In the Scala shell, create an RDD:
val file = sc.textFile("hdfs://hdm:8020/user/dmtolpeko/hello.txt")
Now let’s count the number of occurrences of each word in the file:
file.flatMap(_.split(" ")).map((_,1)).reduceByKey(_ + _).collect()
res1: Array[(String, Int)] = Array((hello,2), (world,1))
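The flatMap/map/reduceByKey pipeline can be sketched in plain Scala on a local collection, which is a handy way to check the logic before running it on the cluster. This snippet does not touch Spark at all; groupBy plus a sum stands in for reduceByKey:

```scala
object WordCountSketch {
  def main(args: Array[String]): Unit = {
    // The same lines as in hello.txt
    val lines = Seq("hello world", "hello")

    // flatMap: split each line into words
    // map: pair each word with a count of 1
    // groupBy + sum: local stand-in for Spark's reduceByKey
    val counts = lines
      .flatMap(_.split(" "))
      .map((_, 1))
      .groupBy(_._1)
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

    println(counts.toList.sortBy(_._1))  // List((hello,2), (world,1))
  }
}
```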
Do you still remember your first Word Count program in Java?