Spark on YARN Submit Errors on Hortonworks

When you start Spark on YARN using the Spark shell as follows:

spark/bin/spark-shell --master yarn-client

you may get the following errors on Hortonworks:

...
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

:10: error: not found: value sqlContext
      import sqlContext.implicits._
:10: error: not found: value sqlContext
       import sqlContext.sql

Additionally, when you open the Application Master log, you can see:

Log Type: stderr
Log Upload Time: Tue Nov 17 06:59:35 -0800 2015
Log Length: 87
Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

To solve this issue, edit spark/conf/spark-defaults.conf (create it from spark-defaults.conf.template if it does not exist) and specify:

spark.driver.extraJavaOptions -Dhdp.version=current
spark.yarn.am.extraJavaOptions -Dhdp.version=current
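
Alternatively, you can pass the same settings on the command line with --conf when launching the shell, which should be equivalent to editing the file:

spark/bin/spark-shell --master yarn-client \
  --conf spark.driver.extraJavaOptions=-Dhdp.version=current \
  --conf spark.yarn.am.extraJavaOptions=-Dhdp.version=current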

In my case this fixed the problem: the Spark shell launched successfully, and I could see the command prompt:

15/11/17 07:37:05 INFO repl.SparkILoop: Created sql context (with Hive support)..
SQL context available as sqlContext.

scala>
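
To quickly confirm that the new sqlContext works, you can run a trivial query right at the prompt (this assumes Hive support, which the log above indicates):

// Listing the Hive tables; any output, even an empty list,
// shows the SQL context was created correctly.
sqlContext.sql("SHOW TABLES").show()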

I used Spark 1.5.2 and HDP 2.2.4.8.

Installing and Running Spark on YARN

There is a lot of talk about Spark these days, and I really wanted to try it on a real cluster with a large data set, not in a VM. Unfortunately, we do not have Spark installed since we use Hadoop 2.4 from the Hortonworks 2.1.7 distribution.

I thought it might be impossible for me to install Spark for a proof of concept since I do not have permission to deploy any packages to the cluster nodes. Thanks to YARN, I do not need to pre-deploy anything to the nodes, and as it turned out, it was very easy to install and run Spark on YARN.

Here are the steps I followed to install and run Spark on my cluster.

Download Scala (Optional)

Later I realized that spark-shell does not need a separate Scala installation, but maybe I will need it to compile code for spark-submit; we will see :) I downloaded the scala-2.10.4.tar file and unpacked it to my home directory:

tar xvf scala-2.10.4.tar
mv scala-2.10.4 scala
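
To verify the unpacked distribution, you can print its version:

scala/bin/scala -version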

Note that you can also use the Spark Python shell, PySpark.
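
For example, once Spark is unpacked (next step), PySpark starts the same way as the Scala shell:

spark/bin/pyspark --master yarn-client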

Download Spark

Then I downloaded the Spark binaries pre-built for Hadoop 2.4 (spark-1.2.0-bin-hadoop2.4.tgz) and also unpacked them to my home directory:

tar zxvf spark-1.2.0-bin-hadoop2.4.tgz
mv spark-1.2.0-bin-hadoop2.4 spark
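
Before involving YARN at all, you can sanity-check the unpacked binaries by starting the shell in local mode:

spark/bin/spark-shell --master local[2]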

Configure Spark

Copy spark/conf/spark-env.sh.template to create spark-env.sh:

cp spark/conf/spark-env.sh.template spark/conf/spark-env.sh

Now edit spark/conf/spark-env.sh and specify the location of the Hadoop configuration directory and a YARN queue where you have permission to submit jobs:

export HADOOP_CONF_DIR=/etc/hadoop/conf
export SPARK_YARN_QUEUE=dev

That’s it. Now let’s run the Spark shell and do some simple data analysis.
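
As a quick preview, here is a minimal word count you could run from the shell. It is only a sketch: it assumes a hypothetical text file at /tmp/sample.txt on HDFS that you have permission to read.

// sc is the SparkContext that spark-shell creates for you.
// Replace /tmp/sample.txt with any file you can read.
val lines = sc.textFile("/tmp/sample.txt")
val counts = lines.flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.take(10).foreach(println)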
