Install Apache Spark on Multi-Node Cluster

DataFlair Team

8 years ago

1. Objective

This Spark tutorial explains how to install Apache Spark on a multi-node cluster. It provides step-by-step instructions to deploy and configure Apache Spark on a real multi-node cluster. Once the setup and installation are done, you can start using Spark to process data.

2. Steps to install Apache Spark on multi-node cluster

Follow the steps given below to easily install Apache Spark on a multi-node cluster.

i. Recommended Platform

OS – Linux is supported as a development and deployment platform. You can use Ubuntu 14.04 / 16.04 or later (you can also use other Linux flavors like CentOS, Red Hat, etc.). Windows is supported as a dev platform. (If you are new to Linux, follow this Linux commands manual.)
Spark – Apache Spark 2.x

To install Apache Spark on a multi-node cluster you need multiple nodes. You can either use Amazon AWS or follow this guide to set up a virtual platform using VMware Player.

ii. Install Spark on Master

a. Prerequisites

Add Entries in hosts file

Edit hosts file
[php]sudo nano /etc/hosts[/php]
Now add entries for the master and the slaves:
[php]MASTER-IP master
SLAVE01-IP slave01
SLAVE02-IP slave02[/php]
(NOTE: In place of MASTER-IP, SLAVE01-IP, SLAVE02-IP put the value of the corresponding IP)
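If you prefer to script this step, the sketch below appends each entry only when the hostname is not already mapped, so it can be re-run safely. The function name add_host_entry, the HOSTS_FILE variable, and the 192.168.1.x addresses are placeholders chosen for illustration, not part of the original setup. By default it writes to a scratch file; on a real node you would point HOSTS_FILE at /etc/hosts and run it with sudo.

```shell
# Append "IP hostname" to a hosts-style file unless the hostname is already mapped.
# HOSTS_FILE defaults to a scratch file so the sketch is safe to try out.
HOSTS_FILE="${HOSTS_FILE:-$(mktemp)}"

add_host_entry() (
    ip="$1"; name="$2"
    # Look for the hostname as a whole word on a non-comment line.
    if grep -Eq "^[^#]*[[:space:]]${name}([[:space:]]|\$)" "$HOSTS_FILE"; then
        echo "skip: $name already present"
    else
        printf '%s %s\n' "$ip" "$name" >> "$HOSTS_FILE"
        echo "added: $ip $name"
    fi
)

# Placeholder addresses; substitute the real IPs of your nodes.
add_host_entry 192.168.1.10 master
add_host_entry 192.168.1.11 slave01
add_host_entry 192.168.1.12 slave02
add_host_entry 192.168.1.10 master   # repeated call leaves the file unchanged
```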

Install Java 7 (Recommended Oracle Java)

[php]sudo apt-get install python-software-properties
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java7-installer[/php]

Install Scala

[php]sudo apt-get install scala[/php]

Configure SSH

Install Open SSH Server-Client

[php]sudo apt-get install openssh-server openssh-client[/php]

Generate Key Pairs

[php]ssh-keygen -t rsa -P ""[/php]

Configure passwordless SSH

Append the content of .ssh/id_rsa.pub (of master) to .ssh/authorized_keys (of all the slaves as well as master)

Verify by connecting over SSH to all the Slaves
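As a rough illustration of what this step does on each node, the sketch below appends a public key to an authorized_keys file only if it is missing, then tightens the permissions. The function name authorize_key and the dummy key line are made up for the demo, which runs entirely inside a scratch directory.

```shell
# Append a public key to <ssh_dir>/authorized_keys if it is not already there,
# then restrict permissions on the directory and the file.
authorize_key() (
    pubkey_file="$1"; ssh_dir="$2"
    mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir"
    touch "$ssh_dir/authorized_keys"
    # -x: match the whole line, -F: treat the key as a fixed string
    grep -qxF "$(cat "$pubkey_file")" "$ssh_dir/authorized_keys" \
        || cat "$pubkey_file" >> "$ssh_dir/authorized_keys"
    chmod 600 "$ssh_dir/authorized_keys"
)

# Demo in a scratch directory with a dummy key line (not a real key).
demo_dir="$(mktemp -d)"
echo "ssh-rsa DUMMYKEYDATA user@master" > "$demo_dir/id_rsa.pub"
authorize_key "$demo_dir/id_rsa.pub" "$demo_dir/ssh"
authorize_key "$demo_dir/id_rsa.pub" "$demo_dir/ssh"   # second call adds nothing
```

On a real cluster, running ssh-copy-id against each host from the master (for example ssh-copy-id slave01) performs an equivalent append over SSH, assuming password login is still enabled at that point.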

[php]ssh slave01
ssh slave02[/php]

b. Install Spark

Download Spark

You can download the latest version of Spark from http://spark.apache.org/downloads.html.

Untar Tarball

[php]tar xzf spark-2.0.0-bin-hadoop2.6.tgz[/php]
(Note: All the scripts, jars, and configuration files are available in newly created directory “spark-2.0.0-bin-hadoop2.6”)

Setup Configuration

Edit .bashrc

Now edit the .bashrc file located in the user’s home directory and add the following environment variables:
[php]export JAVA_HOME=<path-of-Java-installation>   # e.g. /usr/lib/jvm/java-7-oracle/
export SPARK_HOME=<path-to-the-root-of-your-spark-installation>   # e.g. /home/dataflair/spark-2.0.0-bin-hadoop2.6/
export PATH=$PATH:$SPARK_HOME/bin[/php]
(Note: After this step, restart the Terminal/PuTTY session so that the environment variables take effect)
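Once the new terminal is open, a quick sanity check can confirm that SPARK_HOME points at an unpacked Spark directory. The sketch below is one possible check; the function name check_spark_home is hypothetical, and it only looks for two scripts that ship with the Spark distribution. The demo builds a fake directory layout so it can run anywhere.

```shell
# Report whether a directory looks like an unpacked Spark distribution
# by checking for two scripts that the tutorial relies on.
check_spark_home() (
    dir="$1"
    for f in bin/spark-shell sbin/start-all.sh; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f"
            exit 1   # leaves the subshell, so the function returns 1
        fi
    done
    echo "ok: $dir"
)

# Demo against a scratch directory that imitates the layout.
demo_home="$(mktemp -d)"
mkdir -p "$demo_home/bin" "$demo_home/sbin"
touch "$demo_home/bin/spark-shell" "$demo_home/sbin/start-all.sh"
check_spark_home "$demo_home"
```

On the master you would call it as check_spark_home "$SPARK_HOME" instead of using the scratch directory.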

Edit spark-env.sh

First create spark-env.sh from its template (run this in $SPARK_HOME/conf/):
[php]cp spark-env.sh.template spark-env.sh[/php]
Now edit spark-env.sh and set the following parameters:
[php]export JAVA_HOME=<path-of-Java-installation>   # e.g. /usr/lib/jvm/java-7-oracle/
export SPARK_WORKER_CORES=8[/php]
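The value 8 suits a machine with eight cores. Because spark-env.sh is an ordinary shell script, the core count can also be detected instead of hard-coded. The following is a minimal sketch of that idea; the helper name detect_cores is an assumption for illustration, and it relies on nproc (GNU coreutils) or getconf being available on the node.

```shell
# Detect the number of CPU cores on this node; fall back to 1 if detection fails.
detect_cores() {
    if command -v nproc >/dev/null 2>&1; then
        nproc
    else
        getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
    fi
}

SPARK_WORKER_CORES="$(detect_cores)"
export SPARK_WORKER_CORES
echo "SPARK_WORKER_CORES=$SPARK_WORKER_CORES"
```

Giving Spark every core leaves none for the operating system and other services, so you may want to set a lower number by hand on shared machines.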

Add Slaves

Create the configuration file slaves (in $SPARK_HOME/conf/) and add the following entries:
[php]slave01
slave02[/php]
“Apache Spark has now been installed successfully on the Master. Next, deploy Spark on all the Slaves.”

iii. Install Spark On Slaves

a. Setup Prerequisites on all the slaves

Run the following steps from the Prerequisites section above (ii. a) on all the slaves (or worker nodes):

“Add Entries in hosts file”
“Install Java 7”
“Install Scala”

b. Copy setups from master to all the slaves

Create tarball of configured setup

[php]tar czf spark.tar.gz spark-2.0.0-bin-hadoop2.6[/php]
NOTE: Run this command on Master

Copy the configured tarball on all the slaves

[php]scp spark.tar.gz slave01:~
scp spark.tar.gz slave02:~[/php]
NOTE: Run these commands on Master

c. Un-tar configured spark setup on all the slaves

[php]tar xzf spark.tar.gz[/php]
NOTE: Run this command on all the slaves
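With more than a couple of slaves, the copy and un-tar steps can be driven from the master by looping over the slaves file. The sketch below is one way to do that under the assumptions of this tutorial (passwordless SSH works and the tarball is unpacked in each slave's home directory). The names run, deploy_to_slaves, and DRY_RUN are illustrative. With DRY_RUN=1, the default here, it only prints the commands it would execute.

```shell
# Print (DRY_RUN=1) or execute (DRY_RUN=0) a command.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Copy the tarball to every host in the slaves file and unpack it there.
deploy_to_slaves() {
    slaves_file="$1"; tarball="$2"
    # Skip blank lines and comment lines in the slaves file.
    for host in $(grep -Ev '^[[:space:]]*(#|$)' "$slaves_file"); do
        run scp "$tarball" "$host:~"
        run ssh "$host" "tar xzf $tarball"
    done
}

# Demo with a scratch slaves file; on the master you would pass
# "$SPARK_HOME/conf/slaves" instead.
demo_slaves="$(mktemp)"
printf 'slave01\nslave02\n' > "$demo_slaves"
deploy_to_slaves "$demo_slaves" spark.tar.gz
```

After reviewing the printed commands, running the script again with DRY_RUN=0 would perform the actual copy and extraction.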
“Congratulations, Apache Spark has been installed on all the Slaves. Now start the daemons on the cluster.”

iv. Start Spark Cluster

a. Start Spark Services

[php]sbin/start-all.sh[/php]
Note: Run this command on Master, from the $SPARK_HOME directory

b. Check whether services have been started

Check daemons on Master

[php]jps[/php]
The output should list a Master process.

Check daemons on Slaves

[php]jps[/php]
The output should list a Worker process.
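To check the daemons from a script rather than by eye, you can filter the jps output for the expected process name. The sketch below assumes the usual jps format of one "pid name" pair per line. The helper name has_daemon and the sample PIDs are invented for the demo, which uses canned text in place of a real jps call.

```shell
# Succeed if jps-style output on stdin lists the given daemon name.
# jps prints one "<pid> <name>" pair per line.
has_daemon() {
    awk -v name="$1" '$2 == name { found = 1 } END { exit !found }'
}

# Sample output stands in for a real `jps` call; the PIDs are made up.
sample_master_jps='4321 Master
4400 Jps'

if printf '%s\n' "$sample_master_jps" | has_daemon Master; then
    echo "Master is running"
else
    echo "Master is NOT running"
fi
```

On a live node the same check would read jps | has_daemon Master on the master and jps | has_daemon Worker on each slave.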

v. Spark Web UI

a. Spark Master UI

Open the Spark Master UI in a browser to see the worker nodes, running applications, and cluster resources:
http://MASTER-IP:8080/

b. Spark application UI

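While an application is running, its UI is served by default on port 4040 of the node where the application's driver was launched (MASTER-IP if you launch it from the master):
http://MASTER-IP:4040/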

vi. Stop the Cluster

a. Stop Spark Services

Once all the applications have finished, you can stop the Spark services (master and slave daemons) running on the cluster:
[php]sbin/stop-all.sh[/php]
Note: Run this command on Master, from the $SPARK_HOME directory
After installing Apache Spark, I recommend learning about Spark RDD, DataFrame, and Dataset. You can then move on to Spark shell commands to experiment with Spark.

That covers how to install Apache Spark. We hope you found the explanation helpful.

3. Conclusion – Install Apache Spark

After installing Apache Spark on the multi-node cluster, you are ready to work with the Spark platform. You can now load data, create RDDs, run operations on those RDDs across multiple nodes, and much more.
If you have any questions about installing Apache Spark, feel free to share them with us. We will be happy to answer them.