Install Apache Spark on Multi-Node Cluster


1. Objective

This Spark tutorial explains how to install Apache Spark on a multi-node cluster. It provides step-by-step instructions to deploy and configure Apache Spark on a real multi-node cluster. Once the installation and setup are complete, you can start using Spark to process data.

2. Steps to install Apache Spark on multi-node cluster

Follow the steps given below to easily install Apache Spark on a multi-node cluster.

i. Recommended Platform

OS – Linux is supported as both a development and a deployment platform. You can use Ubuntu 14.04 / 16.04 or later, or other Linux distributions such as CentOS or Red Hat. Windows is supported as a development platform only. (If you are new to Linux, follow this Linux commands manual.)
Spark – Apache Spark 2.x

To install Apache Spark on a multi-node cluster, we need multiple nodes. You can either use Amazon AWS or follow this guide to set up a virtual platform using VMware Player.

ii. Install Spark on Master

a. Prerequisites

Add Entries in hosts file

Edit hosts file
[php]sudo nano /etc/hosts[/php]
Now add entries for the master and the slaves:
[php]MASTER-IP master
SLAVE01-IP slave01
SLAVE02-IP slave02[/php]
(NOTE: In place of MASTER-IP, SLAVE01-IP, SLAVE02-IP put the value of the corresponding IP)
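
The hosts entries above can also be added with a small script, so that running it twice does not create duplicate lines. The following is only a sketch: the 192.0.2.x addresses (a documentation-only range) and the HOSTS_FILE variable are assumptions for illustration. By default it writes to a scratch file; point HOSTS_FILE at /etc/hosts and run it with sudo only after substituting your real IPs.

```shell
#!/usr/bin/env bash
# Sketch: add master/slave host entries only if they are missing.
# HOSTS_FILE defaults to a scratch file so the sketch is safe to try;
# the IPs are placeholders, not real cluster addresses.
HOSTS_FILE="${HOSTS_FILE:-$(mktemp)}"

add_host_entry() {
  local ip="$1" name="$2"
  # Skip when a line already maps this hostname.
  if ! grep -qE "[[:space:]]${name}([[:space:]]|\$)" "$HOSTS_FILE"; then
    printf '%s %s\n' "$ip" "$name" >> "$HOSTS_FILE"
  fi
}

add_host_entry 192.0.2.10 master
add_host_entry 192.0.2.11 slave01
add_host_entry 192.0.2.12 slave02
add_host_entry 192.0.2.11 slave01   # second call is a no-op

cat "$HOSTS_FILE"
```

The same entries must exist on every node, so the script would be run once per machine.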

Install Java 7 (Oracle Java recommended)

(Note: Java 7 works with Spark 2.0 and 2.1; Spark 2.2 and later require Java 8 or newer.)

[php]sudo apt-get install python-software-properties
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java7-installer[/php]

Install Scala

[php]sudo apt-get install scala[/php]

Configure SSH

Install OpenSSH Server and Client

[php]sudo apt-get install openssh-server openssh-client[/php]

Generate Key Pairs

[php]ssh-keygen -t rsa -P ""[/php]

Configure passwordless SSH

Copy the content of .ssh/id_rsa.pub (of master) to .ssh/authorized_keys (of all the slaves as well as master)
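
A minimal sketch of this copy step follows. For a remote node, the standard OpenSSH helper ssh-copy-id (for example, ssh-copy-id slave01) performs the append over SSH. The function below shows what that append amounts to; SSH_DIR and the dummy key string are hypothetical stand-ins so the sketch can be tried without touching a real ~/.ssh directory.

```shell
#!/usr/bin/env bash
# Sketch: append a public key to authorized_keys once, with the
# restrictive permissions sshd commonly requires.
# SSH_DIR and the key text are placeholders for illustration.
SSH_DIR="${SSH_DIR:-$(mktemp -d)}"
PUB_KEY="ssh-rsa AAAAB3NzaExampleOnly user@master"   # dummy key, not real

authorize_key() {
  local key="$1" auth="$SSH_DIR/authorized_keys"
  mkdir -p "$SSH_DIR" && chmod 700 "$SSH_DIR"
  touch "$auth" && chmod 600 "$auth"
  # -x matches the whole line, -F treats the key as a fixed string.
  grep -qxF "$key" "$auth" || printf '%s\n' "$key" >> "$auth"
}

authorize_key "$PUB_KEY"
authorize_key "$PUB_KEY"   # repeated call does not duplicate the key

# On a real cluster, from the master:
#   ssh-copy-id slave01
#   ssh-copy-id slave02
```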

Check by connecting over SSH to all the slaves

[php]ssh slave01
ssh slave02[/php]

b. Install Spark

Download Spark

You can download the latest version of Spark from http://spark.apache.org/downloads.html.

Untar Tarball

[php]tar xzf spark-2.0.0-bin-hadoop2.6.tgz[/php]
(Note: All the scripts, jars, and configuration files are available in newly created directory “spark-2.0.0-bin-hadoop2.6”)

Setup Configuration

Edit .bashrc

Now edit the .bashrc file located in the user’s home directory and add the following environment variables:
[php]export JAVA_HOME=<path-of-Java-installation>   # e.g. /usr/lib/jvm/java-7-oracle/
export SPARK_HOME=<path-to-the-root-of-your-spark-installation>   # e.g. /home/dataflair/spark-2.0.0-bin-hadoop2.6/
export PATH=$PATH:$SPARK_HOME/bin[/php]
(Note: After this step, restart the terminal/PuTTY session, or run source ~/.bashrc, so that the environment variables take effect.)

Edit spark-env.sh

Now edit the configuration file spark-env.sh (in $SPARK_HOME/conf/). It does not exist by default, so first create it from the bundled template:
[php]cp $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh[/php]
Then set the following parameters in it:
[php]export JAVA_HOME=<path-of-Java-installation>   # e.g. /usr/lib/jvm/java-7-oracle/
export SPARK_WORKER_CORES=8[/php]
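
As an illustration, a fuller spark-env.sh might look like the sketch below. SPARK_MASTER_HOST, SPARK_MASTER_WEBUI_PORT and SPARK_WORKER_MEMORY are standard standalone-mode variables, but every value shown is an assumption to adapt to your machines. The script writes into a scratch directory (the hypothetical CONF_DIR); on a real node the target would be $SPARK_HOME/conf.

```shell
#!/usr/bin/env bash
# Sketch: generate a spark-env.sh with a few standalone-mode settings.
# CONF_DIR defaults to a scratch directory; all values are illustrative.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"

cat > "$CONF_DIR/spark-env.sh" <<'EOF'
export JAVA_HOME=/usr/lib/jvm/java-7-oracle/   # example path from this guide
export SPARK_MASTER_HOST=master                # hostname from /etc/hosts
export SPARK_MASTER_WEBUI_PORT=8080            # master web UI port
export SPARK_WORKER_CORES=8                    # cores each worker may use
export SPARK_WORKER_MEMORY=4g                  # memory each worker may use
EOF

# Source the file in a subshell to confirm it parses and sets the values.
( . "$CONF_DIR/spark-env.sh" && echo "worker cores: $SPARK_WORKER_CORES" )
```

Sourcing in a subshell keeps the variables out of the current session while still catching syntax mistakes before the file is copied to the workers.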

Add Slaves

Create configuration file slaves (in $SPARK_HOME/conf/) and add following entries:
[php]slave01
slave02[/php]
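
The slaves file is a separate plain-text file with one worker hostname per line; it is not part of spark-env.sh. A sketch of creating it is shown below, using a scratch directory (the hypothetical CONF_DIR) in place of $SPARK_HOME/conf. As far as I know, recent Spark 3.x releases name this file workers instead, so check which template ships in your conf directory.

```shell
#!/usr/bin/env bash
# Sketch: write the list of worker hostnames, one per line.
# CONF_DIR stands in for $SPARK_HOME/conf.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"

printf '%s\n' slave01 slave02 > "$CONF_DIR/slaves"

# Show the effective entries, skipping comments and blank lines.
grep -vE '^[[:space:]]*(#|$)' "$CONF_DIR/slaves"
```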
“Apache Spark has been installed successfully on the master. Now deploy Spark on all the slaves.”

iii. Install Spark On Slaves

a. Setup Prerequisites on all the slaves

Run the following steps from section ii.a (Prerequisites) on all the slaves (or worker nodes):

“Add Entries in hosts file”
“Install Java 7”
“Install Scala”

b. Copy setups from master to all the slaves

Create tarball of configured setup

[php]tar czf spark.tar.gz spark-2.0.0-bin-hadoop2.6[/php]
NOTE: Run this command on Master

Copy the configured tarball on all the slaves

[php]scp spark.tar.gz slave01:~[/php]
NOTE: Run this command on Master
[php]scp spark.tar.gz slave02:~[/php]
NOTE: Run this command on Master
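
With more than a couple of slaves, the copy can be driven from the slaves file instead of typing one scp per host. This is a sketch under assumptions: the DRY_RUN and SLAVES_FILE variables and the copy_to_slaves function are hypothetical names, and dry-run mode (the default here) only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch: copy the configured tarball to every host in a slaves file.
# DRY_RUN=1 (the default here) only prints the scp commands.
DRY_RUN="${DRY_RUN:-1}"
SLAVES_FILE="${SLAVES_FILE:-$(mktemp)}"
# Demo content when the file is empty; a real run would use
# SLAVES_FILE=$SPARK_HOME/conf/slaves.
[ -s "$SLAVES_FILE" ] || printf '%s\n' slave01 slave02 > "$SLAVES_FILE"

copy_to_slaves() {
  local tarball="$1" host
  while read -r host; do
    # Ignore blank lines and comments.
    case "$host" in ''|'#'*) continue ;; esac
    if [ "$DRY_RUN" = "1" ]; then
      echo "scp $tarball $host:~"
    else
      # Redirect stdin so scp does not consume the host list.
      scp "$tarball" "$host:~" </dev/null
    fi
  done < "$SLAVES_FILE"
}

copy_to_slaves spark.tar.gz
```

Setting DRY_RUN=0 performs the actual copy, which relies on the passwordless SSH configured earlier.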

c. Un-tar configured spark setup on all the slaves

[php]tar xzf spark.tar.gz[/php]
NOTE: Run this command on all the slaves
“Congratulations! Apache Spark has been installed on all the slaves. Now start the daemons on the cluster.”

iv. Start Spark Cluster

a. Start Spark Services

[php]$SPARK_HOME/sbin/start-all.sh[/php]
Note: Run this command on Master

b. Check whether services have been started

Check daemons on Master

[php]jps[/php]
The output should list a process named Master.

Check daemons on Slaves

[php]jps[/php]
The output should list a process named Worker.
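
The daemon check can be scripted by scanning the jps output for the expected process name. The sketch below assumes the usual jps format of one "<pid> <name>" pair per line; has_daemon is a hypothetical helper and the sample PIDs are made up so the snippet runs without a cluster.

```shell
#!/usr/bin/env bash
# Sketch: report whether a daemon name appears in jps-style output.
# Reads "<pid> <name>" lines on stdin; exits 0 if the name is present.
has_daemon() {
  local name="$1"
  awk -v n="$name" '$2 == n { found = 1 } END { exit !found }'
}

# Made-up sample resembling jps output on the master.
sample_output='2481 Master
2650 Jps'

if printf '%s\n' "$sample_output" | has_daemon Master; then
  echo "Master is running"
fi
if ! printf '%s\n' "$sample_output" | has_daemon Worker; then
  echo "Worker is not running on this node"
fi

# On a real node: jps | has_daemon Worker
```

Note that jps is run without arguments; the daemon name is something to look for in its output, not a parameter.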

v. Spark Web UI

a. Spark Master UI

Open the Spark master UI to see the worker nodes, running applications, and cluster resources.
http://MASTER-IP:8080/

b. Spark application UI

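http://MASTER-IP:4040/
(Note: This UI is available only while an application is running, and it is served from the machine where the application’s driver was started.)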
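
To see the application UI in action, an application has to be running on the cluster. The sketch below builds a spark-submit command for the SparkPi example bundled with Spark. MASTER_HOST, the default SPARK_HOME path, the jar glob and the submit_cmd helper are assumptions for illustration; 7077 is the standalone master's default port. The script only prints the command, so it can be inspected before being run on the master.

```shell
#!/usr/bin/env bash
# Sketch: print a spark-submit command for the bundled SparkPi example.
# MASTER_HOST, SPARK_HOME default and the jar glob are assumptions.
MASTER_HOST="${MASTER_HOST:-master}"
SPARK_HOME="${SPARK_HOME:-/home/dataflair/spark-2.0.0-bin-hadoop2.6}"

submit_cmd() {
  printf '%s ' \
    "$SPARK_HOME/bin/spark-submit" \
    --master "spark://$MASTER_HOST:7077" \
    --class org.apache.spark.examples.SparkPi \
    "$SPARK_HOME/examples/jars/spark-examples_*.jar" 100
  printf '\n'
}

submit_cmd   # prints the command; paste it into a shell on the master
```

While the submitted job runs, it appears under running applications in the master UI on port 8080.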

vi. Stop the Cluster

a. Stop Spark Services

Once all the applications have finished, you can stop the Spark services (master and slave daemons) running on the cluster:
[php]$SPARK_HOME/sbin/stop-all.sh[/php]
Note: Run this command on Master

After installing Apache Spark, I recommend learning about Spark RDD, DataFrame, and Dataset. You can then move on to the Spark shell commands to start working with Spark.

So, that is all on how to install Apache Spark. We hope you liked our explanation.

3. Conclusion – Install Apache Spark

After installing Apache Spark on the multi-node cluster, you are ready to work with the Spark platform. You can now create RDDs, run operations on them across multiple nodes, and much more.
If you have any questions about installing Apache Spark, feel free to share them with us. We will be happy to help.

Ashish Garg says:
November 21, 2016 at 7:34 am
Thanks for this great tutorial.
Don’t we need to set up HDFS to share the repository between the master and all workers?
Can you share a tutorial for this?
- Djibrina B. says:
  September 8, 2017 at 8:12 am
  Hi,
  I am facing the same issue and would like to know if you found a solution. Thanks.
- Santosh says:
  September 9, 2017 at 4:58 am
  You can follow this link to setup multi-node hadoop cluster:
  http://data-flair.training/blogs/install-hadoop-2-x-ubuntu-hadoop-multi-node-cluster/
  - Djibrina B. says:
    September 11, 2017 at 6:12 am
    Thank you
    I meant Spark with HDFS. I do not actually know whether it is the same setup.
    My point is how to set up HDFS for Spark.
Kelvin says:
November 24, 2016 at 7:13 pm
Oh my goodness! Awesome article dude! Many thanks.
Mady says:
December 22, 2016 at 6:18 am
Fantastic blog to install Spark 2 in easy steps. Please share some Spark practicals as well to start with.
Nitin says:
February 9, 2017 at 7:23 am
Thanks for this lovely article. However, I am facing one problem: when running “jps Master” it throws this error:
“RMI Registry not available at Master:1099
Connection refused to host: Master; nested exception is:
java.net.ConnectException: Connection refused”
Can you help?
- Support says:
  February 9, 2017 at 10:57 am
  Dear Nitin,
  Please check the services by running the following command (rather than jps master):
  jps
  - Nitin says:
    February 10, 2017 at 11:31 am
    Thanks. It is working fine.
Krish Rajaram says:
February 26, 2017 at 5:02 am
Thanks for this post. I followed these steps and successfully created the cluster with spark 2.1.0. While I was testing a simple dataframe writer, it fails to write the output file to the target path. This happens only when run through spark-submit. But when I run the commands from spark-shell the output file is successfully stored in the target path. Did anyone encounter this issue?
Abdel says:
March 13, 2017 at 1:44 pm
Hi! Thanks for this article, it’s very helpful. However, I did not understand this part of your tutorial:
2.3.3 Add slaves:
Create configuration file slaves (in $SPARK_HOME/conf/) and add following entries:
slave01
slave02
Do we have to add these entries to the file spark-env.sh, or somewhere else?
Thanks in advance
- Ganesh says:
  May 23, 2018 at 5:02 pm
  Add these entries into a new slaves file like following:
  $cp slaves.template slaves (to copy the slaves.template file to another file named as slaves)
  $vim slaves
  master
  slave01
  slave02
Swaroop P says:
April 20, 2017 at 1:34 pm
I followed all your steps as you mentioned.
I am unable to connect the workers. Only the master is acting as both master and worker for me.
Does the above process require a Hadoop installation? I didn’t install Hadoop or YARN.
Please help me ASAP
- Dinesh Dev Pandey says:
  May 9, 2017 at 5:30 am
  Hi,
  I was facing the same problem. I checked the log generated for master. I found –
  “Service MasterUI is started on port 8081”.
  I tried http://Master_IP:8081 and it worked for me.
  You can also check the logs.
Ugur says:
May 2, 2017 at 6:54 am
Thanks for your awesome sharing,
However, I have a problem. I set up multi-node Spark according to your guidance, but I cannot access it using the IP of the master node (x.y.z.t:8080). How can I solve this problem?
- Emiliano Amendola says:
  July 7, 2017 at 4:27 pm
  You need to add these two lines to the $SPARK_HOME/conf/spark-env.sh file on your master and worker nodes:
  export SPARK_MASTER_HOST=YOUR.MASTER.IP.ADDRESS
  export SPARK_MASTER_WEBUI_PORT=8080
rama says:
June 1, 2017 at 3:34 pm
Thanks for your awesome sharing, I have installed Spark on multiple nodes successfully.
Djibrina B. says:
September 8, 2017 at 8:10 am
Thanks for this article.
However, I would like to know how to set up HDFS so that all workers and the master share the same repository.
I installed a Spark cluster with 3 workers and I would like to save a dataframe across all workers. I created the repository “home/data/” on each worker. It saves, but when I read it back I get a lost-files error: “java.io.FileNotFoundException: file part XXXX does not exist”.
I also tried using parquet and partitioning by column y, but I still get the same kind of error: “file footer not found”.
Any suggestions, please?
Thanks
Vaasir Nisaar says:
November 2, 2017 at 9:13 am
I have installed MapR with 1 control node and 2 data nodes. Now I am going to install Apache Spark on all nodes. How do I develop with Python on it?
Begna says:
November 7, 2017 at 11:36 am
Best tutorial. I had wasted my time on other alternatives. I am really happy; it helped me a lot with my project.
Thank you
vijander kumar says:
November 22, 2017 at 2:50 am
Is this setup of Spark over YARN/Mesos or standalone?
- shabarinath says:
  July 30, 2018 at 8:29 am
  Did you find out what type of installation this is? I am confused too.
Lebin says:
March 15, 2018 at 8:32 am
Very Nice article.
I have a question: how do I execute a job after configuring the cluster? Is it necessary to copy the jar to all the nodes (master as well as slaves)? Will it work if I keep the jar only on the master node?
Umesh says:
May 22, 2018 at 9:39 am
Hi,
Thank you for the article. However, when I try to submit a job on the master, it does not send it to the slave node. Could you please help me here?
detailed description:
I am deploying prediction.io on a multinode cluster where training should happen on the worker node. The worker node has been successfully registered with the master.
following are the logs of after starting slaves.sh
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/05/22 06:01:44 INFO Worker: Started daemon with process name: 2208@ip-172-31-6-235
18/05/22 06:01:44 INFO SignalUtils: Registered signal handler for TERM
18/05/22 06:01:44 INFO SignalUtils: Registered signal handler for HUP
18/05/22 06:01:44 INFO SignalUtils: Registered signal handler for INT
18/05/22 06:01:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/05/22 06:01:44 INFO SecurityManager: Changing view acls to: ubuntu
18/05/22 06:01:44 INFO SecurityManager: Changing modify acls to: ubuntu
18/05/22 06:01:44 INFO SecurityManager: Changing view acls groups to:
18/05/22 06:01:44 INFO SecurityManager: Changing modify acls groups to:
18/05/22 06:01:44 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); groups with view permissions: Set(); users with modify permissions: Set(ubuntu); groups with modify permissions: Set()
18/05/22 06:01:44 INFO Utils: Successfully started service 'sparkWorker' on port 45057.
18/05/22 06:01:44 INFO Worker: Starting Spark worker 172.31.6.235:45057 with 8 cores, 24.0 GB RAM
18/05/22 06:01:44 INFO Worker: Running Spark version 2.1.1
18/05/22 06:01:44 INFO Worker: Spark home: /home/ubuntu/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6
18/05/22 06:01:45 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
18/05/22 06:01:45 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://172.31.6.235:8081
18/05/22 06:01:45 INFO Worker: Connecting to master ip-172-31-5-119.ap-southeast-1.compute.internal:7077...
18/05/22 06:01:45 INFO TransportClientFactory: Successfully created connection to ip-172-31-5-119.ap-southeast-1.compute.internal/172.31.5.119:7077 after 19 ms (0 ms spent in bootstraps)
18/05/22 06:01:45 INFO Worker: Successfully registered with master spark://ip-172-31-5-119.ap-southeast-1.compute.internal:7077
Now the issues:
If I launch one slave on the master node and one slave on my other node:
1.1 If the slave on the master node is given fewer resources, it gives an “unable to re-shuffle” error.
1.2 If I give more resources to the worker on the master node, all the execution happens on the master node; it does not send any execution to the slave node.
If I do not start a slave on the master node:
2.1 I get the following error:
[WARN] [TaskSchedulerImpl] Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
I have assigned 24 GB of RAM and 8 cores to the worker.
However, when I start the process, these are the logs I get on the slave machine:
18/05/22 06:16:00 INFO Worker: Asked to launch executor app-20180522061600-0001/0 for PredictionIO Training: com.actionml.RecommendationEngine
18/05/22 06:16:00 INFO SecurityManager: Changing view acls to: ubuntu
18/05/22 06:16:00 INFO SecurityManager: Changing modify acls to: ubuntu
18/05/22 06:16:00 INFO SecurityManager: Changing view acls groups to:
18/05/22 06:16:00 INFO SecurityManager: Changing modify acls groups to:
18/05/22 06:16:00 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); groups with view permissions: Set(); users with modify permissions: Set(ubuntu); groups with modify permissions: Set()
18/05/22 06:16:00 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-8-oracle/bin/java" "-cp" "./:/home/ubuntu/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6/conf/:/home/ubuntu/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6/jars/*" "-Xmx4096M" "-Dspark.driver.port=45049" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://[email protected]:45049" "--executor-id" "0" "--hostname" "172.31.6.235" "--cores" "8" "--app-id" "app-20180522061600-0001" "--worker-url" "spark://[email protected]:45057"
18/05/22 06:16:50 INFO Worker: Asked to kill executor app-20180522061600-0001/0
18/05/22 06:16:50 INFO ExecutorRunner: Runner thread for executor app-20180522061600-0001/0 interrupted
18/05/22 06:16:50 INFO ExecutorRunner: Killing process!
18/05/22 06:16:51 INFO Worker: Executor app-20180522061600-0001/0 finished with state KILLED exitStatus 143
18/05/22 06:16:51 INFO Worker: Cleaning up local directories for application app-20180522061600-0001
18/05/22 06:16:51 INFO ExternalShuffleBlockResolver: Application app-20180522061600-0001 removed, cleanupLocalDirs = true
Thanks!
yolanda li says:
May 29, 2018 at 4:55 am
Thanks, it really helped.
Aman Agarwal says:
June 7, 2018 at 1:06 pm
Hi,
I would like to ask how to install Spark to use it as an execution engine for Hive. I already have Hive installed on a multi-node cluster and now want to use Spark as the execution engine instead of MR.
Aman Agarwal says:
June 8, 2018 at 7:18 am
Hi,
I want to use Spark as Hive’s execution engine. I have Hive installed on a cluster of 1000 nodes and now want to install Spark to run Hive on Spark. How do I install Spark so that it can be used as Hive’s execution engine?
rohit says:
July 4, 2019 at 3:19 pm
What if the slaves’ usernames are different?
SkGa says:
May 29, 2020 at 3:17 am
Works great. Make sure to open the master’s port in the firewall so that the workers appear in the web UI.
NGUYỄN THỊ NGÁT says:
May 28, 2021 at 9:39 pm
I’ve set up 3 machines and done a test run. Spark-submit runs without error. However, I don’t see the results returned in stdout. Can you help explain how Spark works and how to view its results?
For example, when running a regression with 20 records, Spark splits the work between the workers for execution; how does it then produce the final result?
Also, does Hadoop have to be installed alongside it? And must the input file be available on all machines?
Thanks.
jean says:
November 11, 2022 at 12:03 pm
Hi,
Thank you for this lovely article. However, I could not start the master’s and workers’ daemons. I get the message below when running the command sbin/start-all.sh.
I spent hours looking for a solution on the web but could not find one. Your help would be very much appreciated.
Note: I’ve installed the default JRE on Ubuntu instead of oracle-java-7 because I was unable to add the repo ppa:webupd8team/java.
Many thanks
### ERROR MESSAGE ##
starting org.apache.spark.deploy.master.Master, logging to /home/vagrant/spark-3.3.1-bin-hadoop3/logs/spark-vagrant-org.apache.spark.deploy.master.Master-1-worker1.out
failed to launch: nice -n 0 /home/vagrant/spark-3.3.1-bin-hadoop3/bin/spark-class org.apache.spark.deploy.master.Master --host worker1 --port 7077 --webui-port 8080
at java.net.URLClassLoader.access$100(URLClassLoader.java:65)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.net.URLClassLoader$1.run(URLClassLoader.java:349)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:348)
at java.lang.ClassLoader.loadClass(ClassLoader.java:430)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:323)
at java.lang.ClassLoader.loadClass(ClassLoader.java:363)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:482)
/home/vagrant/spark-3.3.1-bin-hadoop3/bin/spark-class: line 96: CMD: bad array subscript
full log in /home/vagrant/spark-3.3.1-bin-hadoop3/logs/spark-vagrant-org.apache.spark.deploy.master.Master-1-worker1.out
