Install & Run Apache Flink on Multi-node Cluster


In this blog, we will learn how to install Apache Flink in cluster mode on Ubuntu 14.04. Running Flink on multiple nodes is also called distributed mode.

This blog provides a step-by-step tutorial for installing Apache Flink on a multi-node cluster. Apache Flink is a lightning-fast cluster computing engine, also known as the 4G of Big Data.

Install Apache Flink on Multi-node Cluster

Follow the steps given below to install Apache Flink on a multi-node cluster:

2.1. Platform

I. Platform Requirements

Operating system: Ubuntu 14.04 or later. Other Linux flavors such as CentOS or Red Hat can also be used.
Java 7.x or higher

II. Configure & Setup Platform

If you are using Windows or Mac OS, you can create a virtual machine and install Ubuntu on it using VMware Player or, alternatively, Oracle VirtualBox.

2.2. Prerequisites

I. Install Java 7

NOTE: Install Java on all the nodes of the cluster

a. Install python-software-properties

Apache Flink requires Java because it runs on the JVM. First, we need to install python-software-properties so that we can add the Java repository. To download and install python-software-properties, use the following command:
[php]dataflair@ubuntu:~$ sudo apt-get install python-software-properties[/php]

b. Add Repository

To add the repository, run the command below in a terminal:
[php]dataflair@ubuntu:~$ sudo add-apt-repository ppa:webupd8team/java [/php]

c. Update the source list

[php]dataflair@ubuntu:~$ sudo apt-get update [/php]

d. Install Java

Now we will download and install Java. Run the command below in a terminal:
[php]dataflair@ubuntu:~$ sudo apt-get install oracle-java7-installer[/php]
This command downloads and installs Java automatically.
To check that the installation completed and that a working Java is available, use the command below:
[php]dataflair@ubuntu:~$ java -version[/php]
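Since Java has to be present on every node, it can help to check the version from a script instead of reading it by eye. The sketch below is one possible way to do that; `java_major` is a hypothetical helper name (not part of Java or Flink), and it assumes the version string has the usual `1.7.0_80` or `11.0.2` shape.

```shell
# java_major: hypothetical helper that extracts the major version
# from a Java version string such as "1.7.0_80" or "11.0.2".
java_major() {
  v="$1"
  case "$v" in
    1.*) v="${v#1.}" ;;   # legacy scheme: 1.7.0_80 -> 7.0_80
  esac
  echo "${v%%[._-]*}"     # keep everything before the first ., _ or -
}

# Usage: read the quoted version from `java -version` (printed on stderr)
# and require Java 7 or newer, as stated in the platform requirements.
if command -v java >/dev/null 2>&1; then
  ver=$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
  major=$(java_major "$ver")
  if [ "${major:-0}" -ge 7 ]; then
    echo "Java $ver is new enough"
  else
    echo "Java $ver is too old or could not be parsed"
  fi
else
  echo "java not found on PATH"
fi
```

Run the same check on each node of the cluster, since Java must be installed on all of them.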

II. Configure SSH

SSH stands for Secure Shell and is used for remote login: with SSH we can log in to a remote machine. Now we need to configure passwordless SSH.

Passwordless SSH means we can log in to a remote machine without entering a password. It is required for remote script invocation: the master uses it to start the daemons on the slaves automatically.

a. Install OpenSSH Server and Client

[php]$ sudo apt-get install openssh-server openssh-client[/php]

b. Generate Key Pairs

[php]$ ssh-keygen -t rsa -P ""[/php]
It will prompt for the file in which to save the key (/home/dataflair/.ssh/id_rsa). Keep the default: do not specify a path, just press “Enter”. The key pair is then stored in the default directory, “.ssh”.

To check the default path, run “$ ls .ssh/”. You will see that two files were created: “id_rsa”, which is the private key, and “id_rsa.pub”, which is the public key.

c. Configure password-less SSH

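Append the contents of the master's “id_rsa.pub” to the “authorized_keys” file on the master and on every slave. The command below only updates the machine it is run on, so on the master it authorizes the master itself; the same public key must also be appended to “~/.ssh/authorized_keys” on each slave.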
[php]$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys[/php]
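One common way to get the master's public key onto each slave is `ssh-copy-id`, which ships with the OpenSSH client. The sketch below assumes the example slave addresses used in this tutorial and the same user name on every node; `RUN` and `copy_key_to` are hypothetical names introduced here for a dry run, so by default the commands are only printed.

```shell
# RUN is a hypothetical dry-run switch: by default commands are only echoed.
# Export RUN="" (empty) to really execute them.
RUN=${RUN-echo}

# copy_key_to HOST...: append the master's public key to authorized_keys
# on each given host.
copy_key_to() {
  for host in "$@"; do
    # ssh-copy-id asks for the remote password once, then installs the key
    $RUN ssh-copy-id -i "$HOME/.ssh/id_rsa.pub" "$host"
  done
}

# Example slave IPs from this tutorial; replace them with your own.
copy_key_to 192.168.1.4 192.168.1.5
```

Run this on the master. Once the printed commands look right, rerun with `RUN=""` exported to perform the copy.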

d. Check by SSH to all the hosts

[php]$ ssh localhost[/php]
[php]$ ssh <SLAVE-IP> [/php]
It should not ask for a password; you should be logged in to the remote machine directly, since we have configured passwordless SSH.

2.3. Install Apache Flink in Cluster Mode

I. Install Flink on Master

a. Download the Flink Setup

Download the Flink Setup from its official website http://flink.apache.org/downloads.html

b. Untar the file

To extract the contents of the compressed Apache Flink package, use the command below:
[php]dataflair@ubuntu:~$ tar xzf flink-1.1.3-bin-hadoop26-scala_2.11.tgz[/php]

c. Rename the directory

[php]dataflair@ubuntu:~$ mv flink-1.1.3/ flink [/php]

d. Setup Configuration

i. Go to Flink conf directory

[php]dataflair@ubuntu:~$ cd flink[/php]
[php]dataflair@ubuntu:~/flink$ cd conf [/php]

ii. Add the entry of Master

Choose a master node (JobManager) and set the jobmanager.rpc.address in conf/flink-conf.yaml to its IP or hostname. Make sure that all nodes in your cluster have the same jobmanager.rpc.address configured.
[php]dataflair@ubuntu:~/flink/conf$ nano flink-conf.yaml[/php]
Set this line (add it if it is not already present):
[php]jobmanager.rpc.address: 192.168.1.3[/php]
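Because every node needs the same value, the change can also be scripted instead of made by hand in an editor. The sketch below is one possible approach; `set_conf` is a hypothetical helper (not a Flink tool), and the demo works on a throwaway file rather than the real conf/flink-conf.yaml.

```shell
# set_conf FILE KEY VALUE: hypothetical helper that replaces "KEY: ..." in
# FILE if the key is already present, and appends "KEY: VALUE" otherwise.
set_conf() {
  if grep -q "^$2:" "$1"; then
    sed "s|^$2:.*|$2: $3|" "$1" > "$1.tmp" && mv "$1.tmp" "$1"
  else
    echo "$2: $3" >> "$1"
  fi
}

# Demo on a temporary file standing in for conf/flink-conf.yaml.
demo=$(mktemp)
echo "jobmanager.rpc.address: localhost" > "$demo"
set_conf "$demo" jobmanager.rpc.address 192.168.1.3
cat "$demo"   # prints: jobmanager.rpc.address: 192.168.1.3
rm -f "$demo"
```

Replacing an existing entry rather than appending a second one keeps the file free of duplicate keys.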

iii. Add the entry of all the Slaves

Add the IPs or hostnames (one per line) of all worker nodes (TaskManager) to the conf/slaves file. To edit the file, use the following command:
[php]dataflair@ubuntu:~/flink/conf$ nano slaves [/php]
Enter the IP addresses like this:
192.168.1.4
192.168.1.5
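The same file can be written without an editor. A minimal sketch, assuming the example worker addresses above; `write_slaves` is a hypothetical helper name, and the demo writes to a temporary file instead of conf/slaves.

```shell
# write_slaves FILE HOST...: hypothetical helper that writes one host per line.
write_slaves() {
  file="$1"
  shift
  printf '%s\n' "$@" > "$file"
}

# Demo with the tutorial's example worker IPs; on the master the target
# would be ~/flink/conf/slaves.
demo=$(mktemp)
write_slaves "$demo" 192.168.1.4 192.168.1.5
cat "$demo"
rm -f "$demo"
```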

II. Install Flink on Slaves

a. Copy configured setup from master to all the slaves

We will create a tarball of the configured Flink setup and copy it to all the slaves.

i. Create tar-ball of configured setup:

[php] $ tar czf flink.tar.gz flink [/php]
NOTE: Run this command on Master, from the directory that contains the flink directory (the home directory in this tutorial)

ii. Copy the configured tar-ball to all the slaves

[php] $ scp flink.tar.gz 192.168.1.4:~
$ scp flink.tar.gz 192.168.1.5:~[/php]
NOTE: Run this command on Master

iii. Un-tar the configured Flink setup on all the slaves

[php] $ tar xzf flink.tar.gz [/php]
NOTE: Run this command on all the slaves
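With more than a couple of slaves, the copy and extract steps can be driven from the slaves file. The following is a sketch under this tutorial's assumptions (same user and home directory on every node, passwordless SSH already working); `RUN` and `push_flink` are hypothetical names, and by default the commands are only printed, not executed.

```shell
# RUN is a hypothetical dry-run switch: commands are echoed unless RUN is
# exported as an empty string.
RUN=${RUN-echo}

# push_flink SLAVES_FILE: copy flink.tar.gz to every host listed in the file
# and extract it in the remote home directory.
push_flink() {
  # Read hosts on fd 3 so that ssh cannot swallow the rest of the list.
  while read -r host <&3; do
    [ -n "$host" ] || continue          # skip blank lines
    $RUN scp flink.tar.gz "${host}:~"
    $RUN ssh "$host" "tar xzf flink.tar.gz"
  done 3< "$1"
}

# Demo with a temporary slaves file holding the tutorial's example IPs.
demo=$(mktemp)
printf '%s\n' 192.168.1.4 192.168.1.5 > "$demo"
push_flink "$demo"
rm -f "$demo"
```

On the master, the argument would be the real ~/flink/conf/slaves file, and the function would be run from the directory that holds flink.tar.gz.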

2.4. Start the Flink Cluster

I. Start the cluster

To start the cluster, run the script below. It starts all the daemons on the master as well as on the slaves.
[php]dataflair@ubuntu:~/flink/$ bin/start-cluster.sh[/php]
NOTE: Run this command on Master

II. Check whether services have been started

a. Check daemons on Master

[php] $ jps
JobManager [/php]

b. Check daemons on Slaves

[php] $ jps
TaskManager [/php]
The cluster has now been set up successfully and is ready to use.
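The jps check can also be scripted, for example as part of a simple health check. A minimal sketch: `has_daemon` is a hypothetical helper that reads jps-style output (one `PID Name` pair per line) on standard input, and the sample lines below are made up for illustration.

```shell
# has_daemon NAME: succeed if jps-style input on stdin lists a JVM called NAME.
has_daemon() {
  awk -v name="$1" '$2 == name { found = 1 } END { exit !found }'
}

# Illustration with made-up jps output; on a real node use:
#   jps | has_daemon JobManager
sample='4211 JobManager
4398 Jps'
if printf '%s\n' "$sample" | has_daemon JobManager; then
  echo "JobManager is running"
fi
```

On a slave, the same helper would be called with TaskManager instead of JobManager.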

III. Stop the cluster

To stop the cluster, run the script below. It stops all the daemons on the master as well as on the slaves.
[php]dataflair@ubuntu:~/flink/$ bin/stop-cluster.sh[/php]
Follow this tutorial for a real-life use case of Apache Flink.

Will Spark or Flink be the successor of Hadoop MapReduce? Refer to the Spark vs Flink comparison guide.


Shawnee says:
August 19, 2016 at 5:18 pm
Great website. Plenty of useful information here. I’m sending it
to a few pals and additionally sharing in delicious.
And obviously, thank you to your sweat!
Soad Elshahat says:
April 24, 2018 at 10:22 pm
Can I install multi-node cluster on my laptop(single machine)?
Gutierrez says:
July 31, 2018 at 12:20 pm
I am not sure about Java version 7. The original documentation asks for Java 1.8.x or higher. https://ci.apache.org/projects/flink/flink-docs-master/ops/deployment/cluster_setup.html