Install YARN with Hadoop 2 in Pseudo-Distributed Mode


1. Install Yarn with Hadoop 2: Objective

This guide provides simple, easy-to-follow, step-by-step documentation to install YARN with Hadoop 2 on Ubuntu. In a single-node Hadoop cluster all master as well as slave daemons run on the same machine; such a setup is also called Hadoop in pseudo-distributed mode. This quickstart explains how to set up and deploy Hadoop quickly so that, after installation, you can play with it: store data in HDFS and submit MapReduce jobs and other applications such as Spark.


2. Steps to Install Yarn with Hadoop 2 on Ubuntu

2.1. Platform Requirements to Install Yarn with Hadoop 2

Operating system: Ubuntu 14.04 or later; other Linux flavors such as CentOS or Red Hat can also be used.
Hadoop: The latest version of Apache Hadoop 2.x or Cloudera CDH 5.x.

2.2. Configure & Setup Platform to install yarn with Hadoop 2

If you are using Windows or macOS, you can create a virtual machine and install Ubuntu in it using VMware Player or, alternatively, Oracle VirtualBox.

2.3. Software you need to install before you install Yarn with Hadoop 2

I. Install Java 8

a) Install Python Software Properties
Before adding the Java repository, we first need to install python-software-properties. To do so, execute the command below in a terminal:

[php] sudo apt-get install python-software-properties[/php]

NOTE: On executing the above command you will be asked for your password, because installing software requires root privileges and we are using the "sudo" command. Root privileges are required whenever we install or configure software.

b) Add Repository
Now we need to Add Repository. Surely a question had arrived in your mind that what is a Repository? The repository is simply a collection of list of software for a Linux distribution on a server from where we can easily get the software download and installed using “apt-get” command in Linux OS. So, in this step, we will add a repository manually so that Ubuntu can easily install the Java. To add a Java repositories we will use the following command:

[php] sudo add-apt-repository ppa:webupd8team/java[/php]

Press [Enter] to continue the installation.

c) Update the source list
The source list is the location from which Ubuntu downloads and updates software. It is good practice to update the source list before installing a new package or updating previously installed software, so that the latest package information is used. To update the source list, execute the command below:

[php] sudo apt-get update[/php]

On executing the above command, Ubuntu's source list gets updated.

d) Install Java
The Hadoop framework is written in Java, so it requires Java to be installed. To download and install Java, use the command below:

[php] sudo apt-get install oracle-java8-installer[/php]

On executing this command, Java is downloaded and installed automatically.

To check whether the installation completed successfully and to see which version of Java is installed, use the command below:

[php] java -version[/php]
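
If Java was installed correctly, the output should look roughly like the following (the version and build numbers shown here are only illustrative and will differ on your system):

[php]java version "1.8.0_xx"
Java(TM) SE Runtime Environment (build 1.8.0_xx-bxx)
Java HotSpot(TM) 64-Bit Server VM (build 25.xx-bxx, mixed mode)[/php]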

II. Configure SSH

Secure Shell (SSH) is used to log in to remote machines. SSH setup is required to perform different operations on a cluster, such as starting and stopping the distributed daemons. To work seamlessly, SSH needs to be set up to allow password-less login for the Hadoop user from the machines in the cluster; password-less SSH is required for remote script invocation, so that the master can automatically start the daemons on the slaves.

a) Install Open SSH Server-Client
These SSH tools are required for remote login; install them with:

[php] sudo apt-get install openssh-server openssh-client[/php]

b) Generate Key Pairs
This generates a public/private key pair that can be used for authentication.

[php] ssh-keygen -t rsa -P ""[/php]

Note: A message will appear asking for the file in which to save the key (/home/hdadmin/.ssh/id_rsa). We don't need to specify any path; just press "Enter" to accept the default, which places the keys in the ".ssh" directory. Running "$ ls .ssh/" shows the two generated files: "id_rsa", the private key, and "id_rsa.pub", the public key.

c) Configure passwordless SSH
Now we need to copy the contents of “id_rsa.pub” into the “authorized_keys” file. This can be done by executing the following command:

[php] cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys[/php]
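
If password-less login still prompts for a password after this step, it is often because the permissions on the ".ssh" directory or the "authorized_keys" file are too open. Tightening them (a common fix, not part of the original steps) usually resolves it:

[php]chmod 700 $HOME/.ssh
chmod 600 $HOME/.ssh/authorized_keys[/php]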

d) Check by SSH to localhost

[php] ssh localhost[/php]

As we have already configured password-less SSH, it will not ask for a password and we are logged in to localhost. The first time you connect, you may be asked to confirm the host's key fingerprint; type "yes" to continue.

2.4. Install Hadoop

To install Hadoop 2.6.0-CDH-5.5.1 in pseudo-distributed mode successfully and easily, follow the steps below:

I. Download Hadoop from the below link:

[php]http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.5.1.tar.gz[/php]

After the download completes, copy the Hadoop archive to your home directory.
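
For example, the archive can be fetched directly from the link above with wget and then moved to the home directory (a sketch; downloading it through a browser works just as well):

[php]wget http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.5.1.tar.gz
mv hadoop-2.6.0-cdh5.5.1.tar.gz $HOME/[/php]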

II. Untar Tar-ball

To extract the contents of the compressed Hadoop package, use the command below:
[php] tar xzf hadoop-2.6.0-cdh5.5.1.tar.gz[/php]

Note: The extracted directory (e.g. hadoop-2.6.0-cdh5.5.1) is referred to as HADOOP_HOME; it contains all the libraries, scripts, configuration files, etc.
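
A quick way to confirm the extraction worked is to list the new directory; based on the paths used later in this guide, it should contain at least bin/, sbin/, etc/hadoop/ and share/:

[php]ls hadoop-2.6.0-cdh5.5.1[/php]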

III. Setup Configuration

a) Edit .bashrc
Now we need to make changes in the environment file ".bashrc", which is present in our home directory. Open it with the command "$ nano ~/.bashrc" and add the following parameters at the end of the file:

[php]export HADOOP_PREFIX="/home/dataflair/hadoop-2.6.0-cdh5.5.1"
export PATH=$PATH:$HADOOP_PREFIX/bin
export PATH=$PATH:$HADOOP_PREFIX/sbin
export HADOOP_MAPRED_HOME=${HADOOP_PREFIX}
export HADOOP_COMMON_HOME=${HADOOP_PREFIX}
export HADOOP_HDFS_HOME=${HADOOP_PREFIX}
export YARN_HOME=${HADOOP_PREFIX}[/php]

Note: Take care to enter the correct path. "/home/dataflair/hadoop-2.6.0-cdh5.5.1" is the path on my machine; you can find your own home directory path with the "$ pwd" command.

After adding the above parameters, save the file: in nano, press "Ctrl+X", then "Y", then "Enter" to write the changes and exit.

Note: After performing the above steps, restart the terminal so that the environment variables take effect.
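
Alternatively, you can re-read .bashrc in the current terminal and confirm the variables are in effect (assuming the paths above are correct for your machine):

[php]source ~/.bashrc
echo $HADOOP_PREFIX
hadoop version[/php]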

b) Edit hadoop-env.sh
Now we need to make changes in the Hadoop configuration file "hadoop-env.sh", which is present in the configuration directory (HADOOP_HOME/etc/hadoop), and set JAVA_HOME there:

[php]dataflair@ubuntu:~$cd hadoop-2.6.0-cdh5.5.1/
dataflair@ubuntu:~/hadoop-2.6.0-cdh5.5.1$ cd etc/hadoop
dataflair@ubuntu:~/hadoop-2.6.0-cdh5.5.1/etc/hadoop$ nano hadoop-env.sh[/php]

In this file, set the path of JAVA_HOME as follows:

[php]export JAVA_HOME=<path-to-the-root-of-your-Java-installation> (eg: /usr/lib/jvm/java-8-oracle/)[/php]

After changing the path of JAVA_HOME, save the file (Ctrl+X, then Y, then Enter in nano).
Note: "/usr/lib/jvm/java-8-oracle/" is the default installation path of Oracle Java 8 on Ubuntu. If Java is installed at a different location on your system, enter that path here instead.
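
If you are not sure where Java is installed, the following command (a standard Linux utility, not part of the original guide) prints the actual installation path; strip the trailing /jre/bin/java part to get the value for JAVA_HOME:

[php]readlink -f $(which java)
# e.g. /usr/lib/jvm/java-8-oracle/jre/bin/java, so JAVA_HOME is /usr/lib/jvm/java-8-oracle[/php]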

c) Edit core-site.xml

Now we need to make changes in the Hadoop configuration file core-site.xml (located in HADOOP_HOME/etc/hadoop). Open it with the command below:

[php]dataflair@ubuntu:~/hadoop-2.6.0-cdh5.5.1/etc/hadoop$ nano core-site.xml[/php]

Add the entries below between the <configuration> </configuration> tags present at the end of this file:

[php]<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/dataflair/hdata</value>
</property>
[/php]

The fs.defaultFS property specifies the location of the NameNode; here the NameNode runs on localhost at port 9000. The hadoop.tmp.dir property specifies the location where Hadoop stores its temporary as well as permanent data.

Note: Take care to enter a correct path. "/home/dataflair/hdata" is my location; specify a location where you have read-write privileges.

After adding the above parameters, save the file (Ctrl+X, then Y, then Enter, if you are using the nano editor).
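
It also helps to create the data directory up front so that Hadoop can write to it; the path below is the one used in core-site.xml above, so adjust it to your own location:

[php]mkdir -p /home/dataflair/hdata[/php]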

d) Edit hdfs-site.xml
Now we need to make changes in the Hadoop configuration file hdfs-site.xml (located in HADOOP_HOME/etc/hadoop). Open it with the command below:

[php]dataflair@ubuntu:~/hadoop-2.6.0-cdh5.5.1/etc/hadoop$ nano hdfs-site.xml[/php]

Add the entries below between the <configuration> </configuration> tags present at the end of this file:

[php]<property>
<name>dfs.replication</name>
<value>1</value>
</property>
[/php]

The dfs.replication property specifies the replication factor; since this is a single-node cluster, we set the replication to 1.

After adding the above parameters, save the file (Ctrl+X, then Y, then Enter).

e) Edit mapred-site.xml

Now we need to make changes in the Hadoop configuration file mapred-site.xml (located in HADOOP_HOME/etc/hadoop).

Note: mapred-site.xml does not exist by default, so we first create it as a copy of mapred-site.xml.template using the following command:

[php]dataflair@ubuntu:~/hadoop-2.6.0-cdh5.5.1/etc/hadoop$ cp mapred-site.xml.template mapred-site.xml[/php]

We will now edit the mapred-site.xml file by using the following command:

[php]dataflair@ubuntu:~/hadoop-2.6.0-cdh5.5.1/etc/hadoop$ nano mapred-site.xml[/php]

Add the entries below between the <configuration> </configuration> tags present at the end of this file:

[php]<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
[/php]

The mapreduce.framework.name property specifies which framework should be used for MapReduce; here it is set to yarn.

f) Edit yarn-site.xml
Now we need to make changes in the Hadoop configuration file yarn-site.xml (located in HADOOP_HOME/etc/hadoop). Open it with the command below:

[php]dataflair@ubuntu:~/hadoop-2.6.0-cdh5.5.1/etc/hadoop$ nano yarn-site.xml[/php]

Add the entries below between the <configuration> </configuration> tags present at the end of this file:

[php]<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
[/php]

The yarn.nodemanager.aux-services property specifies the auxiliary service that needs to run alongside the NodeManager; here the MapReduce shuffle is used as the auxiliary service. The yarn.nodemanager.aux-services.mapreduce.shuffle.class property specifies the class that implements the shuffle service.

IV. Start the Hadoop Cluster

a) Format the namenode
[php]dataflair@ubuntu:~$ hdfs namenode -format[/php]
NOTE: After we install YARN with Hadoop 2, we need to perform this step only once. If this command is run again, all the data will be deleted and the NameNode will be assigned a new namespaceID; if the NameNode and DataNode IDs do not match, the DataNode daemons will not start when the daemons are brought up.
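
If you do run into mismatched IDs on this single-node setup (and the data is disposable), one common recovery is to stop the daemons, clear the data directory configured in hadoop.tmp.dir, and format again; the path below is the one used earlier in core-site.xml:

[php]stop-dfs.sh
rm -rf /home/dataflair/hdata/*
hdfs namenode -format[/php]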

b) Start HDFS Daemons
[php]dataflair@ubuntu:~$ start-dfs.sh[/php]

c) Start YARN Daemons
[php]dataflair@ubuntu:~$ start-yarn.sh[/php]
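
If any of the daemons fail to come up, the log files are the first place to look; by default they are written under the Hadoop installation directory, using the HADOOP_PREFIX variable set earlier in .bashrc:

[php]ls $HADOOP_PREFIX/logs[/php]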

d) Check the status of services
[php]dataflair@ubuntu:~$ jps[/php]

[php]NameNode
DataNode
ResourceManager
NodeManager[/php]

If the above output appears after running the jps command (a SecondaryNameNode process and the Jps process itself will typically also be listed), you have successfully installed Hadoop, and HDFS and MapReduce operations can be performed.
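
As a further check (assuming Hadoop 2's default ports), the NameNode web UI is normally available at http://localhost:50070 and the ResourceManager UI at http://localhost:8088. A minimal HDFS smoke test, using a throwaway test file, might look like this:

[php]hdfs dfs -mkdir -p /user/dataflair
echo "hello hadoop" > test.txt
hdfs dfs -put test.txt /user/dataflair/
hdfs dfs -cat /user/dataflair/test.txt[/php]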


This was all about the steps to install YARN with Hadoop 2. If you liked this article on installing YARN with Hadoop 2, let us know in the comment section below; if you have any query regarding how to install YARN with Hadoop, let us know that too and our support team will be happy to help you.
