Install Hadoop 2 with YARN in Pseudo-Distributed Mode


1. Objective

This guide provides simple, step-by-step instructions to install Hadoop 2 with YARN on Ubuntu. In a single-node Hadoop cluster all master and slave daemons run on the same machine; such a setup is also called Hadoop in pseudo-distributed mode. This quickstart shows how to set up and deploy Hadoop quickly so that you can start experimenting with it right away. Once Hadoop is installed you can store data in HDFS and submit MapReduce jobs as well as other applications such as Spark.

Learn how to install Hadoop 2 with YARN in pseudo-distributed mode.

2. Steps to Install Hadoop 2 with YARN on Ubuntu

2.1. Platform Requirements

Operating system: Ubuntu 14.04 or later. Other Linux flavors such as CentOS or Red Hat can also be used.

Hadoop: the latest Apache Hadoop 2.x release or Cloudera CDH 5.x.

2.2. Configure & Setup Platform

If you are using Windows or macOS, you can create a virtual machine and install Ubuntu in it using either VMware Player or Oracle VirtualBox.

2.3. Software you need to install before installing Hadoop

I. Install Java 8

a) Install Python Software Properties

Before adding the Java repository, we need to install the python-software-properties package by executing the below command in a terminal:

 sudo apt-get install python-software-properties

NOTE: On executing the above command you will be asked for your password, because installing or configuring software requires root privileges and we are running the command with "sudo".

b) Add Repository

Now we need to add a repository. A repository is simply a collection of software packages for a Linux distribution, hosted on a server, from which software can be downloaded and installed with the "apt-get" command. In this step we add a repository manually so that Ubuntu can install Java. To add the Java repository, use the following command:

 sudo add-apt-repository ppa:webupd8team/java

Press [Enter] when prompted to continue.

c) Update the source list

The source list is the index of locations from which Ubuntu downloads and updates software. It is good practice to update the source list before installing a new package or upgrading an existing one, so that the latest package information is used. To update the source list, execute the command below:

 sudo apt-get update

On executing the above command, Ubuntu's package index is refreshed.

d) Install Java

The Hadoop framework is written in Java, so Java must be installed before Hadoop. To download and install Java, use the command below:

 sudo apt-get install oracle-java8-installer

On executing this command, Java is downloaded and installed automatically.

To check that the installation completed successfully and to see which version of Java is installed, use the command below:

 java -version
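If you later need the exact installation path of Java (for example when setting JAVA_HOME), a quick way to locate it is to resolve the real path of the java binary; on a default Oracle Java 8 setup this points inside /usr/lib/jvm/java-8-oracle (the path may differ on your system):

 readlink -f $(which java)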

II. Configure SSH

Secure Shell (SSH) is used to log in to remote machines. Hadoop relies on SSH for cluster operations such as starting, stopping, and managing the distributed daemons. For this to work seamlessly, SSH must be set up to allow password-less login for the Hadoop user from the machines in the cluster. Password-less SSH is required for remote script invocation, so that the master can start the daemons on the slaves automatically.

a) Install Open SSH Server-Client

Install the SSH server and client packages that are needed for remote login:

 sudo apt-get install openssh-server openssh-client

b) Generate Key Pairs

This generates a public/private key pair that will be used for authentication:

 ssh-keygen -t rsa -P ""

Note: You will be prompted for the file in which to save the key (/home/hdadmin/.ssh/id_rsa): do not specify any path, just press "Enter" to accept the default location, i.e. ".ssh". You can list that directory with "ls ~/.ssh/" and see that two files were generated: "id_rsa" (the private key) and "id_rsa.pub" (the public key).

c) Configure passwordless SSH

Now we need to copy the contents of “id_rsa.pub” into the “authorized_keys” file. This can be done by executing the following command:

 cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
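If SSH still asks for a password later on, a common cause is overly permissive modes on the .ssh directory; a small fix, assuming the default key locations, is:

 chmod 700 $HOME/.ssh
 chmod 600 $HOME/.ssh/authorized_keys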

d) Check by SSH to localhost

 ssh localhost

Since password-less SSH is already configured, this should log you in to localhost without asking for a password.

2.4. Install Hadoop

To install Hadoop 2.6.0-cdh5.5.1 in pseudo-distributed mode, follow the steps below:

I. Download Hadoop from the below link:

http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.5.1.tar.gz

After the download completes, copy the Hadoop archive to your home directory.
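For example, the archive can be downloaded straight into the home directory with wget (assuming wget is installed and the link above is still reachable):

 cd ~
 wget http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.5.1.tar.gz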

II. Untar Tar-ball

Extract the contents of the compressed Hadoop package with the command below:

 tar xzf hadoop-2.6.0-cdh5.5.1.tar.gz

Note: The extracted directory (e.g. hadoop-2.6.0-cdh5.5.1) is referred to as HADOOP_HOME; it contains all the libraries, scripts, configuration files, etc.
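To confirm the layout, you can list the extracted directory; in a typical Hadoop 2.x distribution you should see, among others, the bin, sbin, and etc/hadoop sub-directories:

 ls hadoop-2.6.0-cdh5.5.1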

III. Setup Configuration

a) Edit .bashrc

Now we need to edit the ".bashrc" file, which is present in the home directory, to set the Hadoop environment variables. Open it with the command "$ nano ~/.bashrc" and add the following lines at the end of the file:

export HADOOP_PREFIX="/home/dataflair/hadoop-2.6.0-cdh5.5.1"
export PATH=$PATH:$HADOOP_PREFIX/bin
export PATH=$PATH:$HADOOP_PREFIX/sbin
export HADOOP_MAPRED_HOME=${HADOOP_PREFIX}
export HADOOP_COMMON_HOME=${HADOOP_PREFIX}
export HADOOP_HDFS_HOME=${HADOOP_PREFIX}
export YARN_HOME=${HADOOP_PREFIX}

Note: Take care to enter the correct path. "/home/dataflair/hadoop-2.6.0-cdh5.5.1" is where I extracted Hadoop under my home directory; you can find the path of your own home directory with the "pwd" command.

After adding the above lines, save the file (in nano, press "Ctrl+X" and confirm).

Note: After performing the above steps, restart the terminal so that the new environment variables take effect.
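Alternatively, you can reload the file in the current shell and then verify that the variables are picked up ("hadoop version" should print the installed release):

 source ~/.bashrc
 echo $HADOOP_PREFIX
 hadoop version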

b) Edit hadoop-env.sh

Now we need to edit the Hadoop configuration file "hadoop-env.sh", which is present in the configuration directory (HADOOP_HOME/etc/hadoop), and set JAVA_HOME there:

dataflair@ubuntu:~$cd hadoop-2.6.0-cdh5.5.1/
dataflair@ubuntu:~/hadoop-2.6.0-cdh5.5.1$ cd etc/hadoop
dataflair@ubuntu:~/hadoop-2.6.0-cdh5.5.1/etc/hadoop$ nano hadoop-env.sh

In this file, set the path of JAVA_HOME as follows:

export JAVA_HOME=<path-to-the-root-of-your-Java-installation> (eg: /usr/lib/jvm/java-8-oracle/)

After changing the path of JAVA_HOME, save the file (in nano, press "Ctrl+X" and confirm).

Note: "/usr/lib/jvm/java-8-oracle/" is the default installation path of Oracle Java 8. If Java is installed at a different location on your system, enter that path here.

c) Edit core-site.xml

Now we need to edit the Hadoop configuration file core-site.xml (located in HADOOP_HOME/etc/hadoop) by executing the below command:

dataflair@ubuntu:~/hadoop-2.6.0-cdh5.5.1/etc/hadoop$ nano core-site.xml

Add the following entries between the <configuration> </configuration> tags at the end of the file:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/dataflair/hdata</value>
</property>

The fs.defaultFS property specifies the location of the NameNode; here the NameNode runs on localhost at port 9000. The hadoop.tmp.dir property specifies the base directory where Hadoop stores its temporary as well as permanent data.

Note: Take care to enter the correct path. "/home/dataflair/hdata" is my location; specify a directory where you have read and write privileges.

After adding the above parameters, save the file (in nano, press "Ctrl+X" and confirm).
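The directory named in hadoop.tmp.dir is not necessarily created for you, so it is safest to create it up front (adjust the path to your own location):

 mkdir -p /home/dataflair/hdata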

d) Edit hdfs-site.xml

Now we need to edit the Hadoop configuration file hdfs-site.xml (located in HADOOP_HOME/etc/hadoop) by executing the below command:

dataflair@ubuntu:~/hadoop-2.6.0-cdh5.5.1/etc/hadoop$ nano hdfs-site.xml

Add the following entries between the <configuration> </configuration> tags at the end of the file:

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

The dfs.replication property specifies the replication factor; since this is a single-node cluster, we set it to 1.

After adding the above parameters, save the file (in nano, press "Ctrl+X" and confirm).

e) Edit mapred-site.xml

Now we need to edit the Hadoop configuration file mapred-site.xml (located in HADOOP_HOME/etc/hadoop).

Note: mapred-site.xml does not exist by default; we first create it as a copy of mapred-site.xml.template using the following command:

dataflair@ubuntu:~/hadoop-2.6.0-cdh5.5.1/etc/hadoop$ cp mapred-site.xml.template mapred-site.xml

We will now edit the mapred-site.xml file by using the following command:

dataflair@ubuntu:~/hadoop-2.6.0-cdh5.5.1/etc/hadoop$ nano mapred-site.xml

Add the following entries between the <configuration> </configuration> tags at the end of the file:

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

The mapreduce.framework.name property specifies which framework is used for MapReduce; here it is set to yarn.

f) Edit yarn-site.xml

Now we need to edit the Hadoop configuration file yarn-site.xml (located in HADOOP_HOME/etc/hadoop) by executing the below command:

dataflair@ubuntu:~/hadoop-2.6.0-cdh5.5.1/etc/hadoop$ nano yarn-site.xml

Add the following entries between the <configuration> </configuration> tags at the end of the file:

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

The "yarn.nodemanager.aux-services" property specifies the auxiliary service that needs to run alongside the NodeManager; here the MapReduce shuffle service is used. The "yarn.nodemanager.aux-services.mapreduce.shuffle.class" property specifies the class that implements the shuffle service.

IV. Start the Hadoop Cluster

a) Format the namenode

dataflair@ubuntu:~$ hdfs namenode -format

NOTE: This activity should be performed only once, right after installation. If the command is run again, all the data is deleted and the NameNode is assigned a new namespaceID; if the NameNode and DataNode IDs no longer match, the DataNode daemons will not start when the daemons are brought up.
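If you do have to format the NameNode again on a test installation where the data is disposable, a common workaround for the namespaceID mismatch, assuming hadoop.tmp.dir is set as above, is to stop the daemons and clear that directory before re-formatting:

 stop-dfs.sh
 rm -rf /home/dataflair/hdata/*
 hdfs namenode -format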

b) Start HDFS Daemons

dataflair@ubuntu:~$ start-dfs.sh

c) Start YARN Daemons

dataflair@ubuntu:~$ start-yarn.sh

d) Check the status of services

dataflair@ubuntu:~$ jps
NameNode
DataNode
ResourceManager
NodeManager

If the above output appears after running the jps command (a SecondaryNameNode and the Jps process itself may also be listed), Hadoop has been installed successfully, and HDFS and MapReduce operations can now be performed.
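You can also open the web interfaces to confirm that the daemons are up; for Hadoop 2.x the default addresses are:

 http://localhost:50070   (NameNode web UI)
 http://localhost:8088    (ResourceManager web UI)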
