Setup Hadoop CDH3 on Ubuntu (Single-Node-Cluster)

1. Objective

This is a step-by-step guide to deploying Cloudera Hadoop (CDH3) on Ubuntu. In this tutorial, we will discuss the setup and configuration of Cloudera Hadoop CDH3 on Ubuntu running in a virtual machine on Windows.

If you are looking to install Hadoop 2.x or Cloudera CDH5, follow this tutorial: Install Cloudera Hadoop CDH5 on Ubuntu

2. What is Hadoop

Hadoop is a framework that allows distributed processing of large data sets across clusters of computers using the MapReduce programming model. It is designed to scale up from a single server to thousands of machines, each offering local computation and storage. Hadoop is open source and pioneered a new way of storing and processing data. Instead of relying on expensive, proprietary hardware and separate systems for storage and processing, Hadoop enables distributed parallel processing of huge amounts of data across inexpensive, industry-standard servers that both store and process the data, and capacity grows by adding more servers.
In this tutorial, we are going to discuss the installation of Hadoop CDH3 in pseudo-distributed mode, in which all Hadoop daemons run on a single machine.

3. Prerequisites

The prerequisite for this tutorial is a virtualization tool such as Oracle VirtualBox or VMware. We will be using Ubuntu, so we first need to install the virtualization software on Windows and then install Ubuntu in a virtual machine.

3.1. Virtual Machine creation (Optional)

This step is optional. If you already have an Ubuntu machine or virtual machine ready, you can skip it.
Create the reference virtual machine with the following parameters. This configuration is for a VMware virtual machine:
Standard x86-compatible or x86-64-compatible personal computer with a 500 MHz or faster CPU.

3.1.1. Download Virtual environment

Download the virtual machine software. A VMware edition that is free for non-commercial use is available from the official VMware website after registration. Download link: https://my.vmware.com/web/vmware/downloads

3.1.2. Installation of Virtual Machine :

Install VMware on the Windows machine, create a virtual machine, and install Ubuntu in it.

3.1.3. Compatible processors include

Intel: Celeron, Pentium II, Pentium III, Pentium 4, Pentium M (including computers with Centrino™ mobile technology), Xeon™ (including “Prestonia”), EM64T
AMD™: Athlon™, Athlon MP, Athlon XP, Athlon 64, Duron™, Opteron™, Turion™ 6

3.1.4. Multiprocessor systems supported

AMD Opteron, AMD Athlon 64, AMD Turion 64, AMD Sempron, Intel EM64T. Support for 64-bit guest operating systems is available only on the following specific versions of these processors:
AMD Athlon 64, revision D or later
AMD Opteron, revision E or later
AMD Turion 64, revision E or later
AMD Sempron, 64-bit-capable revision D or later (experimental support)
Intel EM64T VT-capable processors (experimental support)

3.1.5. Memory

512 MB minimum. This is enough for learning purposes; a production cluster needs much more RAM. You must have enough memory to run the host operating system, the applications on the host, and the guest operating system.

3.1.6. Disk Drives

Guest operating systems can reside on physical disk partitions or in virtual disk files. In a production environment, it is recommended to have multiple disk mount points configured as JBOD.

3.1.7. Hard Disk

IDE and SCSI hard drives are supported, up to 950 GB capacity. At least 1 GB of free disk space is recommended for each guest operating system and the application software used with it. If you use a default setup, the actual disk space needs are approximately the same as those for installing and running the guest operating system and applications on a physical computer.

3.1.8. Host Operating System

VMware Workstation and VirtualBox are available for both Windows and Linux host operating systems, in 32-bit and 64-bit versions.

3.2. Install Ubuntu:

Now that the virtual machine environment is ready, we can proceed to install Ubuntu in the virtual machine.

3.2.1. Download Ubuntu

Download the Ubuntu disc image (Ubuntu 14.04). Ubuntu is open source, so it can be downloaded freely.
http://releases.ubuntu.com/14.04.2/ubuntu-14.04.2-desktop-i386.iso

3.2.2. Ubuntu Installation on Virtual Machine :

You are just a few steps away from having Ubuntu working in your virtual machine. Follow these steps:

Open VMware and create a New Virtual Machine. A new pop-up window appears.
Select the second option, Installer disc image file (iso), point it to the location of the Ubuntu disc image on your system, then click Next.
A new pop-up window asks you to fill in the following details:
Full Name, User Name, Password, and Confirm Password. Then click Next.
The next pop-up window asks for the location on your system where you want to install the Ubuntu virtual machine. Click Next.
In the next pop-up window, select the first option and click Next.
The final window confirms the installation of Ubuntu on the VMware virtual machine.

Before installing Hadoop CDH3 on Ubuntu, we must install Java and configure SSH.

3.3. Install java:

3.3.1. Update the source list

[php]$ sudo apt-get update
# Enter your password when prompted[/php]
This command downloads the package lists from the repositories and updates them with information on the newest versions of packages and their dependencies. Run it in the Ubuntu terminal before installing anything.

3.3.2. Install Open jdk:

Now install Java on Ubuntu by running the following command in the Ubuntu terminal:
[php]# Install Java 6 JDK
$ sudo apt-get install openjdk-6-jdk[/php]
After installation, make a quick check whether JDK is correctly set up.
[php]user@ubuntu:~# java -version
java version "1.6.0_20"
[/php]

3.4. Configure Password-less SSH:

Hadoop requires password-less SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it. For a single-node setup of Hadoop, we need to configure SSH access to localhost.

3.4.1. Install Open SSH Server-client:

NOTE: We are installing the SSH server and client on Ubuntu.
[php]$ sudo apt-get install openssh-server openssh-client[/php]

3.4.2. Generate Key Pair:

We have to generate an SSH key pair. The empty passphrase (-P "") is what allows Hadoop to log in without prompting.
[php]$ ssh-keygen -t rsa -P ""[/php]

3.4.3. Configure password-less SSH:

Append the content of the public key to the authorized_keys file:
[php]$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys[/php]
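Two things commonly go wrong at this step: running the append twice leaves duplicate entries, and with OpenSSH's default StrictModes setting the server may ignore authorized_keys if the file or the .ssh directory is writable by other users. The sketch below is a hypothetical helper (the name authorize_key and its arguments are illustrative, not part of the original tutorial) that appends the key only once and tightens the permissions.

```shell
# Hypothetical helper (not part of the original tutorial): append a public key
# to authorized_keys only if it is not already there, then tighten permissions.
authorize_key() {
    pub_key_file=$1                 # e.g. "$HOME/.ssh/id_rsa.pub"
    ssh_dir=$2                      # e.g. "$HOME/.ssh"
    auth_file="$ssh_dir/authorized_keys"

    [ -r "$pub_key_file" ] || return 1
    mkdir -p "$ssh_dir" && touch "$auth_file" || return 1

    # -x matches whole lines only, -F treats the key as a fixed string
    if ! grep -qxF -- "$(cat "$pub_key_file")" "$auth_file"; then
        cat "$pub_key_file" >> "$auth_file"
    fi

    # Only the owner may read or write the directory and the key list
    chmod 700 "$ssh_dir"
    chmod 600 "$auth_file"
}

# Usage: authorize_key "$HOME/.ssh/id_rsa.pub" "$HOME/.ssh"
```

Calling the function a second time with the same key leaves authorized_keys unchanged.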

3.4.4. Check by SSH to localhost:

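The final step is to test the SSH setup by connecting to your local machine. On the first connection you will be asked to confirm the host key; after that, the login should succeed without asking for a password.
[php]$ ssh localhost[/php]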

4. Install Hadoop:

Now that we have installed Java on Ubuntu and configured SSH, we are ready to install Hadoop CDH3.

4.1. Download Hadoop:

Here we download the Hadoop package, which can be extracted at any desired location.
The Cloudera Hadoop tar file can be downloaded from archive.cloudera.com:
http://archive.cloudera.com/cdh/3/hadoop-0.20.2-cdh3u5.tar.gz
Untar the tarball:
After the download, untar the package by running the following command in the Ubuntu terminal:
[php]$ tar xzf hadoop-0.20.2-cdh3u5.tar.gz[/php]
Go to HADOOP_HOME_DIR:
[php]$ cd hadoop-0.20.2-cdh3u5/[/php]
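An interrupted download leaves a truncated tarball that fails partway through extraction. As a minimal sketch, the hypothetical helper below (the function name and the destination argument are assumptions for illustration) first asks tar to list the archive and extracts it only if the listing succeeds.

```shell
# Hypothetical helper: refuse to unpack an archive that tar cannot read
# (for example a truncated download), then extract it into dest_dir.
unpack_tarball() {
    tarball=$1      # e.g. hadoop-0.20.2-cdh3u5.tar.gz
    dest_dir=$2     # e.g. "$HOME"

    # Listing the contents reads the whole archive without writing anything
    if ! tar tzf "$tarball" > /dev/null 2>&1; then
        echo "Archive is missing or corrupt: $tarball" >&2
        return 1
    fi
    mkdir -p "$dest_dir" && tar xzf "$tarball" -C "$dest_dir"
}

# Usage, after fetching the file from the URL given above:
#   wget http://archive.cloudera.com/cdh/3/hadoop-0.20.2-cdh3u5.tar.gz
#   unpack_tarball hadoop-0.20.2-cdh3u5.tar.gz "$HOME"
```

The destination directory is created only after the archive has passed the check, so a failed download leaves nothing behind to clean up.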

4.2. Setup Configuration:

hadoop-env.sh:
The only required environment variable we have to configure for Hadoop in this tutorial is JAVA_HOME. Open conf/hadoop-env.sh in the editor of your choice and set JAVA_HOME to the root directory of your JDK/JRE 6 installation. The exact path depends on which Java package is installed; the value below is an example.
[php]# Root of your Java installation, for example:
export JAVA_HOME=/usr/lib/jvm/java-6-sun[/php]
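If you are unsure where Java was installed, one way to find a candidate for JAVA_HOME is to resolve the java binary through its symbolic links and strip the trailing bin/java. The helper below is an illustrative sketch; the function name and the example paths in the comments are assumptions, and the result should be checked against your own system before it goes into hadoop-env.sh.

```shell
# Hypothetical helper: turn the resolved path of the java binary into a
# candidate JAVA_HOME by stripping the trailing /bin/java and, if present, /jre.
java_home_from_binary() {
    java_bin=$1
    home=${java_bin%/bin/java}      # drop the /bin/java suffix
    home=${home%/jre}               # drop /jre when java lives in <jdk>/jre/bin
    printf '%s\n' "$home"
}

# Usage on a machine with Java installed (readlink -f follows the chain of
# symbolic links from /usr/bin/java to the real binary):
#   java_home_from_binary "$(readlink -f "$(command -v java)")"
```

For an input such as /usr/lib/jvm/java-6-sun/bin/java the function prints /usr/lib/jvm/java-6-sun, which matches the example value shown above.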
core-site.xml:
In this section, we configure the directory where Hadoop stores its data files and the network port the NameNode listens on. Our setup will use Hadoop’s Distributed File System, HDFS, even though our little “cluster” contains only our single local machine, so we use localhost as the host name.
Edit the configuration file conf/core-site.xml and add the following entries:
[php]<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/PATH/WHERE/YOU/HAVE/WRITE/PRIVILEGES</value>
</property>
</configuration>[/php]
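The hadoop.tmp.dir placeholder above has to be replaced with a real directory that the user running Hadoop can write to. A minimal sketch of preparing such a directory follows; the helper name is hypothetical and the path in the usage comment is only an example.

```shell
# Hypothetical helper: create the directory intended for hadoop.tmp.dir and
# confirm the current user can write to it.
prepare_tmp_dir() {
    tmp_dir=$1      # the value you plan to put in hadoop.tmp.dir
    mkdir -p "$tmp_dir" || return 1
    if [ ! -w "$tmp_dir" ]; then
        echo "No write permission on $tmp_dir" >&2
        return 1
    fi
}

# Usage (example path, choose your own):
#   prepare_tmp_dir "$HOME/hadoop-tmp"
```

Use the same absolute path as the value of hadoop.tmp.dir in conf/core-site.xml.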
hdfs-site.xml:
In this section, we set the HDFS replication factor. Since the cluster has a single DataNode, each block can have only one replica, so the value is 1. Edit the configuration file conf/hdfs-site.xml and add the following entries:
[php]<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>[/php]
mapred-site.xml:
In this section, we configure the host and port of the JobTracker, since our setup will use Hadoop’s MapReduce. The mapred.job.tracker property takes a single host:port pair. Edit the configuration file conf/mapred-site.xml and add the following entries:
[php]<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>[/php]
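Before starting the cluster, it can help to confirm that the three files contain the values you intended. The sketch below is a hypothetical helper that prints the value of one property. It assumes the simple layout used in this tutorial, where each name element is followed directly by its value element, and it is not a general XML parser.

```shell
# Hypothetical helper: print the <value> that follows <name>prop_name</name>
# in a Hadoop configuration file. Prints nothing if the property is absent.
get_property() {
    conf_file=$1    # e.g. conf/core-site.xml
    prop_name=$2    # e.g. fs.default.name

    # Join all lines so that <name> and <value> may sit on separate lines,
    # then capture the text between the <value> tags.
    tr -d '\n' < "$conf_file" |
        sed -n "s:.*<name>$prop_name</name>[^<]*<value>\([^<]*\)</value>.*:\1:p"
}

# Usage from HADOOP_HOME_DIR:
#   get_property conf/core-site.xml fs.default.name
#   get_property conf/hdfs-site.xml dfs.replication
#   get_property conf/mapred-site.xml mapred.job.tracker
```

With the entries shown in this tutorial, the three calls should report hdfs://localhost:9000, 1, and localhost:9001.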

4.3. Start The Cluster

The first step in starting up your Hadoop installation is formatting the Hadoop filesystem, which is implemented on top of the local filesystem of your “cluster”. You need to do this the first time you set up a Hadoop cluster.
Format the name node:
To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command
[php]$ bin/hadoop namenode -format[/php]

Caution: Do this only once, when you install Hadoop. Do not format a running Hadoop filesystem, as you will lose all data currently in the HDFS cluster.
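One way to guard against an accidental second format in a script is to check whether the NameNode directory has already been initialized. The sketch below assumes that a formatted directory contains a current/VERSION file and that dfs.name.dir has not been overridden, in which case it defaults to dfs/name under hadoop.tmp.dir. The function and variable names are hypothetical.

```shell
# Hypothetical guard: succeed if a NameNode directory already holds a
# formatted filesystem (a formatted directory contains current/VERSION).
is_formatted() {
    name_dir=$1     # dfs.name.dir; by default <hadoop.tmp.dir>/dfs/name
    [ -f "$name_dir/current/VERSION" ]
}

# Usage from HADOOP_HOME_DIR, with NAME_DIR set to your own path:
#   is_formatted "$NAME_DIR" || bin/hadoop namenode -format
```

With this guard in place the format command runs only when no existing filesystem is found.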

Start Hadoop Services:
[php]$ bin/start-all.sh[/php]
This will start the cluster with the following daemons on your machine: NameNode, DataNode, JobTracker, TaskTracker, and SecondaryNameNode.
Check whether the services have started:
To check whether the single-node cluster is working, run the following command in the terminal:
[php]$ jps
2287 TaskTracker
2149 JobTracker
1938 DataNode
2085 SecondaryNameNode
2349 Jps
1788 NameNode[/php]
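If some of the five daemons are absent from the jps listing, the cluster did not start completely. The sketch below is a hypothetical helper that reads jps output and prints the names of the missing daemons, which can be handy when the check is repeated after every start.

```shell
# Hypothetical helper: read jps output on stdin and print the name of every
# expected daemon that is absent. Prints nothing when all five are running.
missing_daemons() {
    jps_output=$(cat)
    for daemon in NameNode SecondaryNameNode DataNode JobTracker TaskTracker; do
        # Compare the second column exactly, so that "NameNode" is not
        # satisfied by the "SecondaryNameNode" line.
        if ! printf '%s\n' "$jps_output" |
                awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            printf '%s\n' "$daemon"
        fi
    done
}

# Usage: jps | missing_daemons
```

When a daemon is reported as missing, its log file under the logs directory of the Hadoop installation is usually the first place to look for the cause.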

4.4. Stop The Cluster:

Once your work is finished, you can stop the cluster. To stop all the daemons running on your machine, run the command
[php]$ bin/stop-all.sh[/php]
