Setup Hadoop CDH3 on Ubuntu (Single-Node-Cluster)
This is a step-by-step guide to deploying Cloudera Hadoop (CDH3) on Ubuntu. In this tutorial, we discuss the setup and configuration of Cloudera Hadoop CDH3 on Ubuntu running in a virtual machine on Windows. Alternatively, you can watch the video tutorial “Setup Hadoop 1.x on Single Node Cluster on Ubuntu”.
If you are looking to install Hadoop 2.x or Cloudera CDH5 follow this tutorial: Install Cloudera Hadoop CDH5 on Ubuntu
2. What is Hadoop
Hadoop is a framework that allows distributed processing of large data sets across clusters of computers using the MapReduce programming model. It is designed to scale up from single servers to thousands of machines, each offering local computing and storage. Hadoop is 100% open source and pioneered a fundamentally new way of storing and processing data. Instead of relying on expensive, proprietary hardware and separate systems to store and process data, Hadoop enables distributed parallel processing of huge amounts of data across inexpensive, industry-standard servers that both store and process the data, and a cluster scales by adding more servers.
In this tutorial, we discuss the installation of Hadoop CDH3 in pseudo-distributed mode, in which all Hadoop daemons run on a single machine.
The prerequisite for this tutorial is virtualization software (Oracle VirtualBox or VMware) installed on your machine. We will be using Ubuntu, so we first need to install the virtualization software on Windows and then install Ubuntu in a virtual machine.
3.1. Virtual Machine creation (Optional)
This step is optional: if you already have an Ubuntu machine or virtual machine ready, you can skip it.
Create the reference virtual machine with the following parameters. This configuration applies to a VMware virtual machine:
Standard x86-compatible or x86-64-compatible personal computer with a 500 MHz or faster CPU minimum (500 MHz recommended).
3.1.1. Download Virtual environment
Download the virtual machine software. VMware's non-commercial edition is available free of charge from the official VMware website after registration. Download link: https://my.vmware.com/web/vmware/downloads
3.1.2. Installation of Virtual Machine :
Install VMware on the Windows machine, create a VM, and install Ubuntu in it.
3.1.3. Compatible processors include
Intel: Celeron, Pentium II, Pentium III, Pentium 4, Pentium M (including computers with Centrino™ mobile technology), Xeon™ (including “Prestonia”), EM64T
AMD™: Athlon™, Athlon MP, Athlon XP, Athlon 64, Duron™, Opteron™, Turion™ 64
3.1.4. Multiprocessor systems supported
AMD Opteron, AMD Athlon 64, AMD Turion 64, AMD Sempron, Intel EM64T; support for 64-bit guest operating systems is available only on the following specific versions of these processors:
- AMD Athlon 64, revision D or later
- AMD Opteron, revision E or later
- AMD Turion 64, revision E or later
- AMD Sempron, 64-bit-capable revision D or later (experimental support)
- Intel EM64T VT-capable processors (experimental support)
3.1.5. Memory
512 MB minimum for learning purposes (production needs much more RAM). You must have enough memory to run the host operating system and the applications on it, plus the memory required for the guest OS.
3.1.6. Disk Drives
Guest operating systems can reside on physical disk partitions or in virtual disk files. In a production environment, it is recommended to have a larger number of disk mount points, configured as JBOD.
3.1.7. Hard Disk
IDE and SCSI hard drives are supported, up to 950GB capacity. At least 1GB of free disk space is recommended for each guest operating system and the application software used with it; if you use a default setup, the actual disk space needs are approximately the same as those for installing and running the guest operating system and applications on a physical computer.
3.1.8. Host Operating System
VMware Workstation / VirtualBox is available for both Windows and Linux host operating systems, in 32-bit and 64-bit versions.
3.2. Install Ubuntu:
Now that the virtual machine environment is ready, we can proceed to install Ubuntu in the virtual machine.
3.2.1. Download Ubuntu
Download the Ubuntu disc image (Ubuntu 14.04). Ubuntu is open source, so it can be downloaded free of charge.
3.2.2. Ubuntu Installation on Virtual Machine :
You are just a few steps away from having Ubuntu running in your virtual machine. Follow these steps:
- Open VMware and create a New Virtual Machine. A new pop-up window appears.
- Select the second option, Installer disc image file (iso), point it to the location of your Ubuntu disc image on your system, and click Next.
- A new pop-up window appears and asks you to fill in the details:
- Full Name, User Name, Password, and Confirm Password. Then click Next.
- The next pop-up window asks for the location on your system where you want to install Ubuntu. Click Next.
- In the next pop-up window, select the first option and click Next.
- The final window confirms the installation of Ubuntu on the VMware virtual machine.
Before installing Hadoop CDH3 on Ubuntu, we must install Java and configure SSH.
3.3. Install java:
3.3.1. Update the source list
The following command downloads the package lists from the repositories and updates them with information on the newest versions of packages and their dependencies. Run it in the Ubuntu terminal:
$ sudo apt-get update # It will ask for your password, so enter it
3.3.2. Install Open jdk:
Now install Java on Ubuntu by running the following command in the Ubuntu terminal:
# Install Java 6 JDK
$ sudo apt-get install openjdk-6-jdk
After installation, make a quick check whether JDK is correctly set up.
user@ubuntu:~# java -version
java version "1.6.0_20"
3.4. Configure Password-less SSH:
Hadoop requires password-less SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it. For single-node setup of Hadoop, we need to configure SSH access to localhost
3.4.1. Install Open SSH Server-client:
NOTE: we are installing ssh in Ubuntu
$ sudo apt-get install openssh-server openssh-client
3.4.2. Generate Key Pair:
We have to generate an SSH key pair. The -P "" option sets an empty passphrase, so logins with this key do not prompt for one:
$ ssh-keygen -t rsa -P ""
3.4.3. Configure password-less SSH:
Copy the content of the public key to the authorized_keys file.
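One common way to do this is to append the public key to authorized_keys. The sketch below assumes the key pair was written to ssh-keygen's default location (~/.ssh/id_rsa and ~/.ssh/id_rsa.pub); the helper name authorize_key is hypothetical and not part of OpenSSH.

```shell
# Hypothetical helper: append a public key to an authorized_keys file,
# skipping the append if the exact key line is already present.
authorize_key() {
  pub_key_file="$1"
  auth_keys_file="$2"
  mkdir -p "$(dirname "$auth_keys_file")"
  touch "$auth_keys_file"
  # -x: match whole lines, -F: fixed strings, -f: read patterns from the key file
  if ! grep -qxFf "$pub_key_file" "$auth_keys_file"; then
    cat "$pub_key_file" >> "$auth_keys_file"
  fi
  # With its default StrictModes setting, sshd rejects key files that are
  # writable by group or others, so restrict the file to the owner.
  chmod 600 "$auth_keys_file"
}
```

Under the stated assumption, the call would be: authorize_key "$HOME/.ssh/id_rsa.pub" "$HOME/.ssh/authorized_keys". Running it twice leaves a single copy of the key in the file.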
3.4.4. Check by SSH to localhost:
The final step is to test the SSH setup by connecting to your local machine
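The plain check is to run ssh localhost, which on the first connection asks you to accept the host key. Once the host key has been accepted, a non-interactive check can be sketched as follows; the helper name check_passwordless_ssh is hypothetical, while BatchMode and ConnectTimeout are standard OpenSSH client options.

```shell
# Hypothetical helper: succeed only if key-based login works without a prompt.
# BatchMode=yes makes ssh fail instead of asking for a password or passphrase.
check_passwordless_ssh() {
  host="${1:-localhost}"
  # Run the no-op command "true" on the target and return ssh's exit status
  ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true
}
```

A typical use would be: check_passwordless_ssh && echo "password-less SSH works". A non-zero exit status means the key was not accepted or the SSH server is not reachable.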
4. Install Hadoop:
Now that Java is installed and SSH is configured in Ubuntu, we are ready to install Hadoop CDH3.
4.1. Download Hadoop:
Here we download the Hadoop package, which can be extracted at any desired location.
The Cloudera Hadoop tar file can be downloaded from archive.cloudera.com
Untar the tarball:
After the download, untar the package by running the following command in the Ubuntu terminal:
$ tar xzf hadoop-0.20.2-cdh3u5.tar.gz
Go to HADOOP_HOME_DIR:
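A minimal sketch of this step, assuming the tarball unpacks into a directory named after the archive (hadoop-0.20.2-cdh3u5) and that you run it from the directory where you extracted it. The variable names are illustrative only.

```shell
HADOOP_ARCHIVE="hadoop-0.20.2-cdh3u5.tar.gz"
# Strip the .tar.gz suffix to get the directory the archive is assumed to unpack into
HADOOP_HOME_DIR="$PWD/${HADOOP_ARCHIVE%.tar.gz}"
if [ -d "$HADOOP_HOME_DIR" ]; then
  cd "$HADOOP_HOME_DIR"
else
  echo "Not found: $HADOOP_HOME_DIR (run this where you extracted the tarball)" >&2
fi
```

All later commands in this tutorial that start with bin/ or conf/ are relative to this directory.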
4.2. Setup Configuration:
The only required environment variable we have to configure for Hadoop in this tutorial is JAVA_HOME. Open conf/hadoop-env.sh in the editor of your choice and set the JAVA_HOME environment variable to the JDK/JRE 6 directory.
export JAVA_HOME=/usr/lib/jvm/java-6-sun # example path; set this to the root of your Java installation
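If you are unsure where Java is installed, one way to find a candidate for JAVA_HOME is to resolve the java binary through its symlinks and strip the trailing components. This is a sketch only: the helper name java_home_from_path is hypothetical, and the directory layout (bin/java, optionally under jre/) is an assumption about typical JDK packages.

```shell
# Hypothetical helper: derive a JAVA_HOME candidate from a resolved java path
# by stripping the trailing /bin/java and an optional /jre component.
java_home_from_path() {
  java_bin="$1"
  home="${java_bin%/bin/java}"
  home="${home%/jre}"
  printf '%s\n' "$home"
}
```

On Ubuntu, the call could look like: java_home_from_path "$(readlink -f "$(command -v java)")". Here readlink -f follows the /usr/bin/java symlink chain to the real binary. Check that the printed directory exists before putting it in conf/hadoop-env.sh.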
In this section, we configure the directory where Hadoop stores its data files and the HDFS address (host and port) through which the file system is accessed. Our setup will use Hadoop’s Distributed File System, HDFS, even though our little “cluster” only contains our single local machine, so we use localhost as the host.
Edit configuration file conf/core-site.xml and add following entries:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/PATH/WHERE/YOU/HAVE/WRITE/PRIVILEGES</value>
  </property>
</configuration>
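The hadoop.tmp.dir value must point to a directory that the user running Hadoop can write to. A minimal sketch for preparing such a directory follows; the helper name prepare_tmp_dir and any path you pass to it are hypothetical.

```shell
# Hypothetical helper: create the directory intended for hadoop.tmp.dir and
# confirm that the current user can write to it.
prepare_tmp_dir() {
  dir="$1"
  # Succeeds only if the directory exists (or was created) and is writable
  mkdir -p "$dir" && [ -w "$dir" ]
}
```

For example, prepare_tmp_dir "$HOME/hadoop_tmp" creates the directory under your home folder; the absolute path of that directory then goes into the hadoop.tmp.dir value.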
In this section, we set the DFS replication factor. Because this cluster has only one DataNode, the value is 1. Edit the configuration file conf/hdfs-site.xml and add the following entries:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
In this section, we configure the JobTracker host and the port on which it is reached; our setup uses Hadoop’s MapReduce. Other JobTracker-related settings can also be placed in this file. Edit the configuration file conf/mapred-site.xml and add the following entries:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
4.3. Start The Cluster
The first step to starting up your Hadoop installation is formatting the Hadoop filesystem which is implemented on top of the local filesystem of your “cluster”. You need to do this the first time you set up a Hadoop cluster.
Format the name node:
To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command
$ bin/hadoop namenode -format
Caution: Do this only once, when you install Hadoop. Do not format a running Hadoop file system, as you will lose all data currently in HDFS.
Start Hadoop Services:
Starting the Hadoop services brings up the following daemons on your machine: NameNode, DataNode, JobTracker, TaskTracker, and SecondaryNameNode.
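In Hadoop 0.20.x tarballs, the script that starts all of these daemons is bin/start-all.sh. The wrapper below is a sketch, not part of Hadoop: the function name start_cluster is hypothetical, and it assumes you pass the Hadoop home directory (or run it from there).

```shell
# Hypothetical helper: run start-all.sh from a given Hadoop home directory
# (defaults to the current directory) and fail clearly if it is missing.
start_cluster() {
  hadoop_home="${1:-.}"
  start_script="$hadoop_home/bin/start-all.sh"
  if [ ! -x "$start_script" ]; then
    echo "not found: $start_script (pass the Hadoop home directory)" >&2
    return 1
  fi
  "$start_script"
}
```

From inside the Hadoop home directory, start_cluster with no argument is equivalent to running bin/start-all.sh directly.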
Check whether services have been started :
To check whether the single-node cluster is working, run the following command in the terminal:
$ jps
2287 TaskTracker
2149 JobTracker
1938 DataNode
2085 SecondaryNameNode
2349 Jps
1788 NameNode
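Reading the jps listing by eye works for one machine, but the check can also be scripted. The sketch below reads jps output on standard input and reports any of the five expected daemons that is missing; the helper name check_daemons is hypothetical.

```shell
# Hypothetical helper: read `jps` output on stdin and print each expected
# daemon that is absent. Returns non-zero if at least one is missing.
check_daemons() {
  jps_output="$(cat)"
  missing=0
  for daemon in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
    # -w matches whole words, so "NameNode" does not match "SecondaryNameNode"
    if ! printf '%s\n' "$jps_output" | grep -qw "$daemon"; then
      echo "missing: $daemon"
      missing=1
    fi
  done
  return "$missing"
}
```

Typical use is jps | check_daemons, which prints nothing and exits with status 0 when all five daemons are running.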
4.4. Stop The Cluster:
Once your work is finished, stop the cluster to shut down all the daemons running on your machine.
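In Hadoop 0.20.x tarballs, the counterpart of the start script is bin/stop-all.sh. The wrapper below is a sketch under the same assumption as before, namely that you pass the Hadoop home directory or run it from there; the function name stop_cluster is hypothetical.

```shell
# Hypothetical helper: run stop-all.sh from a given Hadoop home directory
# (defaults to the current directory).
stop_cluster() {
  hadoop_home="${1:-.}"
  stop_script="$hadoop_home/bin/stop-all.sh"
  if [ ! -x "$stop_script" ]; then
    echo "not found: $stop_script" >&2
    return 1
  fi
  "$stop_script"
}
```

After it returns, running jps again should list only the Jps process itself.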