Install Pivotal Hadoop v-2 Cluster in production


1. Introduction

Install Hadoop (HDFS, YARN), Pig, Hive, HBase, Mahout, Zookeeper with production configurations.

Deploy a Pivotal Hadoop 2 cluster with Pivotal Command Center (PCC) and icm_client. PCC is the deployment, monitoring, management, and maintenance tool that comes from Pivotal (EMC). The installation steps below are the ones we used for a cluster deployed in production. We will set up NameNode high availability, so that if the NameNode crashes, the standby takes over automatically. The following guide covers installation of Hadoop (HDFS, YARN), Pig, Hive, HBase, Mahout, and Zookeeper.


2. Recommended Platform

This installation uses the following platform:
A) CentOS 6.4 (RHEL or other versions work as well)
B) 14-node cluster with the following layout:

  • 1 master (NameNode): should have a higher configuration than the slaves. More RAM is recommended, since the NameNode keeps all metadata in memory, while less disk space is needed.
  • 1 standby master: must have the same configuration as the master; when the master crashes, this standby automatically becomes the active master.
  • 1 client node: all the client libraries (Pig, Hive, HBase, Mahout) are installed here, and every component of the cluster (Hadoop and its sub-projects) can be accessed from this node. This is also useful when you want to share the cluster with other teams: share this client machine with them instead of the master.
  • 10 slaves (DataNodes): these servers can be of lower configuration (less RAM/CPU), but should have multiple high-capacity disks.
  • 1 admin node: PCC (Pivotal Command Center) and ICM (Installation and Configuration Manager) will be installed on this machine.

C) PHD-2.X

3. Prerequisites


3.1. Install Java

Ensure that you are running Oracle JAVA JDK version 1.7 on the Admin node. Version 1.7 is required; version 1.7u15 is recommended.
If you are not running the correct JDK, download a supported version from the Oracle site at http://www.oracle.com/technetwork/java/javase/downloads/index.html
[php]/usr/java/default/bin/java -version[/php]
The output of this command should contain 1.7 (version number) and Java HotSpot(TM) (Java VM). For example:
[php]java version "1.7.0_45"
Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)[/php]
NOTE: Make sure you are not running OpenJDK as your default JDK. If you are running OpenJDK, we recommend removing it.
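If OpenJDK is present, here is a minimal cleanup-and-install sketch; the OpenJDK package names and the Oracle JDK RPM filename below are examples only, so adjust them to what is actually installed on your system:
[php]# List any OpenJDK packages currently installed
rpm -qa | grep -i openjdk
# Remove them (package names are examples; use the names reported above)
yum remove java-1.7.0-openjdk java-1.6.0-openjdk
# Install the Oracle JDK RPM downloaded from the Oracle site and verify
rpm -ivh jdk-7u45-linux-x64.rpm
/usr/java/default/bin/java -version[/php]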

3.2. Update hosts file

Add an entry for every machine to the hosts file on all nodes. Edit the hosts file (nano /etc/hosts) and add the following entries, each preceded by the node's IP address:
[php]admin.domain.com admin
client.domain.com client
hdnn1.domain.com hdnn1
hdnn2.domain.com hdnn2
hddn1.domain.com hddn1
(do this for all the slaves, i.e. hddn1, hddn2, hddn3 ... hddn10)[/php]
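To avoid editing every node by hand, the file can be copied out from the admin node once it is complete; a sketch, assuming root SSH access to all nodes and the short hostnames listed above:
[php]for host in client hdnn1 hdnn2 hddn1 hddn2 hddn3 hddn4 hddn5 hddn6 hddn7 hddn8 hddn9 hddn10; do
  scp /etc/hosts root@${host}:/etc/hosts
done[/php]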

3.3. Change system parameters

3.3.1. Change hostname

If a hostname contains capital letters, change it.
Avoid hostnames that contain capital letters, because Puppet has an issue generating certificates for domains with capital letters. Also avoid underscores, as they are invalid characters in hostnames.
[php]nano /etc/sysconfig/network
HOSTNAME=admin.domain.com[/php]
NOTE: Do this on every machine in the cluster, and make sure you set the corresponding hostname in each machine's /etc/sysconfig/network file.
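The new name takes effect after a reboot (or after setting it manually with the hostname command). A quick check that every node reports the expected lowercase FQDN, assuming root SSH access from the admin node:
[php]for host in admin client hdnn1 hdnn2 hddn1 hddn2 hddn3 hddn4 hddn5 hddn6 hddn7 hddn8 hddn9 hddn10; do
  ssh root@${host} hostname -f
done[/php]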

3.3.2. Disable SELinux:

Check the current SELinux status:
[php]sestatus[/php]
If the output shows SELinux as enabled:
[php]sestatus
SELinux status: enabled
SELinuxfs mount: /selinux
Current mode: enforcing
Mode from config file: enforcing
Policy version: 24
Policy from config file: targeted[/php]
then edit /etc/selinux/config and disable it:
[php]nano /etc/selinux/config
SELINUX=disabled[/php]
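The config change only takes full effect after a reboot. To stop SELinux enforcing right away, a sketch (setenforce 0 only switches to permissive mode for the current session):
[php]# Switch to permissive mode for the current session
setenforce 0
# Persist the change (same effect as the manual edit above)
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
sestatus[/php]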

3.3.3. Update Operating System Parameters

Apply the following operating-system configuration parameters to get the best performance:

3.3.3.1. Edit /etc/sysconfig/cpuspeed

Edit /etc/sysconfig/cpuspeed and set the following parameter:
[php]nano /etc/sysconfig/cpuspeed
GOVERNOR=performance[/php]

3.3.3.2. Stop cpuspeed service

[php]service cpuspeed stop
Disabling ondemand cpu frequency scaling: [ OK ][/php]

3.3.3.3. Stop iptables service

[php]service iptables stop
iptables: Firewall is not running.[/php]

3.3.3.4. Stop ip6tables service

[php]service ip6tables stop[/php]

3.3.3.5. Disable iptables at boot

[php]chkconfig iptables off[/php]

3.3.3.6. Disable ip6tables at boot

[php]chkconfig ip6tables off[/php]

3.3.3.7. Disable cpuspeed at boot

[php]chkconfig cpuspeed off[/php]
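A quick way to confirm that the three services are stopped and will not come back at boot, using standard CentOS 6 tooling:
[php]service iptables status
service ip6tables status
chkconfig --list | egrep 'iptables|ip6tables|cpuspeed'[/php]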

3.3.3.8. Edit /etc/sysctl.conf (OPTIONAL)

Edit /etc/sysctl.conf and add the following parameters:
[php]nano /etc/sysctl.conf
# COMMENT FOLLOWING PARAMETERS:
# Disable netfilter on bridges.
#net.bridge.bridge-nf-call-ip6tables = 0
#net.bridge.bridge-nf-call-iptables = 0
#net.bridge.bridge-nf-call-arptables = 0

# ADD FOLLOWING PARAMETERS:
kernel.shmmax = 500000000
kernel.shmmni = 4096
kernel.shmall = 4000000000
kernel.sem = 250 512000 100 2048
kernel.sysrq = 1
kernel.core_uses_pid = 1
kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.msgmni = 2048
net.ipv4.tcp_syncookies = 1
net.ipv4.ip_forward = 0
net.ipv4.conf.default.accept_source_route = 0
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.conf.all.arp_filter = 1
net.ipv4.ip_local_port_range = 1025 65535
net.core.netdev_max_backlog = 10000
net.core.rmem_max = 2097152
net.core.wmem_max = 2097152
vm.overcommit_memory = 2[/php]
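The new kernel parameters can be loaded without a reboot; a short verification sketch:
[php]# Reload /etc/sysctl.conf and spot-check one value
sysctl -p
sysctl vm.overcommit_memory[/php]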

3.3.3.9. Edit /etc/security/limits.conf

Edit /etc/security/limits.conf and make the following changes:
[php]nano /etc/security/limits.conf
# COMMENT FOLLOWING PARAMETERS:
#* soft nofile 65536
#* hard nofile 65536
#* soft nproc 131072
#* hard nproc 131072[/php]

3.3.3.10. Edit /etc/security/limits.d/90-nproc.conf

Edit /etc/security/limits.d/90-nproc.conf and make the following changes:
[php]nano /etc/security/limits.d/90-nproc.conf
# COMMENT FOLLOWING PARAMETERS:
#root soft nproc unlimited

# ADD FOLLOWING PARAMETERS:
* soft nofile 65536
* hard nofile 65536
* soft nproc 131072
* hard nproc 131072[/php]

3.3.3.11. Edit /etc/rc.d/rc.local

Edit /etc/rc.d/rc.local and add the following command:
[php]nano /etc/rc.d/rc.local
blockdev --setra 16384 /dev/sd*[/php]
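rc.local only runs at boot, so the read-ahead value can also be applied immediately; a sketch (the device name /dev/sda is an example, use one of your data disks):
[php]blockdev --setra 16384 /dev/sd*
# Verify the setting on one disk
blockdev --getra /dev/sda[/php]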

3.4. Install required packages

[php]yum install httpd mod_ssl postgresql postgresql-devel postgresql-server postgresql-jdbc createrepo sigar sudo[/php]

4. Install PCC (Pivotal Command Centre)

NOTE: Configure the FQDN on the admin node by adding it to /etc/hosts, then verify:
[php]$ cat /etc/hosts | grep admin
admin.domain.com admin
$ hostname
admin
$ hostname -f
admin.domain.com[/php]
Untar PCC:
[php]tar -xvf PCC-2.2.1-150.x86_64.tar[/php]
Install Pivotal Command Center by running the installer from the extracted PCC directory:
[php]./install[/php]
NOTE: Verify that Puppet can resolve the FQDN (the output should not be blank):
[php]facter --puppet fqdn[/php]
Check whether the commander has started:
[php]service commander status[/php]
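If the commander service is up, the PCC web UI should also answer on port 5443 of the admin node; a quick check (the -k flag skips validation of the self-signed certificate):
[php]curl -k -I https://admin.domain.com:5443[/php]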

5. Deploy Pivotal Hadoop

NOTE: Run the following commands as the "gpadmin" user.
Untar Hadoop:
[php]tar xzf PHD-2.0.1.0-148.tar[/php]
Import the Pivotal HD tarball:
[php]icm_client import -s PHD-2.0.1.0-148[/php]
Fetch the default cluster configuration template:
[php]icm_client fetch-template -o ~/ClusterConfigDir[/php]
Edit clusterConfig.xml in ~/ClusterConfigDir and set all the required parameters:
[php]nano clusterConfig.xml
# It should look like:
<?xml version="1.0" encoding="UTF-8"?>
<clusterConfig>
<!-- Cluster Name. This will be used on command lines, so avoid spaces and special characters. -->
<clusterName>my-hd-cluster</clusterName>
<gphdStackVer>PHD-2.0.1.0</gphdStackVer>
<!-- Comma separated list. Choices are: hdfs,yarn,zookeeper,hbase,hive,hawq,gpxf,pig,mahout -->
<!-- services not meant to be installed can be deleted from the following list. -->
<!-- <services>hdfs,yarn,zookeeper,hbase,hive,hawq,gpxf,pig,mahout</services> -->
<services>hdfs,yarn,zookeeper,hbase,hive,pig,mahout</services>
<!-- Hostname of the machine which will be used as the client -->
<!-- ICM will install Pig, Hive, HBase and Mahout libraries on this machine. -->
<client>phdnn1.domain.com</client>
<!-- Comma separated list of fully qualified hostnames should be associated with each service role -->
<!-- services (and corresponding roles) not meant to be installed can be deleted from the following list -->
<hostRoleMapping>
<hdfs>
<!-- HDFS, NameNode role (mandatory) -->
<namenode>phdnn1.domain.com</namenode>
<!-- HDFS, DataNode role (mandatory) ** list of one or more, comma separated,
names must be resolvable ** -->
<datanode>phddn1.domain.com,phddn2.domain.com,phddn3.domain.com,
phddn4.domain.com,phddn5.domain.com,phddn6.domain.com,phddn7.domain.com,
phddn8.domain.com,phddn9.domain.com,phddn10.domain.com</datanode>
<!-- HDFS, Secondary NameNode role (optional) - ** comment out XML tag if not
required ** -->
<!-- <secondarynamenode>host.yourdomain.com</secondarynamenode> -->
<!-- HDFS, Standby NameNode that acts as a failover when HA is enabled -->
<standbynamenode>phdnn2.domain.com</standbynamenode>
<!-- specify quorum journal nodes used as common storage by both active and
stand-by namenode. list of one or more comma separated -->
<journalnode>phdnn1.domain.com,phdnn2.domain.com,
phddn1.domain.com</journalnode>
</hdfs>
<yarn>
<!-- YARN, ResourceManager role (mandatory) -->
<yarn-resourcemanager>phdnn1.domain.com</yarn-resourcemanager>
<!-- YARN, NodeManager role (mandatory) ** list of one or more,
comma separated, generally takes on the DataNode element ** -->
<yarn-nodemanager>phddn1.domain.com,phddn2.domain.com,phddn3.domain.com,
phddn4.domain.com,phddn5.domain.com,phddn6.domain.com,phddn7.domain.com,
phddn8.domain.com,phddn9.domain.com,phddn10.domain.com</yarn-nodemanager>
<!-- Generally runs on the ResourceManager (mandatory) -->
<mapreduce-historyserver>phdnn1.domain.com</mapreduce-historyserver>
</yarn>
<zookeeper>
<!-- ZOOKEEPER, zookeeper-server role (required if using HBase) **
list an *ODD NUMBER*, comma separated ** -->
<!-- Options to run zookeepers: master nodes like the Secondary NameNode,
ResourceManager, HAWQ standby node, etc., EXCEPT the NameNode -->
<zookeeper-server>phdnn1.domain.com,phdnn2.domain.com,
phddn1.domain.com</zookeeper-server>
</zookeeper>
<hbase>
<!-- HBASE, master role -->
<hbase-master>phdnn1.domain.com</hbase-master>
<!-- HBASE, region server role ** list of one or more, comma separated ** -->
<!-- HBase region servers are generally deployed on the DataNode hosts -->
<hbase-regionserver>phddn1.domain.com,phddn2.domain.com,phddn3.domain.com,
phddn4.domain.com,phddn5.domain.com,phddn6.domain.com,phddn7.domain.com,
phddn8.domain.com,phddn9.domain.com,phddn10.domain.com</hbase-regionserver>
</hbase>
<hive>
<!-- HIVE, server role -->
<!-- Hive thrift server would be running on this machine. -->
<hive-server>phdnn1.domain.com</hive-server>
<!-- HIVE, metastore role. This host will run an instance of PostgreSQL. -->
<!-- Postgres database is installed on this node to maintain Hive metadata. -->
<hive-metastore>phdnn1.domain.com</hive-metastore>
</hive>
<hawq>
<!-- HAWQ, hawq master role (mandatory). This should be a different host
than the NameNode host. -->
<!-- <hawq-master>host.yourdomain.com</hawq-master> -->
<!-- HAWQ, hawq standbymaster role (optional) - ** comment out XML tag if not required ** -->
<!-- <hawq-standbymaster>host.yourdomain.com</hawq-standbymaster> -->
<!-- HAWQ, hawq segments (mandatory) ** list of one or more, comma separated ** -->
<!-- In many configurations, this will be the same list as the DataNode role. -->
<!-- <hawq-segment>host.yourdomain.com</hawq-segment> -->
</hawq>
</hostRoleMapping>
<servicesConfigGlobals>
<!-- List of one or more, comma separated, disk mount points on slave
nodes (datanode/nodemanager/regionservers) -->
<datanode.disk.mount.points>/data1/hdata,/data2/hdata,/data3/hdata,/data4/hdata,
/data5/hdata,/data6/hdata</datanode.disk.mount.points>
<!-- List of one or more, comma separated, disk mount points on NameNode. More than one can be configured for NameNode local data redundancy. -->
<namenode.disk.mount.points>/data1/hdata</namenode.disk.mount.points>
<!-- List of one or more, comma separated disk mount points on secondary-namenode. More than one can be configured for secondary-namenode local data redundancy. -->
<!-- <secondary.namenode.disk.mount.points>/data/secondary_nn
</secondary.namenode.disk.mount.points> -->
<!-- Max. memory on nodemanager collectively available to all the task containers
including application master. -->
<yarn.nodemanager.resource.memory-mb>49152</yarn.nodemanager.resource.memory-mb>
<!-- Min memory required for the task or application manager container -->
<yarn.scheduler.minimum-allocation-mb>2048</yarn.scheduler.minimum-allocation-mb>
<!-- THIS IS THE "NameNode port". -->
<dfs.port>8020</dfs.port>
<!-- This syntax correlates to a bash array data structure. There are no commas separating
the values, just whitespace. Each entry results in a HAWQ segment placed on a host; to get
two segments per host, make two entries with the same value (which may or may not end up
on the same disk within the host). In the example below, two segments would be created on
every HAWQ segment host. -->
<!-- <hawq.segment.directory>(/data1/primary /data1/primary)</hawq.segment.directory> -->
<!-- THIS IS A SINGLE VALUE, AND IDEALLY THE DIRECTORY SPECIFIED
CORRELATES WITH A RAID VOLUME, AS THE
MASTER DATA DIRECTORY CONTAINS IMPORTANT METADATA. -->
<!-- <hawq.master.directory>/data1/master</hawq.master.directory> -->
<!-- IDEALLY, THIS DIRECTORY IS ON ITS OWN DEVICE, TO MINIMIZE I/O
CONTENTION BETWEEN ZOOKEEPER
AND OTHER PROCESSES. -->
<zookeeper.data.dir>/data1/hdata/zookeeper</zookeeper.data.dir>
<!-- Zookeeper Client Port -->
<zookeeper.client.port>2181</zookeeper.client.port>
<!-- JAVA_HOME defaults to this. Update if you have java installed elsewhere -->
<cluster_java_home>/usr/java/latest/</cluster_java_home>
<!-- maximum amount of HEAP to use. Default is 1024 -->
<dfs.namenode.heapsize.mb>4096</dfs.namenode.heapsize.mb>
<dfs.datanode.heapsize.mb>4096</dfs.datanode.heapsize.mb>
<yarn.resourcemanager.heapsize.mb>4096</yarn.resourcemanager.heapsize.mb>
<yarn.nodemanager.heapsize.mb>4096</yarn.nodemanager.heapsize.mb>
<hbase.heapsize.mb>4096</hbase.heapsize.mb>
<dfs.datanode.failed.volumes.tolerated>0</dfs.datanode.failed.volumes.tolerated>
<!-- Choose a logical name for the HA nameservice, for example "test". The name
you choose will be used both for configuration
and as the authority component of absolute HDFS paths in the cluster -->
<nameservices>my-hd-cluster</nameservices>
<!-- Choose ids for the two namenodes being used. These ids will be used in the configuration of properties related to these namenodes -->
<namenode1id>nn1</namenode1id>
<namenode2id>nn2</namenode2id>
<!-- specify the path on quorum journal nodes where the shared data is written -->
<journalpath>/data1/hdata/qjournal</journalpath>
<!-- specify the port where quorum journal nodes should be run -->
<journalport>8485</journalport>
</servicesConfigGlobals>
</clusterConfig>[/php]
Check that the XML is well-formed:
[php]xmlwf clusterConfig.xml[/php]
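xmlwf prints nothing when a file is well-formed and reports the line and column of the first error otherwise. If you also edit the service-level files for HA (Section 5.1), the same check can be run over all of them; a sketch, assuming the relative paths inside ClusterConfigDir:
[php]cd ~/ClusterConfigDir
for f in clusterConfig.xml hdfs/hdfs-site.xml hdfs/core-site.xml yarn/yarn-site.xml hbase/hbase-site.xml; do
  echo "checking $f"
  xmlwf "$f"
done[/php]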

5.1. Enable namenode high-availability

To enable HA, you then need to make HA-specific edits to the following configuration files:

  • clusterConfig.xml
  • hdfs/hdfs-site.xml
  • hdfs/core-site.xml
  • hbase/hbase-site.xml
  • yarn/yarn-site.xml

NOTE: When specifying the nameservices in clusterConfig.xml, do not use underscores ('_'); a name like phd_cluster, for example, is invalid.
a. Edit clusterConfig.xml as follows:

  • Comment out secondarynamenode role in hdfs service
  • Uncomment standbynamenode and journalnode roles in hdfs service
  • Uncomment nameservices, namenode1id, namenode2id, journalpath, and journalport entries in serviceConfigGlobals

b. Edit hdfs/hdfs-site.xml as follows:

  • Uncomment the following properties:

[php]<property>
<name>dfs.nameservices</name>
<value>${nameservices}</value>
</property>
<property>
<name>dfs.ha.namenodes.${nameservices}</name>
<value>${namenode1id},${namenode2id}</value>
</property>
<property>
<name>dfs.namenode.rpc-address.${nameservices}.${namenode1id}</name>
<value>${namenode}:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.${nameservices}.${namenode2id}</name>
<value>${standbynamenode}:8020</value>
</property>
<property>
<name>dfs.namenode.http-address.${nameservices}.${namenode1id}</name>
<value>${namenode}:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.${nameservices}.${namenode2id}</name>
<value>${standbynamenode}:50070</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://${journalnode}/${nameservices}</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.${nameservices}</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence
shell(/bin/true)
</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/hdfs/.ssh/id_rsa</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>${journalpath}</value>
</property>
<!– Namenode Auto HA related properties –>
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<!– END Namenode Auto HA related properties –>[/php]

  • Comment out the following property:

[php]<property>
<name>dfs.namenode.secondary.http-address</name>
<value>${secondarynamenode}:50090</value>
<description>The secondary namenode http server address and port.</description>
</property>[/php]
c. Edit yarn/yarn-site.xml:
[php]<property>
<name>mapreduce.job.hdfs-servers</name>
<value>hdfs://${nameservices}</value>
</property>[/php]
d. Edit hdfs/core-site.xml as follows:

  • Set the following property:

[php]<property>
<name>fs.defaultFS</name>
<value>hdfs://${nameservices}</value>
<description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>[/php]

  • Uncomment the following property:

[php]<property>
<name>ha.zookeeper.quorum</name>
<value>${zookeeper-server}:${zookeeper.client.port}</value>
</property>[/php]
e. Edit hbase/hbase-site.xml as follows:
[php]<property>
<name>hbase.rootdir</name>
<value>hdfs://${nameservices}/apps/hbase/data</value>
<description>The directory shared by region servers and into which HBase persists. The URL should be 'fully-qualified' to include the filesystem scheme. For example, to specify the HDFS directory '/hbase' where the HDFS instance's namenode is running at namenode.example.org on port 9000, set this value to: hdfs://namenode.example.org:9000/hbase. By default HBase writes into /tmp. Change this configuration else all data will be lost on machine restart.</description>
</property>[/php]

6. Deploy the Cluster

Pivotal HD deploys clusters using input from the cluster configuration directory. This directory contains files that describe the topology and configuration of the cluster.
NOTE: Deploy the cluster as gpadmin.
The deploy command internally performs the following steps:
I. Prepares the cluster nodes with the prerequisites (internally runs the preparehosts command):
a) Creates the gpadmin user.
b) As gpadmin, sets up password-less SSH access from the Admin node.
c) Installs the provided Oracle Java JDK.
d) Disables SELinux across the cluster.
e) Optionally synchronizes the system clocks.
f) Installs Puppet version 2.7.20 (the one shipped with the PCC tarball, not the one from the puppetlabs repo).
g) Installs sshpass.
II. Verifies the prerequisites (internally runs the scanhosts command).
III. Deploys the cluster.

6.1. Troubleshooting

You can check the following log files to troubleshoot any failures:
On Admin:
/var/log/gphd/gphdmgr/GPHDClusterInstaller_XXX.log
/var/log/gphd/gphdmgr/gphdmgr-webservices.log
/var/log/messages
/var/log/gphd/gphdmgr/installer.log
On Cluster Nodes:
/tmp/GPHDNodeInstaller_XXX.log
Deploy the Hadoop cluster:
[php]icm_client deploy -c ClusterConfigDir/[/php]

6.2. Start Hadoop Cluster

[php]icm_client start -l my-hd-cluster[/php]
Now check whether the cluster is working correctly.
Log in to the web console on the admin node:
[php]https://admin.domain.com:5443
UserName: gpadmin
Password: Gpadmin1[/php]

6.3. Run sample program:

[php]# create input directory
hadoop fs -mkdir input
# copy some file having text data to run word count
hadoop fs -copyFromLocal /usr/lib/gphd/hadoop/CHANGES.txt input
# run word count
hadoop jar /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount input output
# dump output on console
hadoop fs -cat output/part*[/php]
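Note that the job fails if the output directory already exists; to re-run the example, remove it first:
[php]hadoop fs -rm -r output[/php]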

7. Stop Hadoop Cluster:

[php]icm_client stop -l my-hd-cluster[/php]
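To confirm the cluster state from the admin node, icm_client can list the clusters it manages; a sketch, run as gpadmin:
[php]icm_client list[/php]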

8. Related Links:

Reference: Pivotal
