What are configuration files in Apache Hadoop?


  • #6223
    DataFlair Team

      Name the different configuration files in Hadoop.
      List the Hadoop configuration files.

    • #6225
      DataFlair Team

      Edit the following core Hadoop configuration files to set up the cluster.
      • hadoop-env.sh
      • core-site.xml
      • hdfs-site.xml
      • mapred-site.xml
      • masters
      • slaves
      The HADOOP_HOME directory (the extracted directory, e.g. hadoop-2.6.0-cdh5.5.1, is referred to as HADOOP_HOME) contains all the libraries, scripts, configuration files, etc.
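
      For orientation, a typical HADOOP_HOME layout looks roughly like this (a sketch; exact contents vary by distribution):

      hadoop-2.6.0-cdh5.5.1/
        bin/         command-line tools (hadoop, hdfs, yarn)
        etc/hadoop/  the configuration files discussed below
        lib/         native libraries
        sbin/        start/stop scripts (start-dfs.sh, start-yarn.sh, ...)
        share/       Hadoop jars and documentation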

      hadoop-env.sh

      1. This file specifies the environment variables that affect the JDK used by the Hadoop daemons (bin/hadoop).
      As the Hadoop framework is written in Java and uses the Java Runtime Environment, one of the important environment variables for the Hadoop daemons is $JAVA_HOME in hadoop-env.sh.

      2. This variable points the Hadoop daemons to the Java path on the system.
      Default: export JAVA_HOME=<path-to-the-root-of-your-Java-installation>
      Change to (for example): export JAVA_HOME=/usr/lib/jvm/java-8-oracle
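
      If you are not sure where Java lives on your system, one common way to find it (assuming java is on the PATH) is:

      Hdata@ubuntu:~$ readlink -f $(which java)

      This prints a path such as /usr/lib/jvm/java-8-oracle/jre/bin/java; JAVA_HOME should point to the root of the installation (here /usr/lib/jvm/java-8-oracle), not to the java binary itself.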
      core-site.xml
      3. This file informs the Hadoop daemons where the NameNode runs in the cluster. It contains the configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.

      <configuration>
      <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:9000</value>
      </property>
      <property>
      <name>hadoop.tmp.dir</name>
      <value>/home/dataflair/Hadmin</value>
      </property>
      </configuration>

       The location of the NameNode is specified by the fs.defaultFS property;
       here the NameNode runs on localhost at port 9000.
       The hadoop.tmp.dir property specifies the location where Hadoop stores its temporary as well as permanent data.
       “/home/dataflair/Hadmin” is my location; here you need to specify a location where you have read/write privileges.
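
      To apply these changes, open core-site.xml (also located in HADOOP_HOME/etc/hadoop) in an editor, e.g.:

      Hdata@ubuntu:~/hadoop-2.6.0-cdh5.5.1/etc/hadoop$ nano core-site.xml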

      hdfs-site.xml
       We need to make changes in the Hadoop configuration file hdfs-site.xml (which is located in HADOOP_HOME/etc/hadoop) by executing the below command:
      Hdata@ubuntu:~/hadoop-2.6.0-cdh5.5.1/etc/hadoop$ nano hdfs-site.xml

      Replication factor

      <configuration>
      <property>
      <name>dfs.replication</name>
      <value>1</value>
      </property>
      </configuration>

       The replication factor is specified by the dfs.replication property;
       as this is a single-node cluster, we set the replication to 1 (a multi-node example is sketched below).
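
      On a multi-node cluster you would normally leave this at the HDFS default of 3, or set it explicitly, for example:

      <property>
      <name>dfs.replication</name>
      <value>3</value>
      </property>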

      mapred-site.xml
       We need to make changes in the Hadoop configuration file mapred-site.xml (which is located in HADOOP_HOME/etc/hadoop).
       Note: In order to edit the mapred-site.xml file we first need to create a copy of the file mapred-site.xml.template. A copy of this file can be created using the following command:
      Hdata@ubuntu:~/hadoop-2.6.0-cdh5.5.1/etc/hadoop$ cp mapred-site.xml.template mapred-site.xml
       We will now edit the mapred-site.xml file by using the following command:
      Hdata@ubuntu:~/hadoop-2.6.0-cdh5.5.1/etc/hadoop$ nano mapred-site.xml
      Changes

      <configuration>
      <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
      </property>
      </configuration>

      In order to specify which framework should be used for MapReduce, we use the mapreduce.framework.name property; here it is set to yarn.

      yarn-site.xml
      Changes

      <configuration>
      <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
      </property>
      <property>
      <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
      <value>org.apache.hadoop.mapred.ShuffleHandler</value>
      </property>
      </configuration>

       The “yarn.nodemanager.aux-services” property specifies the auxiliary service that needs to run alongside the NodeManager; here shuffling is used as the auxiliary service.
       The “yarn.nodemanager.aux-services.mapreduce_shuffle.class” property specifies the class that should be used for shuffling.
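
      With these files configured, the usual next steps (assuming HADOOP_HOME/bin and HADOOP_HOME/sbin are on the PATH) are to format the NameNode once and then start the daemons:

      Hdata@ubuntu:~$ hdfs namenode -format
      Hdata@ubuntu:~$ start-dfs.sh
      Hdata@ubuntu:~$ start-yarn.sh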

    • #6227
      DataFlair Team

      Hadoop has the following configuration files:

      -> hadoop-env.sh
      -> core-site.xml
      -> hdfs-site.xml
      -> mapred-site.xml
      -> masters
      -> slaves
      -> yarn-site.xml

      Note: We need to configure the first four config files if we are setting up Hadoop on a single-node cluster.

      I have a single-node Hadoop cluster installed and configured, and I will use my configuration files to explain:

      1) core-site.xml

      This file defines the port number, memory, memory limits, and size of the read/write buffers used by Hadoop. Find this file in the etc/hadoop directory.

      <configuration>
        <property>
          <name>fs.default.name</name>
          <value>hdfs://localhost:9000</value>
        </property>
      </configuration>

      Here the hostname and port are the machine and port on which the NameNode daemon runs and listens.
      This sets the URI for all filesystem requests in Hadoop and informs the Hadoop daemons where the NameNode runs in the cluster. (fs.default.name is the older, deprecated name of the fs.defaultFS property used in the answer above.)

      2) hdfs-site.xml

      This is the main configuration file for HDFS. It defines the NameNode and DataNode paths as well as the replication factor. Find this file in the etc/hadoop directory.

      <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
        <property>
            <name>dfs.name.dir</name>
            <value>file:///home/hadoop/hdfs/namenode</value>
        </property>
        <property>
            <name>dfs.data.dir</name>
            <value>file:///home/hadoop/hdfs/datanode</value>
        </property>
      </configuration>

      Notice how we set the replication factor via the dfs.replication property. We define the NameNode path dfs.name.dir to point to an HDFS directory under the hadoop user's folder, and we point the DataNode path dfs.data.dir to a similar destination. (dfs.name.dir and dfs.data.dir are the older names of dfs.namenode.name.dir and dfs.datanode.data.dir.)
      hdfs-site.xml is also a client configuration file needed to access HDFS, so it needs to be placed on every node that has an HDFS role running. It contains settings for the Active and Secondary NameNodes and the DataNodes, and it can be used to set defaults for the WebUI address and port, block replication count and reporting interval, DataNode port and total server threads, permission checking on HDFS, etc.

      The value “true” for the dfs.permissions property enables permission checking in HDFS, and the value “false” turns permission checking off.
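
      For example, to disable permission checking in a test setup, you could add a fragment like this to hdfs-site.xml:

      <property>
          <name>dfs.permissions</name>
          <value>false</value>
      </property>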

      3) yarn-site.xml

      YARN is the resource management platform for Hadoop. To configure YARN, find the yarn-site.xml file in the etc/hadoop directory.

      <configuration>
        <property>
          <name>yarn.nodemanager.aux-services</name>
          <value>mapreduce_shuffle</value>
        </property>
      </configuration>

      The yarn.nodemanager.aux-services property tells the NodeManagers that there will be an auxiliary service called mapreduce_shuffle that they need to implement. After we tell the NodeManagers to implement that service, we give it a class name as the means to implement it. This particular configuration tells MapReduce how to do its shuffle. Because NodeManagers won't shuffle data for a non-MapReduce job by default, we need to configure such a service for MapReduce.
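
      The class name is supplied through a companion property; a sketch of that fragment for yarn-site.xml (matching the answer above):

      <property>
          <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
          <value>org.apache.hadoop.mapred.ShuffleHandler</value>
      </property>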

      4) mapred-site.xml

      This file defines the MapReduce framework for Hadoop.

      <configuration>
        <property>
          <name>mapreduce.framework.name</name>
           <value>yarn</value>
        </property>
      </configuration>

      A new configuration option for Hadoop 2 is the capability to specify a framework name for MapReduce, setting the mapreduce.framework.name property. In this install, we will use the value of “yarn” to tell MapReduce that it will run as a YARN application.

      Please visit https://hadoop.apache.org/docs/r2.7.2/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml for more properties of mapred-site.xml.

      5) hadoop-env.sh

      This file specifies the environment variables that affect the JDK used by the Hadoop daemons (bin/hadoop).
      As the Hadoop framework is written in Java and uses the Java Runtime Environment, one of the important environment variables for the Hadoop daemons is $JAVA_HOME in hadoop-env.sh. This variable points the Hadoop daemons to the Java path on the system.
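
      For example (the exact path is an assumption and depends on where Java is installed on your system):

      export JAVA_HOME=/usr/lib/jvm/java-8-oracle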

      6) masters:
      This file informs the Hadoop daemons about the location of the Secondary NameNode. The ‘masters’ file on the master server contains the hostname of the Secondary NameNode server.
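
      A sketch of a masters file (the hostname below is purely illustrative):

      secondarynamenode-host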

      7) slaves:
      The ‘slaves’ file on the master node contains a list of hosts, one per line, that are to run DataNode daemons.

      The ‘slaves’ file on a slave server contains the IP address of that slave node.
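
      A sketch of a slaves file on the master node (hostnames are purely illustrative):

      slave1
      slave2
      slave3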
