What are configuration files in Apache Hadoop?


  • #6223
    DataFlair Team

      Name the different configuration files in Hadoop.
      List the Hadoop configuration files.

    • #6225
      DataFlair Team

      Edit the following core Hadoop configuration files to set up the cluster.
      • hadoop-env.sh
      • core-site.xml
      • hdfs-site.xml
      • mapred-site.xml
      • masters
      • slaves
      The HADOOP_HOME directory (the extracted directory, e.g. hadoop-2.6.0-cdh5.5.1, is referred to as HADOOP_HOME) contains all the libraries, scripts, configuration files, etc.
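
      For orientation, a typical HADOOP_HOME layout looks roughly like this (a sketch; exact contents vary by distribution):

      hadoop-2.6.0-cdh5.5.1/
        bin/         command-line tools (hadoop, hdfs, yarn)
        etc/hadoop/  the configuration files discussed below
        lib/         native libraries
        sbin/        start/stop scripts (start-dfs.sh, start-yarn.sh, ...)
        share/       Hadoop jars and documentation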

      hadoop-env.sh

      1. This file specifies the environment variables that affect the JDK used by the Hadoop daemons (bin/hadoop).
      As the Hadoop framework is written in Java and uses the Java Runtime Environment, one of the important environment variables for the Hadoop daemons is $JAVA_HOME in hadoop-env.sh.

      2. This variable points the Hadoop daemons to the Java path on the system.
      Default: export JAVA_HOME=<path-to-the-root-of-your-Java-installation>
      Change to (for example): export JAVA_HOME=/usr/lib/jvm/java-8-oracle
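
      If you are not sure where Java lives on your system, one common way to find it (assuming java is on the PATH) is:

      Hdata@ubuntu:~$ readlink -f $(which java)

      This prints a path such as /usr/lib/jvm/java-8-oracle/jre/bin/java; JAVA_HOME should point to the root of the installation (here /usr/lib/jvm/java-8-oracle), not to the java binary itself.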
      core-site.xml
      3. This file informs the Hadoop daemons where the NameNode runs in the cluster. It contains the configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.

      <configuration>
      <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:9000</value>
      </property>
      <property>
      <name>hadoop.tmp.dir</name>
      <value>/home/dataflair/Hadmin</value>
      </property>
      </configuration>

       The location of the NameNode is specified by the fs.defaultFS property;
       here the NameNode runs on localhost at port 9000.
       The hadoop.tmp.dir property specifies the location where Hadoop stores its temporary as well as permanent data.
       “/home/dataflair/Hadmin” is my location; here you need to specify a location where you have read/write privileges.
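
      To apply these changes, open core-site.xml (also located in HADOOP_HOME/etc/hadoop) in an editor, e.g.:

      Hdata@ubuntu:~/hadoop-2.6.0-cdh5.5.1/etc/hadoop$ nano core-site.xml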

      hdfs-site.xml
       We need to make changes in the Hadoop configuration file hdfs-site.xml (which is located in HADOOP_HOME/etc/hadoop) by executing the below command:
      Hdata@ubuntu:~/hadoop-2.6.0-cdh5.5.1/etc/hadoop$ nano hdfs-site.xml

      Replication factor

      <configuration>
      <property>
      <name>dfs.replication</name>
      <value>1</value>
      </property>
      </configuration>

       The replication factor is specified by the dfs.replication property;
       as this is a single-node cluster, we set the replication to 1 (a multi-node example is sketched below).
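
      On a multi-node cluster you would normally leave this at the HDFS default of 3, or set it explicitly, for example:

      <property>
      <name>dfs.replication</name>
      <value>3</value>
      </property>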

      mapred-site.xml
       We need to make changes in the Hadoop configuration file mapred-site.xml (which is located in HADOOP_HOME/etc/hadoop).
       Note: In order to edit the mapred-site.xml file we first need to create a copy of the file mapred-site.xml.template. A copy of this file can be created using the following command:
      Hdata@ubuntu:~/hadoop-2.6.0-cdh5.5.1/etc/hadoop$ cp mapred-site.xml.template mapred-site.xml
       We will now edit the mapred-site.xml file by using the following command:
      Hdata@ubuntu:~/hadoop-2.6.0-cdh5.5.1/etc/hadoop$ nano mapred-site.xml
      Changes

      <configuration>
      <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
      </property>
      </configuration>

      In order to specify which framework should be used for MapReduce, we use the mapreduce.framework.name property; here it is set to yarn.

      yarn-site.xml
      Changes

      <configuration>
      <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
      </property>
      <property>
      <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
      <value>org.apache.hadoop.mapred.ShuffleHandler</value>
      </property>
      </configuration>

       The “yarn.nodemanager.aux-services” property specifies the auxiliary service that needs to run alongside the NodeManager; here shuffling is used as the auxiliary service.
       The “yarn.nodemanager.aux-services.mapreduce_shuffle.class” property specifies the class that should be used for shuffling.
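
      With these files configured, the usual next steps (assuming HADOOP_HOME/bin and HADOOP_HOME/sbin are on the PATH) are to format the NameNode once and then start the daemons:

      Hdata@ubuntu:~$ hdfs namenode -format
      Hdata@ubuntu:~$ start-dfs.sh
      Hdata@ubuntu:~$ start-yarn.sh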

    • #6227
      DataFlair Team

      Hadoop has the following configuration files:

      -> hadoop-env.sh
      -> core-site.xml
      -> hdfs-site.xml
      -> mapred-site.xml
      -> masters
      -> slaves
      -> yarn-site.xml

      Note: We need to configure the first four config files if we are setting up Hadoop on a single-node cluster.

      I have a single-node Hadoop cluster installed and configured, and I will use my configuration files to explain:

      1) core-site.xml

      This file defines the port number, memory, memory limits, and size of the read/write buffers used by Hadoop. Find this file in the etc/hadoop directory.

      <configuration>
        <property>
          <name>fs.default.name</name>
          <value>hdfs://localhost:9000</value>
        </property>
      </configuration>

      Here the hostname and port are the machine and port on which the NameNode daemon runs and listens.
      This sets the URI for all filesystem requests in Hadoop and informs the Hadoop daemons where the NameNode runs in the cluster. (fs.default.name is the older, deprecated name of the fs.defaultFS property used in the answer above.)

      2) hdfs-site.xml

      This is the main configuration file for HDFS. It defines the NameNode and DataNode paths as well as the replication factor. Find this file in the etc/hadoop directory.

      <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
        <property>
            <name>dfs.name.dir</name>
            <value>file:///home/hadoop/hdfs/namenode</value>
        </property>
        <property>
            <name>dfs.data.dir</name>
            <value>file:///home/hadoop/hdfs/datanode</value>
        </property>
      </configuration>

      Notice how we set the replication factor via the dfs.replication property. We define the NameNode path dfs.name.dir to point to an HDFS directory under the hadoop user's folder, and we point the DataNode path dfs.data.dir to a similar destination. (dfs.name.dir and dfs.data.dir are the older names of dfs.namenode.name.dir and dfs.datanode.data.dir.)
      hdfs-site.xml is also a client configuration file needed to access HDFS, so it needs to be placed on every node that has an HDFS role running. It contains settings for the Active and Secondary NameNodes and the DataNodes, and it can be used to set defaults for the WebUI address and port, block replication count and reporting interval, DataNode port and total server threads, permission checking on HDFS, etc.

      The value “true” for the dfs.permissions property enables permission checking in HDFS, and the value “false” turns permission checking off.
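
      For example, to disable permission checking in a test setup, you could add a fragment like this to hdfs-site.xml:

      <property>
          <name>dfs.permissions</name>
          <value>false</value>
      </property>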

      3) yarn-site.xml

      YARN is the resource management platform for Hadoop. To configure YARN, find the yarn-site.xml file in the etc/hadoop directory.

      <configuration>
        <property>
          <name>yarn.nodemanager.aux-services</name>
          <value>mapreduce_shuffle</value>
        </property>
      </configuration>

      The yarn.nodemanager.aux-services property tells the NodeManagers that there will be an auxiliary service called mapreduce_shuffle that they need to implement. After we tell the NodeManagers to implement that service, we give it a class name as the means to implement it. This particular configuration tells MapReduce how to do its shuffle. Because NodeManagers won't shuffle data for a non-MapReduce job by default, we need to configure such a service for MapReduce.
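
      The class name is supplied through a companion property; a sketch of that fragment for yarn-site.xml (matching the answer above):

      <property>
          <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
          <value>org.apache.hadoop.mapred.ShuffleHandler</value>
      </property>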

      4) mapred-site.xml

      This file defines the MapReduce framework for Hadoop.

      <configuration>
        <property>
          <name>mapreduce.framework.name</name>
           <value>yarn</value>
        </property>
      </configuration>

      A new configuration option for Hadoop 2 is the capability to specify a framework name for MapReduce, setting the mapreduce.framework.name property. In this install, we will use the value of “yarn” to tell MapReduce that it will run as a YARN application.

      Please visit https://hadoop.apache.org/docs/r2.7.2/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml for more properties of mapred-site.xml.

      5) hadoop-env.sh

      This file specifies the environment variables that affect the JDK used by the Hadoop daemons (bin/hadoop).
      As the Hadoop framework is written in Java and uses the Java Runtime Environment, one of the important environment variables for the Hadoop daemons is $JAVA_HOME in hadoop-env.sh. This variable points the Hadoop daemons to the Java path on the system.
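
      For example (the exact path is an assumption and depends on where Java is installed on your system):

      export JAVA_HOME=/usr/lib/jvm/java-8-oracle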

      6) masters:
      This file informs the Hadoop daemons about the location of the Secondary NameNode. The ‘masters’ file on the master server contains the hostname of the Secondary NameNode server.
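
      A sketch of a masters file (the hostname below is purely illustrative):

      secondarynamenode-host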

      7) slaves:
      The ‘slaves’ file on the master node contains a list of hosts, one per line, that are to run DataNode daemons.

      The ‘slaves’ file on a slave server contains the IP address of that slave node.
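
      A sketch of a slaves file on the master node (hostnames are purely illustrative):

      slave1
      slave2
      slave3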
