What is rack awareness in Hadoop HDFS?

  • Author
    Posts
    • #5465
      DataFlair Team
      Spectator

      What is rack awareness in Hadoop HDFS?
      Why is Rack Awareness needed?

    • #5466
      DataFlair Team
      Spectator

      Rack: a collection of around 40-50 machines. All these machines are connected through the same network switch, so if that switch goes down, every machine in the rack is out of service. We then say the rack is down.

      Rack Awareness was introduced in Apache Hadoop to address this issue. With Rack Awareness, the NameNode chooses DataNodes on the same rack or a nearby rack. The NameNode maintains the rack ID of each DataNode and selects DataNodes based on this rack information. It also ensures that the replicas of a block are not all stored on a single rack. The Rack Awareness algorithm reduces latency and improves fault tolerance.
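      One common way for the NameNode to learn each DataNode's rack ID is a topology script named in the `net.topology.script.file.name` property: Hadoop passes node addresses as arguments and reads one rack ID per address from standard output. The following is a minimal sketch of such a script, not a definitive implementation. The `RACK_BY_HOST` table, its addresses, and the `resolve` helper are assumptions made up for illustration.

```python
#!/usr/bin/env python3
"""Sketch of a Hadoop topology script: maps host addresses to rack IDs."""
import sys

# Hypothetical address-to-rack table; a real cluster would supply its own mapping.
RACK_BY_HOST = {
    "10.0.1.11": "/rack1",
    "10.0.1.12": "/rack1",
    "10.0.2.21": "/rack2",
}

# Rack ID used here for hosts missing from the table.
DEFAULT_RACK = "/default-rack"


def resolve(hosts):
    """Return one rack ID per host, in the same order as the input."""
    return [RACK_BY_HOST.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Addresses arrive as command-line arguments; rack IDs go to stdout.
    print(" ".join(resolve(sys.argv[1:])))
```

      Unknown hosts fall back to `/default-rack`, so an incomplete table still yields one answer per address.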

      The default replication factor is 3. Therefore, according to the Rack Awareness algorithm:

      • The first replica of the block is stored on a node in the local rack (the writer's own node when the client runs on a DataNode).
      • The second replica is stored on a node in a different (remote) rack.
      • The third replica is stored on a different node in that same remote rack.

      In Hadoop, we need Rack Awareness because it improves:

      • Data availability and reliability.
      • The performance of the cluster.
      • The use of network bandwidth.

      To study this in detail, see Rack Awareness in Hadoop.

    • #5468
      DataFlair Team
      Spectator

      A rack is a collection of nodes, usually in the tens, that are physically located close together and connected to the same switch. When a user issues a read/write request in a large Hadoop cluster, the NameNode chooses a DataNode that is closer in order to reduce network traffic. This is called Rack Awareness.

      Here is an analogy for Rack Awareness:
      Google Maps: when you enter a destination, the map offers several routes but automatically selects the shortest one so that you arrive in less time. Rack Awareness in Hadoop works the same way: it selects the nearest rack to reduce network traffic and keep data highly available.

      When data is written, one copy is stored locally and then replicated to another DataNode, and from there to further DataNodes, depending on the replication factor. With a replication factor of 3, that means 1 local copy + 2 replica copies.

    • #5469
      DataFlair Team
      Spectator

      The main advantages of Hadoop are high data availability and fast, parallel read/write operations.
      Rack Awareness is a way to achieve these advantages.

      Generally, a cluster with more than 30-40 nodes is configured across multiple racks. Communication between nodes on the same rack is faster than communication between nodes on racks that are far apart.
      In a large Hadoop cluster, to reduce network traffic during read/write operations, the NameNode chooses DataNodes that are on the same rack or a nearby rack.
      The NameNode holds the rack IDs of all racks, through which it maintains information about each rack. This concept of choosing closer DataNodes based on rack information is called Rack Awareness.

      As discussed above, the NameNode replicates data across DataNodes in a way that increases data availability and decreases latency.

    • #5470
      DataFlair Team
      Spectator

      1) Hadoop clusters with more than 30 or 40 nodes are usually configured across multiple racks. If your cluster runs on a single rack, no additional configuration is needed. In a multi-rack configuration, however, Hadoop must be aware of the topology of the nodes in the cluster to gain the maximum performance benefit.

      2) Rack Awareness means having information about the location of each DataNode distributed across the racks in a Hadoop cluster.

      3) Communication between two DataNodes on the same rack is more efficient than communication between two nodes on different racks. The NameNode therefore maintains rack IDs and measures the distance between nodes when placing data and its replicas on the DataNodes.

      4) To improve the efficiency of reading and writing data and to lower network bandwidth usage, the NameNode prefers nearby DataNodes, on the same rack where possible.

      5) The main purposes of Rack Awareness are:

      i) Increasing the availability of data blocks
      ii) Better cluster performance
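
      The distance measurement mentioned in point 3 can be illustrated by treating the cluster as a tree and counting the hops from each node up to their closest common ancestor. This is a simplified sketch under assumed conventions, not Hadoop's own code: the `/rack/node` path format and the function names `distance` and `order_by_distance` are chosen here for illustration.

```python
def distance(a: str, b: str) -> int:
    """Count tree hops between two nodes given paths such as '/rack1/node1'."""
    parts_a = a.strip("/").split("/")
    parts_b = b.strip("/").split("/")
    # Length of the shared path prefix (the closest common ancestor).
    common = 0
    for x, y in zip(parts_a, parts_b):
        if x != y:
            break
        common += 1
    # Hops up from each node to the common ancestor.
    return (len(parts_a) - common) + (len(parts_b) - common)


def order_by_distance(reader: str, replicas: list) -> list:
    """Return replica locations sorted from nearest to farthest from the reader."""
    return sorted(replicas, key=lambda replica: distance(reader, replica))


if __name__ == "__main__":
    reader = "/rack1/node1"
    replicas = ["/rack2/node3", "/rack1/node2", "/rack2/node4"]
    print(order_by_distance(reader, replicas))
```

      With two-level paths like these, the same node is at distance 0, a node on the same rack at distance 2, and a node on another rack at distance 4, so same-rack replicas sort first.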

    • #5471
      DataFlair Team
      Spectator

      In Hadoop, the nodes are housed in racks. To serve a read/write request, the NameNode selects a DataNode on the same rack or a nearby rack. This reduces network traffic while reading or writing an HDFS file. The NameNode keeps track of all these racks. Moreover, when the replicas of a block are created, they are stored on different racks.

      Let's take an example:
      If the replication factor is three, each block is stored three times. The first copy of the block is stored on the local rack, and the second and third copies are stored on rack-2. If one rack is damaged, a copy of the block is still available on another rack. In this way, we can prevent data loss. Hence, the process of storing blocks on different racks is called rack awareness.
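
      The example above (one copy on the local rack, two copies on a second rack) can be sketched in a few lines. This is a simplified illustration that assumes the writer runs on a DataNode and that a second rack with at least two nodes exists. The real NameNode logic also handles fallbacks and node load, which are omitted here. The function name `place_replicas` and the cluster layout are hypothetical.

```python
import random


def place_replicas(writer, nodes_by_rack, rng=random):
    """Pick three DataNodes: the writer's node, then two nodes on one other rack.

    nodes_by_rack maps a rack ID to the list of node names on that rack.
    """
    local_rack = None
    for rack, nodes in nodes_by_rack.items():
        if writer in nodes:
            local_rack = rack
            break
    if local_rack is None:
        raise ValueError("writer is not a known DataNode")

    # A remote rack needs at least two nodes to hold replicas 2 and 3.
    remote_racks = [rack for rack, nodes in nodes_by_rack.items()
                    if rack != local_rack and len(nodes) >= 2]
    if not remote_racks:
        raise ValueError("need a second rack with at least two nodes")

    remote_rack = rng.choice(remote_racks)
    second, third = rng.sample(nodes_by_rack[remote_rack], 2)
    return [(local_rack, writer), (remote_rack, second), (remote_rack, third)]


if __name__ == "__main__":
    cluster = {"/rack1": ["n1", "n2"], "/rack2": ["n3", "n4", "n5"]}
    for rack, node in place_replicas("n1", cluster):
        print(rack, node)
```

      Because the three copies always span two racks, losing either rack still leaves at least one copy of the block readable.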

      The purposes of rack awareness are:
      1. To improve high availability and reliability.
      2. To improve cluster performance.
      3. To make better use of network bandwidth.
