Rack Awareness in Hadoop HDFS – An Introductory Guide

1. Objective

This Hadoop tutorial explains the concept of rack awareness, what racks are in a Hadoop environment, why rack awareness is needed, how the replica placement policy uses rack awareness, and the advantages of implementing rack awareness in Hadoop HDFS.

2. What is Rack Awareness in Hadoop HDFS?

In a large Hadoop cluster, the namenode reduces network traffic during HDFS reads and writes by choosing datanodes on the same rack as the client, or on a nearby rack, to serve each request. The namenode obtains this rack information by maintaining the rack ID of each datanode. Choosing closer datanodes based on rack information is called rack awareness in Hadoop.
Rack awareness means knowing the cluster topology, or more specifically how the datanodes are distributed across the racks of a Hadoop cluster. A default Hadoop installation assumes that all datanodes belong to the same rack.
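
Rack IDs usually reach the namenode through an administrator-supplied mapping, commonly an executable script named by the `net.topology.script.file.name` property in core-site.xml. Hadoop calls the script with one or more datanode addresses as arguments and reads back one rack path per address. The following is only a minimal sketch of such a script: the subnet prefixes and rack names in `RACK_BY_PREFIX` are invented for illustration, and `/default-rack` is the single rack assumed when nothing is configured.

```python
#!/usr/bin/env python3
"""Minimal rack-mapping sketch: prints one rack path per address given."""
import sys

# Hypothetical mapping from subnet prefix to rack ID; replace with real data.
RACK_BY_PREFIX = {
    "10.0.1.": "/rack1",
    "10.0.2.": "/rack2",
}
DEFAULT_RACK = "/default-rack"


def resolve_rack(address: str) -> str:
    """Return the rack path for one address, or the default rack if unknown."""
    for prefix, rack in RACK_BY_PREFIX.items():
        if address.startswith(prefix):
            return rack
    return DEFAULT_RACK


if __name__ == "__main__":
    # One output token per input address, in the same order.
    print(" ".join(resolve_rack(arg) for arg in sys.argv[1:]))
```

Run by hand with two addresses, one from each hypothetical subnet, the script prints the two rack paths separated by a space.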

3. Why Rack Awareness?

In big data Hadoop deployments, rack awareness is needed for the following reasons:

  • To improve data availability and reliability.
  • To improve the performance of the cluster.
  • To make better use of network bandwidth.
  • To avoid losing data if an entire rack fails, even though a rack failure is far less likely than a node failure.
  • To keep bulk data transfers within a rack when possible.
  • It rests on the assumption that communication within a rack has higher bandwidth and lower latency than communication between racks.

4. Replica Placement via Rack Awareness in Hadoop

Replica placement is critical to the reliability and performance of HDFS. Optimizing replica placement via rack awareness distinguishes HDFS from most other distributed file systems. Block replication across multiple racks in HDFS follows this policy:

“No more than one replica is placed on any one node, and no more than two replicas are placed on the same rack. This carries the constraint that the number of racks used for block replication should be less than the total number of block replicas.”
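
The two rules in this policy are easy to check mechanically. The sketch below is an illustration only, not Hadoop code: the helper `is_valid_placement` and the five-node, two-rack cluster in `rack_of` are hypothetical names made up for the example.

```python
from collections import Counter


def is_valid_placement(replica_nodes, rack_of):
    """Check the two rules: at most one replica per node, at most two per rack."""
    if len(set(replica_nodes)) != len(replica_nodes):
        return False  # some node holds more than one replica
    per_rack = Counter(rack_of[node] for node in replica_nodes)
    return all(count <= 2 for count in per_rack.values())


# Hypothetical cluster: three nodes on /rack1, two nodes on /rack2.
rack_of = {
    "n1": "/rack1", "n2": "/rack1", "n5": "/rack1",
    "n3": "/rack2", "n4": "/rack2",
}

print(is_valid_placement(["n1", "n3", "n4"], rack_of))  # True
print(is_valid_placement(["n1", "n1", "n3"], rack_of))  # False: n1 used twice
print(is_valid_placement(["n1", "n2", "n5"], rack_of))  # False: three on /rack1
```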

For example:
When a new block is created, the first replica is placed on the local node (the node the writer runs on). The second replica is placed on a node in a different rack, and the third is placed on a different node in the same rack as the second.

When re-replicating a block: if only one replica exists, the second is placed on a different rack. If two replicas exist and both are on the same rack, the third is placed on a different rack.
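
The re-replication rules can be sketched as a small selection function. This is a simplified illustration under stated assumptions, not the namenode's actual logic: `pick_rereplication_target` and the cluster in `rack_of` are hypothetical, the case of two replicas already on different racks is not covered by the text (the sketch simply accepts any other node), and a real namenode would also weigh load and free space rather than take the first eligible node.

```python
def pick_rereplication_target(existing, rack_of):
    """Pick a node for the next replica, given the nodes that already hold one."""
    used_racks = {rack_of[node] for node in existing}
    # Never place a second replica on a node that already has one.
    candidates = [node for node in sorted(rack_of) if node not in existing]
    if len(existing) == 1 or (len(existing) == 2 and len(used_racks) == 1):
        # Rule from the text: move to a rack that holds no replica yet.
        candidates = [n for n in candidates if rack_of[n] not in used_racks]
    # Assumption: take the first eligible node; None if the cluster has none.
    return candidates[0] if candidates else None


# Hypothetical cluster: two nodes per rack.
rack_of = {"n1": "/rack1", "n2": "/rack1", "n3": "/rack2", "n4": "/rack2"}

print(pick_rereplication_target(["n1"], rack_of))        # n3
print(pick_rereplication_target(["n1", "n2"], rack_of))  # n3
```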

A simple but non-optimal policy is to place each replica on a different rack. This prevents data loss when an entire rack fails and allows reads to use bandwidth from multiple racks. It also distributes replicas evenly across the cluster, which makes it easy to balance load when a component fails. The biggest drawback is the higher cost of writes: the writer must transfer each block to multiple racks, and communication between nodes in different racks has to pass through switches.

In most cases, network bandwidth between machines in the same rack is greater than network bandwidth between machines in different racks. That is why HDFS uses the replica placement policy described above, which limits a block to two racks and so cuts inter-rack write traffic. Because the chance of a rack failure is far less than that of a node failure, this policy does not weaken the data reliability and availability guarantees. However, it does reduce the aggregate network bandwidth available when reading data, since a block's replicas are placed in only two unique racks rather than three.
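
The write-cost difference between the two policies can be made concrete by counting how many transfers cross a rack boundary. The sketch below assumes the block is forwarded from replica to replica in list order, starting on the writer's rack, and that replicas sharing a rack sit next to each other in that order; the function name and rack names are made up for illustration.

```python
def inter_rack_hops(pipeline_racks):
    """Count transfers that cross a rack boundary along a write pipeline."""
    return sum(
        1 for a, b in zip(pipeline_racks, pipeline_racks[1:]) if a != b
    )


# One replica per rack: every forward step leaves the current rack.
three_racks = ["/rack1", "/rack2", "/rack3"]
# Two replicas share a rack: only one step crosses racks.
two_racks = ["/rack1", "/rack2", "/rack2"]

print(inter_rack_hops(three_racks))  # 2
print(inter_rack_hops(two_racks))    # 1
```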

4.1. What about performance?

  • Faster replication: Since two of the replicas are placed within the same rack, the transfer between them uses the higher bandwidth and lower latency available inside a rack, which makes replication faster.
  • If YARN is unable to create a container on the datanode where the requested data is located, it tries to create the container on a datanode within the same rack. This performs better than an off-rack container because of the higher bandwidth and lower latency between nodes inside the same rack.
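
The fallback described in the second point amounts to ranking a candidate node by how close it is to the data. The sketch below is an illustration only: `locality_level` and the cluster in `rack_of` are hypothetical, and the three labels are borrowed from the locality terms YARN schedulers use (node-local, rack-local, off-switch).

```python
def locality_level(candidate_node, replica_nodes, rack_of):
    """Classify how close a candidate node is to the replicas of a block."""
    if candidate_node in replica_nodes:
        return "NODE_LOCAL"  # the data is on this very node
    replica_racks = {rack_of[node] for node in replica_nodes}
    if rack_of[candidate_node] in replica_racks:
        return "RACK_LOCAL"  # same rack as at least one replica
    return "OFF_SWITCH"      # data must cross a rack boundary


# Hypothetical cluster: replicas of one block live on n1 and n3.
rack_of = {"n1": "/rack1", "n2": "/rack1", "n3": "/rack2", "n6": "/rack3"}
replicas = ["n1", "n3"]

print(locality_level("n1", replicas, rack_of))  # NODE_LOCAL
print(locality_level("n2", replicas, rack_of))  # RACK_LOCAL
print(locality_level("n6", replicas, rack_of))  # OFF_SWITCH
```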

5. Advantages of Implementing Rack Awareness

  • Minimized write cost and maximized read speed – Rack awareness directs read and write requests to replicas on the same or a nearby rack, which minimizes write cost and maximizes read speed.
  • Maximized network bandwidth and low latency – Rack awareness makes the most of network bandwidth by keeping block transfers within a rack where possible. With rack awareness, YARN is also able to optimize MapReduce job performance: it assigns tasks to nodes that are ‘closer’ to their data in terms of network topology. This is particularly beneficial when tasks cannot be assigned to the nodes where their data is stored locally.
  • Data protection against rack failure – By default, the namenode assigns the 2nd and 3rd replicas of a block to nodes in a rack different from that of the first replica. This protects data even against rack failure; however, it is possible only if Hadoop has been configured with knowledge of its rack layout.
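
What ‘closer’ means in terms of network topology can be illustrated with a tree distance: treat each node as a path such as `/dc1/rack1/n1` and count the steps from each node up to their closest common ancestor. The sketch below shows that idea only; it is not Hadoop's own implementation, and the function name and paths are hypothetical.

```python
def network_distance(path_a, path_b):
    """Steps from each node up to their closest common ancestor, summed."""
    a = path_a.strip("/").split("/")
    b = path_b.strip("/").split("/")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)


print(network_distance("/dc1/rack1/n1", "/dc1/rack1/n1"))  # 0: same node
print(network_distance("/dc1/rack1/n1", "/dc1/rack1/n2"))  # 2: same rack
print(network_distance("/dc1/rack1/n1", "/dc1/rack2/n3"))  # 4: different rack
print(network_distance("/dc1/rack1/n1", "/dc2/rack1/n1"))  # 6: different data center
```

A smaller distance means less switching equipment between the task and its data, which is why a same-rack node is preferred over an off-rack one.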

9 Responses

  1. Elma says:

    Nicely written and explained Rack awareness concept on Hadoop HDFS.

    • Data Flair says:

      Hii Elma,
      Thank you for reading the complete article on Rack Awareness in Hadoop HDFS and giving us a valuable feedback. Hope by reading the article, you got the reason to learn Rack Awareness and its Advantages also.
      Keep visiting Data Flair for more such explanatory articles on Hadoop HDFS.

  2. Florian says:

    Great article for new users to understand rack awareness in HDFS. Thanks.

  3. Thayanban E says:

    How namenode choose datanodes which is closer to the same rack or different rack for read and write request….I cannot understand the line….can u explain in very detail

    • Nikhil Khojare says:

      It is the client that requests HDFS read/write operations, so the namenode first checks whether the HDFS client that sent the request is part of the cluster. If it is, the namenode finds the client's rack and serves data from the nearest rack wherever possible. Hope that clarifies.

  4. Jyothi says:

    great article.. very helpful.. I wish adding simple diagram to illustrate concept will be more helpful

  5. Mohan says:

    Explained very nice.

    I believe in cloud different subnets called racks.so I can deploy my data nodes between different nodes.do you think this is possible on cloud.

  6. Kiran says:

    correct me if im wrong, in the example 1st block is stored in local node, second block stored in second node in second rack and third block in 2 rack 3rd node. But in actual

    block1 – local node
    block2 – 2nd node(2nd rack)
    block3 – 2nd node(2nd rack)
