HDFS Disk Balancer – Learn how to Balance Data on DataNode

1. Hadoop Disk Balancer: Objective

This blog on Disk Balancer gives you a detailed view of the HDFS Disk Balancer. In this tutorial, we will cover what the Disk Balancer is, how it operates, why HDFS needs an intra-DataNode balancer, and what its capabilities are. If you have any query about the HDFS Disk Balancer, please ask us in the comments.

2. Introduction to HDFS Disk Balancer

HDFS provides a command-line tool called Disk Balancer (invoked as hdfs diskbalancer). It distributes data uniformly across all disks of a DataNode.
Disk Balancer is different from the Balancer, which analyzes data block placement and balances data across DataNodes.
HDFS might not always place data uniformly across the disks, for the following reasons:

  • A lot of writes and deletes
  • Disk replacement

This can lead to significant skew within a DataNode. The existing HDFS Balancer does not handle this situation, because it concerns itself with inter-DataNode skew, not intra-DataNode skew.
It is handled by the new intra-DataNode balancing functionality, which is invoked via the HDFS Disk Balancer CLI.
The Disk Balancer works against a given DataNode and moves blocks from one disk to another.

3. The operation of Disk Balancer in Hadoop

The HDFS Disk Balancer works by creating a plan (a set of statements) and executing that plan on the DataNode. A plan describes how much data should move between two disks and consists of one or more move steps. Each move step has a source disk, a destination disk, and a number of bytes to move. A plan can be executed against an operational DataNode.
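
To make the plan structure concrete, here is a minimal sketch in Python. It is not Hadoop's own plan format: the class names MoveStep and Plan, the apply_plan helper, and the disk paths are hypothetical, chosen only to mirror the description above (source disk, destination disk, bytes to move).

```python
from dataclasses import dataclass, field


@dataclass
class MoveStep:
    """One step of a plan: move some bytes from one disk to another."""
    source_disk: str
    dest_disk: str
    bytes_to_move: int


@dataclass
class Plan:
    """A plan targets a single DataNode and holds an ordered list of steps."""
    datanode: str
    steps: list = field(default_factory=list)


def apply_plan(used_bytes, plan):
    """Return the per-disk usage that would result from running every step."""
    result = dict(used_bytes)  # copy, so the caller's dict is not modified
    for step in plan.steps:
        if step.bytes_to_move > result[step.source_disk]:
            raise ValueError(f"{step.source_disk} holds fewer bytes than requested")
        result[step.source_disk] -= step.bytes_to_move
        result[step.dest_disk] += step.bytes_to_move
    return result


if __name__ == "__main__":
    # Hypothetical DataNode with one full disk and one nearly empty disk.
    used = {"/data/disk1": 800, "/data/disk2": 200}
    plan = Plan("dn1.example.com", [MoveStep("/data/disk1", "/data/disk2", 300)])
    print(apply_plan(used, plan))
```

In this toy example a single move step of 300 bytes leaves both disks holding 500 bytes.
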
The disk balancer is controlled by the dfs.disk.balancer.enabled property in hdfs-site.xml. In releases where it is not enabled by default, this property must be set to true; check the default for your Hadoop version.
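
The plan-then-execute cycle maps onto the hdfs diskbalancer subcommands -plan, -execute and -query. The sketch below only builds the command lines and does not run anything. The function name, host name and plan file path are hypothetical placeholders; in practice the plan step reports where it wrote the plan file, and that path is what should be passed to -execute.

```python
import shlex


def diskbalancer_commands(datanode, plan_file):
    """Return argv lists for one plan -> execute -> query cycle.

    datanode  : host name of the DataNode to balance (placeholder value)
    plan_file : path of the plan file reported by the -plan step (placeholder)
    """
    return [
        ["hdfs", "diskbalancer", "-plan", datanode],       # generate a plan
        ["hdfs", "diskbalancer", "-execute", plan_file],   # submit it to the DataNode
        ["hdfs", "diskbalancer", "-query", datanode],      # check progress
    ]


if __name__ == "__main__":
    for argv in diskbalancer_commands("dn1.example.com",
                                      "/path/to/dn1.example.com.plan.json"):
        print(shlex.join(argv))
```

The lists can be handed to subprocess.run on a machine where the hdfs client is installed and configured.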

4. HDFS Intra-DataNode Disk Balancer

When we write a new block to HDFS, the DataNode uses a volume-choosing policy to choose the disk for the block. Each directory is a volume in HDFS terminology. Two such policies are:

  • Round-robin: It distributes the new blocks in a uniform way across the available disks.
  • Available space: It writes data to the disk that has the most free space (by percentage).

The DataNode uses the round-robin policy by default. In a long-running cluster, massive file deletion and addition in HDFS can still leave the volumes of a DataNode significantly imbalanced. Even the available-space volume-choosing policy can lead to less efficient disk I/O.
For instance, every new write will go to a newly added empty disk while the other disks sit idle during that period, creating a bottleneck on the new disk.
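
The bottleneck on a newly added disk can be seen in a small simulation. This is a simplified sketch, not Hadoop's implementation: the available-space rule here is the one-line description given above (pick the disk with the highest free-space percentage), and the disk names, capacities and block size are made-up values.

```python
from itertools import cycle


def most_free_by_percentage(used, capacity):
    """Simplified available-space rule: disk with the largest free ratio."""
    return max(used, key=lambda d: (capacity[d] - used[d]) / capacity[d])


def simulate(policy, used, capacity, writes, block=1):
    """Count how many of `writes` blocks each disk receives under a policy."""
    used = dict(used)                      # work on a copy
    counts = {disk: 0 for disk in used}
    rotation = cycle(sorted(used))         # fixed order for round-robin
    for _ in range(writes):
        if policy == "round_robin":
            disk = next(rotation)
        else:
            disk = most_free_by_percentage(used, capacity)
        used[disk] += block
        counts[disk] += 1
    return counts


if __name__ == "__main__":
    capacity = {"disk1": 100, "disk2": 100, "disk3": 100}
    used = {"disk1": 80, "disk2": 80, "disk3": 0}   # disk3 was just added
    print(simulate("round_robin", used, capacity, writes=30))
    print(simulate("available_space", used, capacity, writes=30))
```

With these numbers, round-robin spreads the 30 writes evenly (10 per disk) and leaves the skew in place, while the available-space rule sends all 30 writes to disk3.
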
The Apache Hadoop community developed several offline scripts to reduce the data imbalance issue.
HDFS-1312 also introduced an online disk balancer that rebalances the volumes on a running DataNode based on various metrics.

5. Abilities of HDFS Disk Balancer

5.1. Data spread report

Through metrics, we can measure how data is spread across the volumes of a node and compare one volume with another.

a. Volume data density or Intra-node data density

This metric computes how much data is on a node and what the ideal storage on each volume would be. Ideal storage is the total data stored on the node divided by the total disk capacity of that node.
Ideal storage = total used / total capacity
Volume data density = ideal storage - dfsUsedRatio
Here dfsUsedRatio is the fraction of a volume's own capacity that is in use. A positive value of volume data density indicates that the disk is under-utilized. A negative value indicates that the disk is over-utilized.
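
As a worked example of the two formulas, the sketch below computes volume data density for a hypothetical node with three equally sized disks. The function name and the numbers are illustrative only.

```python
def volume_data_densities(used, capacity):
    """Return {volume: ideal storage - dfsUsedRatio} for one node.

    used, capacity: dicts keyed by volume name, in the same unit (e.g. GB).
    """
    ideal_storage = sum(used.values()) / sum(capacity.values())
    return {
        volume: ideal_storage - used[volume] / capacity[volume]
        for volume in used
    }


if __name__ == "__main__":
    capacity = {"disk1": 100, "disk2": 100, "disk3": 100}
    used = {"disk1": 80, "disk2": 40, "disk3": 0}
    for volume, density in volume_data_densities(used, capacity).items():
        state = ("under-utilized" if density > 0
                 else "over-utilized" if density < 0
                 else "balanced")
        print(f"{volume}: {density:+.2f} ({state})")
```

Here ideal storage is 120 / 300 = 0.4, so disk1 comes out at -0.40 (over-utilized), disk2 at 0.00, and disk3 at +0.40 (under-utilized).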

b. Node data density or inter-node data density

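Now that we have volume data density, we can compare nodes and see which nodes in the data center need balancing. Node data density is computed from the volume data densities of a node (the sum of their absolute values), so a higher value means a more skewed node.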

c. Reports

Once we have volume data density and node data density, the disk balancer can report the nodes in the cluster with the most skewed data distribution (for example, the top 20), or the volume data density for a given node.
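
A report of this kind can be sketched as a simple ranking. The code assumes that node data density is the sum of the absolute values of a node's volume data densities; the function names, node names and usage figures are hypothetical.

```python
def node_data_density(volumes):
    """Sum of |ideal storage - used ratio| over a node's volumes.

    volumes: list of (used, capacity) pairs for one node.
    """
    total_used = sum(used for used, _ in volumes)
    total_capacity = sum(capacity for _, capacity in volumes)
    ideal_storage = total_used / total_capacity
    return sum(abs(ideal_storage - used / capacity) for used, capacity in volumes)


def most_skewed_nodes(cluster, top=20):
    """Return up to `top` node names, most skewed first."""
    ranked = sorted(cluster, key=lambda node: node_data_density(cluster[node]),
                    reverse=True)
    return ranked[:top]


if __name__ == "__main__":
    cluster = {
        "dn1": [(80, 100), (40, 100), (0, 100)],  # badly skewed
        "dn2": [(50, 100), (50, 100)],            # perfectly even
        "dn3": [(60, 100), (40, 100)],            # slightly skewed
    }
    for node in most_skewed_nodes(cluster, top=2):
        print(node, round(node_data_density(cluster[node]), 2))
```

In this toy cluster dn1 ranks first and dn3 second, while the evenly filled dn2 has a density of zero and drops out of the top two.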

5.2. Balance data between volumes while the DataNode is alive

The disk balancer can move data from one volume to another while the DataNode stays online.

6. HDFS Disk Balancer: Conclusion

In conclusion, the Disk Balancer is the tool that distributes data evenly across all disks of a DataNode. It works by creating a plan (a set of statements) and executing that plan on the DataNode.
When writing new blocks, the DataNode chooses a disk using the round-robin or available-space policy; the Disk Balancer corrects the imbalance that can still build up over time.
If you liked this post or have any query related to the HDFS Disk Balancer in Hadoop, leave a comment.
Reference:
Apache Hadoop

2 Responses

  1. Nakul Kundra says:

    Sir, which policy to use Round Robin or Available Space and why?

  2. ram says:

    Hi ,
    We have 5 node cluster using plain vanilla apache, recently we added a new node into the cluster now total we have 6 node cluster,after adding that new node into our cluster we ran the balancer.but its running very slowly, still its running, can any one please say how much time it will takes for complete balancing and how we get to know that balancing is completed.
    Thanks…..
