HDFS Disk Balancer – Learn how to Balance Data on DataNode
1. Hadoop Disk Balancer: Objective
This blog on Disk Balancer will provide you the detailed view of Hadoop HDFS Balancer. In this tutorial on Hadoop Balancer, we will cover what exactly is Hadoop balancer in Hadoop, operations of HDFS balancer in HDFS, what is the need of Intra-data node balancer in HDFS, what are the capabilities of Hadoop balancer in HDFS? Do let us know if you face any query in HDFS Balancer, Please ask us in Comments.
2. Introduction to HDFS Disk Balancer
HDFS provides a command-line tool called diskbalancer, which distributes data uniformly across all disks of a DataNode.
The disk balancer is different from the HDFS Balancer, which analyzes block placement and balances data across DataNodes.
HDFS might not always place data uniformly across the disks of a DataNode for the following reasons:
- A lot of writes and deletes
- Disk replacement
This can lead to significant skew within a DataNode. The existing HDFS Balancer does not handle this situation, because it concerns itself with inter-DataNode skew, not intra-DataNode skew.
This situation is handled by the new Intra-DataNode Balancing functionality, which is invoked via the HDFS Disk Balancer CLI.
The HDFS Disk Balancer works against a given DataNode and moves blocks from one disk to another.
3. The operation of Disk Balancer in Hadoop
The Hadoop HDFS Disk Balancer works by creating a plan (a set of statements) and executing that plan on the DataNode. A plan describes how much data should move between two disks. A plan consists of multiple move steps, and each move step has a source disk, a destination disk, and the number of bytes to move. A plan can be executed against an operational DataNode.
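As an illustration, the plan-and-execute workflow is driven through the hdfs diskbalancer command; the hostname and plan-file path below are placeholders, and the -plan command prints the location where the plan JSON is actually written:

    # Generate a plan for one DataNode (produces a <hostname>.plan.json file)
    hdfs diskbalancer -plan datanode1.example.com

    # Execute the generated plan on that DataNode
    hdfs diskbalancer -execute datanode1.example.com.plan.json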
By default, the disk balancer is not enabled. To enable it, dfs.disk.balancer.enabled must be set to true in hdfs-site.xml.
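For example, a minimal hdfs-site.xml snippet for the DataNodes looks like this:

    <property>
      <name>dfs.disk.balancer.enabled</name>
      <value>true</value>
    </property>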
4. HDFS Intra-DataNode Disk Balancer
When we write a new block in HDFS, the DataNode uses a volume-choosing policy to select the disk for the block. Each data directory is a volume in HDFS terminology. Two such policies are:
- Round-robin: It distributes the new blocks in a uniform way across the available disks.
- Available space: It writes data to the disk that has the most free space (by percentage).
The DataNode uses the round-robin policy by default. In a long-running cluster, due to massive file deletions and additions in HDFS, it is still possible for a DataNode's volumes to become significantly imbalanced. Even the available-space-based volume-choosing policy can lead to less efficient disk I/O:
For instance, every new write will go to the newly added empty disk while the other disks are idle during that period, creating a bottleneck on the new disk.
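As a hedged illustration of switching policies, the available-space policy shipped with Hadoop can be selected in hdfs-site.xml (verify the class name against your Hadoop version):

    <property>
      <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
      <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
    </property>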
The Apache Hadoop community initially developed offline scripts, run while the DataNode is down, to reduce the data imbalance issue.
HDFS-1312 also introduced an online disk balancer that re-balances the volumes on a running DataNode based on various metrics.
Any doubt in this section on the Hadoop Disk Balancer? Please comment.
5. Abilities of HDFS Disk Balancer
5.1. Data spread report
Through metrics, we can measure how data is spread across the different volumes of a node.
a. Volume data density or Intra-node data density
This metric computes how much data is on each volume of a node compared with the ideal storage for that volume. The ideal storage is the total data stored on the node divided by the total disk capacity of the node:
ideal storage = total used / total capacity
volume data density = ideal storage – dfsUsedRatio
A positive value of volume data density indicates that the disk is under-utilized, while a negative value indicates that the disk is over-utilized.
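To make the formula concrete, here is a minimal Python sketch; the volume names, capacities, and usage figures are made-up values, not metrics from any real cluster:

    # Hypothetical disk stats for one DataNode: volume -> (capacity_bytes, used_bytes)
    volumes = {
        "/data/disk1": (4_000_000_000_000, 3_600_000_000_000),
        "/data/disk2": (4_000_000_000_000, 400_000_000_000),
    }

    total_used = sum(used for _, used in volumes.values())
    total_capacity = sum(cap for cap, _ in volumes.values())
    ideal_storage = total_used / total_capacity  # ideal used-ratio for every volume

    for name, (cap, used) in volumes.items():
        dfs_used_ratio = used / cap
        volume_data_density = ideal_storage - dfs_used_ratio
        # positive -> under-utilized volume, negative -> over-utilized volume
        print(f"{name}: volume data density = {volume_data_density:+.2f}")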
b. Node data density or inter-node data density
Now that we have the volume data density, we can compare which nodes in the data center need to be balanced.
Once we have volume data density and node data density, the disk balancer can balance, for example, the top 20 nodes in the cluster that have the most skewed data distribution.
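As a sketch of that ranking step, the snippet below aggregates each node's volume data densities into a node data density and picks the most skewed nodes; the aggregation used here (sum of absolute volume data densities) and the hostnames are assumptions for illustration only:

    # Hypothetical volume data densities per node (one value per volume).
    # Assumption: node data density = sum of absolute volume data densities,
    # so a larger value means a more skewed node.
    node_volume_densities = {
        "dn1.example.com": [+0.40, -0.40],
        "dn2.example.com": [+0.05, -0.05],
        "dn3.example.com": [+0.25, -0.10, -0.15],
    }

    node_data_density = {
        node: sum(abs(d) for d in densities)
        for node, densities in node_volume_densities.items()
    }

    # Rank nodes by skew and keep the most skewed ones (top 20 in a real cluster)
    top_skewed = sorted(node_data_density, key=node_data_density.get, reverse=True)[:20]
    print(top_skewed)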
5.2. Balance data between volumes while the DataNode is alive
The disk balancer has the ability to move data from one volume to another while the DataNode is alive and serving requests.
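Because the plan runs against a live DataNode, its progress can be checked and the plan can be cancelled from the command line; the hostname and plan file below are placeholders:

    # Check the status of a running plan on a DataNode
    hdfs diskbalancer -query datanode1.example.com

    # Cancel a running plan using its plan file
    hdfs diskbalancer -cancel datanode1.example.com.plan.json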
6. HDFS Disk Balancer: Conclusion
In conclusion, the HDFS Disk Balancer is the tool that distributes data across all disks of a DataNode. It works by creating a plan (a set of statements) and executing that plan on the DataNode.
For choosing the disk for a new block, the DataNode uses the round-robin or available-space volume-choosing policy.
If you like this post on the Disk Balancer, or have any query related to the HDFS Disk Balancer in Hadoop, leave a comment below.