What is balancer? How to run a cluster balancing utility?


    • #5362
      DataFlair Team
      Spectator

      In Hadoop, HDFS data is not always spread evenly across the DataNodes, often because new DataNodes have been added to an existing cluster. When placing new blocks, the NameNode considers several parameters before choosing which DataNodes receive them.
      HDFS provides a tool called the Balancer that analyzes block placement and rebalances data across the DataNodes. It is generally run by the Hadoop administrator.

      To run the cluster balancing utility, use the following command:
      $ hadoop balancer [-threshold <threshold>]

      where -threshold is a percentage of disk capacity. When given, it overrides the default threshold.
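As a minimal sketch of how the command above might be assembled from a script, the Python helper below builds the argument list with or without the optional threshold. The function name `balancer_command` and the 0 to 100 range check are assumptions made for illustration; they are not part of Hadoop.

```python
import shlex


def balancer_command(threshold=None, executable="hadoop"):
    """Build the argument list for the balancer command quoted in the post."""
    args = [executable, "balancer"]
    if threshold is not None:
        # Assumption: the threshold is a percentage, so reject values outside 0..100.
        if not 0 <= threshold <= 100:
            raise ValueError(f"threshold must be a percentage, got {threshold}")
        args += ["-threshold", str(threshold)]
    return args


if __name__ == "__main__":
    # Only prints the command line; on a real cluster the list could be
    # passed to subprocess.run instead.
    print(shlex.join(balancer_command(5)))
```

Leaving `threshold` unset produces the bare command, so the balancer falls back to its default threshold.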

    • #5365
      DataFlair Team
      Spectator

      In HDFS, new blocks are allocated evenly among all the DataNodes. But in a large-scale cluster, nodes have different capacities, and you will often need to add new nodes or remove old ones for better performance. How, then, does Hadoop balance data usage across all the DataNodes?

      The answer is that Hadoop has a balancing policy to keep the data on all nodes balanced. The HDFS Balancer rebalances data among the cluster's DataNodes when the cluster becomes unbalanced, for example after new nodes are added or after deletions.

      The HDFS Balancer does not run in the background; it has to be started manually. The command is:
      hdfs balancer [-threshold <threshold>]
      where <threshold> is a percentage of disk capacity.

      The threshold parameter is a number between 0 and 100.
      Starting from the average cluster utilization, the balancer tries to bring every DataNode's usage into the range [average – threshold, average + threshold].

      The default threshold is 10%.

      For example, if the cluster is currently 50% full on average and the default threshold of 10% is used, DataNodes above the upper bound start moving data to DataNodes below the lower bound:

      – Upper bound (average + threshold): 60%
      – Lower bound (average – threshold): 40%
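
The arithmetic in this example can be sketched in a few lines of Python. This only illustrates the range calculation described in the post, not how the Balancer is implemented. The node names and utilization figures are made up, and the sketch assumes the cluster average is supplied by the caller.

```python
def balance_range(average, threshold=10.0):
    """Return the (lower, upper) utilization bounds, in percent."""
    return max(0.0, average - threshold), min(100.0, average + threshold)


def classify(utilization, average, threshold=10.0):
    """Label each node as 'over', 'under', or 'within' the target range."""
    lower, upper = balance_range(average, threshold)
    labels = {}
    for node, used in utilization.items():
        if used > upper:
            labels[node] = "over"      # candidate source of block moves
        elif used < lower:
            labels[node] = "under"     # candidate destination
        else:
            labels[node] = "within"    # already inside the range
    return labels


if __name__ == "__main__":
    # Hypothetical per-node disk usage in percent.
    nodes = {"dn1": 72.0, "dn2": 55.0, "dn3": 23.0}
    print(balance_range(50.0))            # (40.0, 60.0)
    print(classify(nodes, average=50.0))  # dn1 over, dn2 within, dn3 under
```

With a 50% average and the default 10% threshold, only nodes above 60% or below 40% are out of range; a smaller threshold narrows that window and asks for a tighter balance.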
