What should be the ideal replication factor on a Hadoop Cluster?

  • Author
    Posts
    • #5618
      DataFlair Team
      Spectator

      To decide what would be the ideal replication factor for your Hadoop cluster, analyse these three parameters:
      1. The number of nodes in the cluster; you cannot have more replicas of a block than there are DataNodes.
      2. How much data you plan to store, i.e. your expected usage.
      3. Your cluster’s storage capacity.

      * A replication factor of 3 is set by default and is also the common practice (a minimal client-side sketch follows below).
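
      As a minimal sketch of that default, the snippet below (assuming a reachable HDFS cluster; the path /user/dataflair/sample.txt and the class name are hypothetical) sets the dfs.replication property on the client configuration so that files this client creates are written with the chosen replication factor:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FSDataOutputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class SetDefaultReplication {
          public static void main(String[] args) throws Exception {
              // Client-side setting; the cluster-wide default lives in hdfs-site.xml under dfs.replication.
              Configuration conf = new Configuration();
              conf.set("dfs.replication", "3"); // applied to files created by this client

              FileSystem fs = FileSystem.get(conf);

              // Hypothetical path, used only for illustration.
              Path file = new Path("/user/dataflair/sample.txt");
              try (FSDataOutputStream out = fs.create(file)) {
                  out.writeUTF("stored with three replicas");
              }
              fs.close();
          }
      }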

    • #5620
      DataFlair Team
      Spectator

      Replication Factor
      It is the number of times the Hadoop framework replicates each data block. A block is replicated to provide fault tolerance. The default replication factor is 3 and can be configured as per the requirement; it can be lowered (for example to 2) or increased to more than 3, as shown below.
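
      For example, the replication factor of a file that already exists in HDFS can be changed through FileSystem.setReplication(); this is only a sketch, and the file path used here is hypothetical:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class ChangeReplication {
          public static void main(String[] args) throws Exception {
              FileSystem fs = FileSystem.get(new Configuration());

              // Hypothetical file; lower its replication from the default 3 to 2.
              Path file = new Path("/user/dataflair/logs/events.log");
              boolean accepted = fs.setReplication(file, (short) 2);

              System.out.println("Replication change accepted: " + accepted);
              fs.close();
          }
      }

      The same change can be made from the command line with hdfs dfs -setrep.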

      The ideal replication factor is 3 for the following reasons:

      If one copy is inaccessible or corrupted, the data can be read from another copy.
      There is ample time for the NameNode to notice the failed node and re-replicate its blocks onto a new node.
      If a second node also fails unexpectedly in the meantime, you still have one live copy of your critical data to process.
      3x replication also fits the Hadoop Rack Awareness placement policy. Hence a replication factor of 3 works well in most situations without over-replicating data, whereas replication of less than 3 can make data recovery challenging.

      The following parameters should be considered when selecting a replication factor:

      The cost of a node failure.
      The cost of replication (additional storage and network usage).
      The relative probability of node failure.

    • #5621
      DataFlair Team
      Spectator

      In Hadoop, once the client has finished writing data to a DataNode (slave node), that DataNode replicates the data to other DataNodes so that the data remains highly available, reliable and fault tolerant.

      The replication factor normally depends on the requirement; by default it is 3, but it can be set to 2 or to a value greater than 3 as well.

      The main reason to keep the replication factor at 3 is this: if a particular DataNode goes down, the blocks on it are no longer accessible, but with a replication factor of 3 their copies are stored on other DataNodes. Even if a second DataNode also goes down, the data remains highly available because one more copy is still stored on a different DataNode.

      The replication factor also relates to Rack Awareness, which states the rule as “no more than one copy on the same node and no more than two copies in the same rack”. With this rule the data remains highly available if the replication factor is 3.
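
      To check where the copies of a block actually land, and whether they span racks, one way (sketched here assuming rack awareness is configured on the cluster and using a hypothetical file path) is to read the block locations through the FileSystem API; the topology path of each replica includes its rack:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.BlockLocation;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class ShowReplicaPlacement {
          public static void main(String[] args) throws Exception {
              FileSystem fs = FileSystem.get(new Configuration());

              // Hypothetical file used only for illustration.
              Path file = new Path("/user/dataflair/sample.txt");
              FileStatus status = fs.getFileStatus(file);

              // One BlockLocation per block; each lists the DataNodes holding a replica.
              BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
              for (BlockLocation block : blocks) {
                  System.out.println("Block at offset " + block.getOffset());
                  // Topology paths look like /rack-id/host:port, so the rack of each replica is visible.
                  for (String topologyPath : block.getTopologyPaths()) {
                      System.out.println("  replica on " + topologyPath);
                  }
              }
              fs.close();
          }
      }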

    • #5622
      DataFlair Team
      Spectator

      The ideal replication factor is considered to be 3. Why so? Below are the reasons:
      As we know, HDFS is designed to be fault tolerant. To support this, HDFS replicates the data sent to it for storage so that the data stays available. But how does simply creating replicas keep the data available?

      HDFS ensures that the replicas are placed in such a way that if one DataNode fails, the data is still available on another node. For this, HDFS needs to store a copy of the data on a different node.

      But what if the entire rack fails? For this, HDFS needs to keep a copy of the data in another rack.
      Hence, ideally HDFS stores one copy of a block on a node in one rack and the other two copies on different nodes in a different rack. This ensures fault tolerance and high availability.
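
      As a small sanity check of this placement, the replication factor recorded for a file and the number of DataNodes actually holding each of its blocks can be read back as below; the path is hypothetical and this is only a sketch:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.BlockLocation;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class CheckReplication {
          public static void main(String[] args) throws Exception {
              FileSystem fs = FileSystem.get(new Configuration());

              Path file = new Path("/user/dataflair/sample.txt"); // hypothetical file
              FileStatus status = fs.getFileStatus(file);
              System.out.println("Target replication factor: " + status.getReplication());

              // Ideally each block reports as many hosts as the target replication factor.
              for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                  System.out.println("Block at offset " + block.getOffset()
                          + " has " + block.getHosts().length + " replica(s)");
              }
              fs.close();
          }
      }

      A similar view is available from the command line with hdfs fsck <path> -files -blocks -locations.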
