Ideally, what should the replication factor be in a Hadoop cluster?

This topic contains 4 replies, has 1 voice, and was last updated by dfbdteam3 1 year ago.

Viewing 5 posts - 1 through 5 (of 5 total)


    Ideally, what should the replication factor be in a Hadoop cluster?



    The replication factor is the number of copies HDFS keeps of every data block. The default in Hadoop is 3. Replication is not a drawback; it is what makes Hadoop fault tolerant, because a block stays available when a node holding one of its copies fails.

    The replication factor can be changed: it can be lowered to 2 or raised above 3. However, 3 is generally considered ideal, because:

    If one of your nodes goes down, you are still fault tolerant: your critical data remains safely stored on the other two nodes.
    You also have ample time for the NameNode to detect the failure and re-replicate the failed node's blocks onto another node.
    And if, in the meantime, a second node fails unexpectedly, you still have one active node holding your critical data.
    Hence a replication factor of 3 is considered the best fit: a lower value makes data recovery riskier, and a higher value increases storage cost.
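
    To put rough numbers on that trade-off, here is a minimal Python sketch. It assumes every replica of a block sits on a different node and that the failures happen before HDFS has re-replicated anything; the function name and the 100 TB figure are only illustrative.

```python
def replication_tradeoff(replication_factor: int, logical_tb: float) -> dict:
    """Return failure tolerance and raw storage need for a replication factor."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return {
        "replication_factor": replication_factor,
        # A block stays readable as long as at least one replica survives.
        "node_failures_tolerated": replication_factor - 1,
        # Every block is stored replication_factor times on disk.
        "raw_storage_tb": logical_tb * replication_factor,
    }


# Compare a few replication factors for 100 TB of logical data (illustrative).
for rf in (1, 2, 3, 4):
    print(replication_tradeoff(rf, logical_tb=100.0))
```

    Going from 2 to 3 buys tolerance of a second simultaneous failure; going beyond 3 mostly adds raw storage cost.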

    To learn more about replication, see: Hadoop Features



    The default replication factor is 3 in Hadoop.

    3 is also the ideal replication factor, for the following reasons:

    1) Hadoop runs in a clustered environment: each cluster has multiple racks, and each rack has multiple DataNodes.

    2) To make HDFS fault tolerant, we need to consider both DataNode failure and rack failure.
    With a replication factor of 3, the replica placement strategy tries to keep two replicas on the same rack and the remaining replica on a different rack.

    So in the above cases, we need to make sure that:

    If one DataNode fails, you can get the same data from another DataNode.
    If the entire rack fails, you can get the same data from another rack.
    This satisfies the fault tolerance criteria.

    However, Hadoop lets you change the replication factor through the dfs.replication property in the configuration file hdfs-site.xml.
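
    For reference, the entry in hdfs-site.xml is a property element with a name and a value inside the configuration root. The Python sketch below only renders such a fragment so the shape is visible; the helper name hdfs_site_fragment is made up for this example and is not part of Hadoop.

```python
import xml.etree.ElementTree as ET


def hdfs_site_fragment(replication: int) -> str:
    """Render a <configuration> fragment that sets dfs.replication."""
    if replication < 1:
        raise ValueError("replication factor must be at least 1")
    configuration = ET.Element("configuration")
    prop = ET.SubElement(configuration, "property")
    ET.SubElement(prop, "name").text = "dfs.replication"
    ET.SubElement(prop, "value").text = str(replication)
    ET.indent(configuration)  # pretty-print; requires Python 3.9+
    return ET.tostring(configuration, encoding="unicode")


# Print the fragment for the default replication factor of 3.
print(hdfs_site_fragment(3))
```

    Note that dfs.replication is the default applied to newly written files; files that already exist keep their current replication until it is changed explicitly, for example with hdfs dfs -setrep.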



    Default replication factor is 3. Ideally, this is the replication factor to use.
    For example, in a cluster with 3x replication, suppose we plan maintenance on one of the three nodes holding a block and another of them suddenly stops working. We still have one node available with the data, which keeps Hadoop fault tolerant.
    3x replication also supports rack awareness. So a replication factor of 3 covers these failure scenarios without over-replicating data.
    We can change the replication factor in hdfs-site.xml.



    Default replication factor is 3.
    To make HDFS fault tolerant, we have to consider:
    a) DataNode failure
    b) Rack failure
    So there will be two copies on a single rack, on two different DataNodes, and one copy on a different rack. If one DataNode (DN) fails, you can get the same data from another DN, and if the entire rack fails, you can get the data from a DN on another rack.
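
    The layout described above (two copies on one rack, one copy on another) can be checked with a small simulation. This is a simplified sketch, not HDFS's actual placement code: the cluster map, rack names, and function names are made up, and node names are assumed to be unique across racks.

```python
import random


def place_replicas(cluster: dict[str, list[str]], rng: random.Random) -> list[tuple[str, str]]:
    """Pick three (rack, node) pairs: two nodes on one rack, one on another."""
    racks_with_two = [rack for rack, nodes in cluster.items() if len(nodes) >= 2]
    if not racks_with_two:
        raise ValueError("need at least one rack with two DataNodes")
    first_rack = rng.choice(racks_with_two)
    other_racks = [rack for rack, nodes in cluster.items() if rack != first_rack and nodes]
    if not other_racks:
        raise ValueError("need a second rack with at least one DataNode")
    second_rack = rng.choice(other_racks)
    node_a, node_b = rng.sample(cluster[first_rack], 2)
    node_c = rng.choice(cluster[second_rack])
    return [(first_rack, node_a), (first_rack, node_b), (second_rack, node_c)]


def survives_node_failure(replicas: list[tuple[str, str]], failed_node: str) -> bool:
    """True if some replica lives on a node other than the failed one."""
    return any(node != failed_node for _, node in replicas)


def survives_rack_failure(replicas: list[tuple[str, str]], failed_rack: str) -> bool:
    """True if some replica lives on a rack other than the failed one."""
    return any(rack != failed_rack for rack, _ in replicas)


# Hypothetical two-rack cluster used only for this illustration.
cluster = {"rack1": ["dn1", "dn2", "dn3"], "rack2": ["dn4", "dn5", "dn6"]}
replicas = place_replicas(cluster, random.Random(0))
print(replicas)
print(all(survives_rack_failure(replicas, rack) for rack in cluster))
```

    Whichever single DataNode or single rack is taken out, at least one of the three replicas remains reachable, which is the point made in the post above.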
