Forums › Hadoop › Ideally, what should be the replication factor in a Hadoop cluster?

This topic contains 4 replies, has 1 voice, and was last updated by dfbdteam3 1 year ago.

    #5547

    dfbdteam3
    Moderator

    Ideally, what should be the replication factor in a Hadoop cluster?

    #5549

    dfbdteam3
    Moderator

    The replication factor is the number of times each data block is replicated. In Hadoop, the default replication factor is 3. Replication is not a drawback of Hadoop; in fact, it is what makes Hadoop fault tolerant, and therefore effective and efficient.

    Hadoop does give you the flexibility to change the replication factor, i.e., it can be lowered to 2 (less than 3) or increased (more than 3); a command-line sketch of this follows the list below. However, a replication factor of 3 is considered ideal because:

    If one node goes down, you still have fault tolerance with 2 nodes, and your critical data is safely stored on those two nodes.
    The NameNode also has ample time to detect the failure and re-replicate the lost copies onto a new node.
    And if, in the meantime, the second node also fails unexpectedly, you still have one active node holding your critical data to process.
    Hence a replication factor of 3 is considered the best fit: anything lower can be challenging during data recovery, and a higher number of copies drives up storage cost.
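    As a quick illustration of that flexibility, the replication factor of data already stored in HDFS can be inspected and changed from the command line. This is only a sketch; /user/data/sample.txt is a hypothetical path.

        # Print the current replication factor of a file (%r is the replication field)
        hdfs dfs -stat "%r" /user/data/sample.txt

        # Change the replication factor of that file to 2; -w waits until
        # the replica count actually reaches the new target
        hdfs dfs -setrep -w 2 /user/data/sample.txt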

    To learn more about Replication follow: Hadoop Features

    #5551

    dfbdteam3
    Moderator

    The default replication factor is 3 in Hadoop.

    An ideal replication factor is 3 for the following reasons:

    1) Hadoop runs in a clustered environment: each cluster has multiple racks, and each rack has multiple DataNodes.

    2) To make HDFS fault tolerant we need to consider two failure modes: DataNode failure and rack failure.
    With a replication factor of 3, the replica placement strategy keeps 2 copies of each block in the same rack and the remaining copy in another rack.

    In both of these cases, we need to make sure that:

    If one DataNode fails, you can get the same data from another DataNode.
    If the entire rack fails, you can get the same data from another rack.
    This is how the fault tolerance criteria are fulfilled.

    However, Hadoop is flexible: the replication factor can be changed through the dfs.replication property in the hdfs-site.xml configuration file, for example:
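    The snippet below is a minimal sketch of that property as it would sit inside the <configuration> element of hdfs-site.xml. The value 3 is the default; it only applies to files written after the change, while existing files keep their replication factor unless it is changed explicitly (e.g. with hdfs dfs -setrep).

        <!-- hdfs-site.xml: default number of replicas for each HDFS block -->
        <property>
          <name>dfs.replication</name>
          <value>3</value>
        </property>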

    #5552

    dfbdteam3
    Moderator

    The default replication factor is 3, and ideally that is what the replication factor should be.
    Taking an example: in a 3x replication cluster, suppose we plan a maintenance activity on one of the three nodes holding a block and another node suddenly stops working. In that case we still have one node available, which keeps Hadoop fault tolerant.
    3x replication also serves the rack awareness scenario, so a replication factor of 3 works well for all situations without over-replicating data.
    We can change the replication factor in hdfs-site.xml (via the dfs.replication property, as shown above).

    #5553

    dfbdteam3
    Moderator

    The default replication factor is 3.
    To make HDFS fault tolerant we have to consider:
    a) DataNode failure
    b) Rack failure
    So there will be 2 copies in a single rack, on two different DataNodes, and 1 copy in a different rack. If one DataNode fails, you can get the same data from another DataNode, and if the entire rack fails, you can get the data from a DataNode in another rack.
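    One way to see this placement on a live cluster is hdfs fsck, which can list the blocks of a path together with the rack of each replica (when a rack topology is configured). This is only a sketch; /user/data is a hypothetical path.

        # Show files, blocks and the rack of every replica under /user/data
        hdfs fsck /user/data -files -blocks -racks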

