Forums › Hadoop › Ideally, what should be the replication factor in a Hadoop cluster?
September 20, 2018 at 5:48 pm #6348
I am planning to deploy a Hadoop cluster. What factors should I consider while deciding this?
Suppose I want to deploy a cluster of 100 nodes, where each node has 30 TB of disk; regarding data volume, I currently ingest 3 TB of data daily.
How do I decide the replication factor?
September 20, 2018 at 5:49 pm #6350
The default replication factor is 3, and ideally this is what it should be.
Taking an example: in a cluster with 3x replication, if we plan a maintenance activity on one of the three nodes holding a block and another of those nodes suddenly stops working, we still have one node with the data available, which keeps Hadoop fault tolerant.
3x replication also serves the rack-awareness scenario, so a replication factor of 3 works well in all of these situations without over-replicating data.
We can change the replication factor via the `dfs.replication` property in hdfs-site.xml.
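For reference, the cluster-wide default is set in hdfs-site.xml like this (a minimal fragment; `dfs.replication` is the standard HDFS property name):

```xml
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```

Note that this only affects files written after the change; existing files keep the factor they were written with, which you can change per path with `hdfs dfs -setrep -w 3 /some/path`.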
Basic parameters to consider when selecting a replication factor are:
1. The cost of failure of a node
2. The relative probability of failure of a node
3. The cost of replication
To learn more about replication, follow: HDFS Tutorial
September 20, 2018 at 5:49 pm #6351
The ideal replication factor is considered to be 3. Why so? Below are the reasons:
As we know, HDFS is designed to be fault tolerant. To deliver this, HDFS replicates the data sent to it for storage so that the data remains available. But how does simply creating replicas make the data available?
HDFS ensures that the replicas are stored in such a way that if one DataNode fails, the data is still available on another node. For this, HDFS needs to store a copy of the data on a different node.
But what if the entire rack fails? For this, HDFS needs to keep a copy in another rack.
Hence, ideally, HDFS stores one copy of a block on one node of one rack and the other two copies on different nodes of a different rack. This ensures fault tolerance and high availability.
September 20, 2018 at 5:49 pm #6352
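The placement described above can be sketched as follows (a simplified illustration of the idea, not the actual `BlockPlacementPolicyDefault` code; the rack and node names in the example are made up):

```python
import random

def place_replicas(local_node, local_rack, racks):
    """Sketch of HDFS default placement for replication factor 3:
    first replica on the writer's node, second and third on two
    different nodes of one remote rack.

    `racks` maps rack name -> list of node names (hypothetical layout).
    """
    # First replica: the node where the client/writer runs.
    replicas = [(local_rack, local_node)]
    # Pick a different rack for the remaining two replicas.
    remote_rack = random.choice([r for r in racks if r != local_rack])
    # Two distinct nodes on that remote rack.
    n1, n2 = random.sample(racks[remote_rack], 2)
    replicas += [(remote_rack, n1), (remote_rack, n2)]
    return replicas

# Example: writer on node "n1" in "rack1"; the two remaining
# replicas land on two different nodes of "rack2".
racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4", "n5"]}
print(place_replicas("n1", "rack1", racks))
```

This way, losing any single node still leaves two copies, and losing an entire rack still leaves at least one copy on the other rack.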
The replication factor is basically the number of times the Hadoop framework replicates each data block. Blocks are replicated to provide fault tolerance. The default replication factor is 3, which can be configured as per requirement; it can be decreased (for example, to 2) or increased beyond 3.
Increasing or decreasing the replication factor in HDFS has an impact on Hadoop cluster performance.
Now let's think of increasing the replication factor from the default 3 to 5. As the replication factor increases, the NameNode needs to store more metadata about the replicated copies.
So at peak times the NameNode might face a heavy load and might not be able to process requests promptly. We also need to consider the NameNode's capacity (whether its storage is HDD or SSD, its processor, etc.) when setting the replication factor.
So as the replication factor increases, we get better data reliability, which is good, but at the cost of performance and storage. If you have enough resources (storage and processing), then increasing the replication factor can certainly be beneficial.
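To put numbers on the storage cost for the cluster described in the question (100 nodes × 30 TB, 3 TB of ingest per day), here is a quick back-of-the-envelope sketch; the 70% usable-capacity headroom is an assumption, not an HDFS rule:

```python
def retention_days(nodes, disk_tb_per_node, daily_ingest_tb,
                   replication, usable_fraction=0.7):
    """Days of ingest the cluster can hold before filling up.

    usable_fraction leaves headroom for temporary data and
    rebalancing (the 0.7 figure is an assumption).
    """
    raw_tb = nodes * disk_tb_per_node           # total raw capacity
    usable_tb = raw_tb * usable_fraction        # capacity we plan to fill
    tb_per_day = daily_ingest_tb * replication  # each TB is stored N times
    return usable_tb / tb_per_day

# 100 nodes x 30 TB = 3000 TB raw, 2100 TB usable at 70%;
# 3 TB/day x 3 replicas = 9 TB/day, so roughly 233 days of retention.
print(round(retention_days(100, 30, 3, 3)))
```

Raising the replication factor from 3 to 5 with the same hardware cuts that retention window by 40%, which is the storage side of the reliability-versus-cost trade-off described above.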
So what will be an ideal factor?
A replication factor of 3 means you can afford to have one node go out of commission and still have fault tolerance with two replicas left. Your support staff receives an alert that one node is down and needs to respond quickly. If, in the meantime, another node goes down, you still have one functioning replica and your mission-critical application keeps working, but at a critical stage where you cannot afford to lose the last node. By now your support staff, who were already alerted when the first node went down, know from the second failure how critical the situation is and respond accordingly.