Why does HDFS perform replication, although it results in data redundancy?

This topic contains 4 replies, has 1 voice, and was last updated by  dfbdteam3 1 year, 6 months ago.

Viewing 5 posts - 1 through 5 (of 5 total)


    Why does HDFS perform replication, even though it consumes a lot of space?



    Replication is implemented to make HDFS more reliable and fault tolerant.

    Data on a node can become unavailable in several ways: the node goes down, it loses network connectivity, it is physically damaged, or it is intentionally made unavailable during horizontal scaling.
    In any of these cases the data would be unavailable if it had not been replicated. By default, HDFS keeps 3 copies of each data block on different nodes, spread across more than one rack, so the data stays available even if one of the machines is down.
    Even if an entire rack goes down, the data can still be read from a node on a different rack.
    Replication therefore reduces downtime, improves reliability, and makes HDFS fault tolerant. Disk space is now cheap enough that the cost of keeping multiple copies is usually considered acceptable.
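The space cost raised in the question is simple arithmetic, and a minimal sketch may make it concrete. The function names below are hypothetical, and the 128 MiB block size is an assumption (the usual default in recent Hadoop releases; clusters can configure a different value).

```python
def raw_storage_bytes(logical_bytes: int, replication_factor: int = 3) -> int:
    """Disk space consumed when every byte is stored replication_factor times."""
    if logical_bytes < 0 or replication_factor < 1:
        raise ValueError("logical_bytes must be >= 0 and replication_factor >= 1")
    return logical_bytes * replication_factor


def block_replica_count(file_bytes: int,
                        block_bytes: int = 128 * 1024**2,  # assumed block size
                        replication_factor: int = 3) -> int:
    """Total number of block replicas stored for one file."""
    blocks = -(-file_bytes // block_bytes)  # ceiling division
    return blocks * replication_factor


one_gib = 1024**3
print(raw_storage_bytes(one_gib) // one_gib)  # 3 (GiB of raw disk per GiB of data)
print(block_replica_count(one_gib))           # 8 blocks x 3 replicas = 24
```

With a replication factor of 3, the cluster needs three times the logical data size in raw disk, which is the trade-off the answers in this thread describe.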



    HDFS performs replication to provide fault tolerance and to improve data reliability.
    HDFS is all about data, so data loss cannot be tolerated.
    Data can become unavailable for any of the following reasons:
    1) A node is down.
    2) A node has lost network connectivity.
    3) A node is physically damaged.
    4) A node is intentionally made unavailable for horizontal scaling.



    Replicas are critical to the reliability and performance of HDFS. Optimized replica placement distinguishes HDFS from most other distributed file systems.

    This improves:
    1. Data reliability
    2. Availability
    3. Network bandwidth utilization.

    The default replication factor is 3. With the default placement policy in current Hadoop releases:

    1) The first replica is written to the data node where the writing client runs (or to another data node chosen by the NameNode if the client is not running on one), which improves write performance through write affinity.
    2) The second replica is written to a data node in a different rack, so that the data is not lost even if a switch or an entire rack fails (rack awareness).
    3) The third replica is written to a different data node in the same rack as the second, which keeps cross-rack network traffic to a minimum.
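The benefit of spreading three replicas over two racks can be checked with a small model. This is only a sketch: a replica is represented as a hypothetical `(rack_id, node_id)` pair, and the rack and node names are made up for illustration. It does not use any Hadoop API.

```python
from typing import List, Tuple

Replica = Tuple[str, str]  # (rack_id, node_id), hypothetical identifiers


def survives_node_failure(replicas: List[Replica]) -> bool:
    """True if at least one replica remains after any single node fails."""
    return len(set(replicas)) >= 2


def survives_rack_failure(replicas: List[Replica]) -> bool:
    """True if at least one replica remains after any single rack fails."""
    return len({rack for rack, _ in replicas}) >= 2


# Three replicas spread over two racks.
two_racks = [("rackA", "n1"), ("rackB", "n4"), ("rackB", "n5")]
# Three replicas on different nodes of a single rack.
one_rack = [("rackA", "n1"), ("rackA", "n2"), ("rackA", "n3")]

print(survives_node_failure(two_racks), survives_rack_failure(two_racks))  # True True
print(survives_node_failure(one_rack), survives_rack_failure(one_rack))    # True False
```

Keeping all replicas in one rack protects against a node failure but not against a rack or switch failure, which is why the placement policy uses a second rack.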

    HDFS is designed to run on commodity hardware, so storage capacity is increased by adding more nodes (horizontal scaling).
    Processing data in parallel across many nodes can also be much faster than processing it on a single, larger machine (vertical scaling).



    Hadoop is known for handling large volumes of data, and for storing that data in the cluster in a way that is reliable, available, and scalable.
    The key features that keep data highly available and reliable are replication and rack awareness.

    As data is written to HDFS, each block is replicated across the cluster, so that copies of the data are stored on different data nodes. The replication factor is normally 3, a value chosen so that data is neither over-replicated nor under-replicated.

    Replicas are placed according to the rule “no more than 1 copy on a data node and no more than 2 copies in a rack”. So even if a data node goes down, a copy of the block is still present on another node in the same rack or in another rack, and the data remains available.
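The rule quoted above can be written as a small validity check. This is a sketch under stated assumptions: `satisfies_placement_rule` is a hypothetical helper, replicas are modelled as `(rack_id, node_id)` pairs, and the limits are taken directly from the rule in the post rather than from any Hadoop API.

```python
from collections import Counter
from typing import List, Tuple

Replica = Tuple[str, str]  # (rack_id, node_id), hypothetical identifiers


def satisfies_placement_rule(replicas: List[Replica],
                             max_per_node: int = 1,
                             max_per_rack: int = 2) -> bool:
    """Check 'no more than 1 copy per node and no more than 2 per rack'."""
    per_node = Counter(replicas)
    per_rack = Counter(rack for rack, _ in replicas)
    return (all(count <= max_per_node for count in per_node.values())
            and all(count <= max_per_rack for count in per_rack.values()))


ok = [("rackA", "n1"), ("rackB", "n4"), ("rackB", "n5")]
same_node_twice = [("rackA", "n1"), ("rackA", "n1"), ("rackB", "n4")]
three_in_one_rack = [("rackA", "n1"), ("rackA", "n2"), ("rackA", "n3")]

print(satisfies_placement_rule(ok))                 # True
print(satisfies_placement_rule(same_node_twice))    # False
print(satisfies_placement_rule(three_in_one_rack))  # False
```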
