How does HDFS ensure Data Integrity of data blocks stored in HDFS?


  • Author
    Posts
    • #5676
      DataFlair Team
      Spectator

      How is data integrity achieved in HDFS?
      Is data integrity maintained in HDFS?

    • #5679
      DataFlair Team
      Spectator

      Data integrity ensures the correctness of the data. However, data can get corrupted during I/O operations on the disk. Corruption can occur for various reasons, such as faults in a storage device, network faults, or buggy software. The Hadoop HDFS framework implements checksum checking on the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace.

      When the HDFS client retrieves file contents, it first verifies that the data it received from each DataNode matches the checksum stored in the associated checksum file. If it does not, the client can opt to retrieve that block from another DataNode that holds a replica of the block.
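
      To make this concrete, here is a minimal sketch (an illustration, not Hadoop's internal code) of reading a file through the standard org.apache.hadoop.fs.FileSystem API, where checksum verification happens transparently on the client side. The namenode URI and file path are hypothetical placeholders.

      import java.io.InputStream;
      import java.net.URI;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileChecksum;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IOUtils;

      public class ChecksumReadExample {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              // Hypothetical cluster URI and file path.
              FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
              Path file = new Path("/data/example.txt");

              // Checksum verification is on by default; it can be turned off,
              // for example when trying to salvage a partially corrupted file.
              fs.setVerifyChecksum(true);

              // Reading the stream triggers checksum verification in the client;
              // a corrupted block surfaces as org.apache.hadoop.fs.ChecksumException.
              try (InputStream in = fs.open(file)) {
                  IOUtils.copyBytes(in, System.out, 4096, false);
              }

              // A file-level checksum can also be requested, for example to
              // compare the same file across two clusters.
              FileChecksum checksum = fs.getFileChecksum(file);
              System.out.println(checksum);

              fs.close();
          }
      }

      If verification fails during the read, the client receives a ChecksumException and, as described above, can fall back to another DataNode that holds a replica.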

    • #5682
      DataFlair Team
      Spectator

      1) Data integrity means making sure that no data is lost or corrupted during storage or processing of the data.

      2) Since the volume of data written and read in Hadoop is very large, the chance of data corruption is higher.

      3) So in Hadoop a checksum is computed when data is written to the disk for the first time, and checked again when the data is read from the disk. If the recomputed checksum matches the original checksum, the data is considered uncorrupted; otherwise it is considered corrupted.

      4) This is only error detection, not error correction.

      5) It is possible that it is the checksum that is corrupt, not the data, but this is very unlikely, because the checksum is much smaller than the data.

      6) HDFS uses an efficient CRC variant, CRC-32C, to calculate checksums (see the sketch after this list).

      7) DataNodes are responsible for verifying the data they receive before storing the data and its checksum. A checksum is computed for the data they receive from clients and from other DataNodes during replication.

      8) Hadoop can heal corrupted data by copying one of the good replicas to produce a new, uncorrupted replica.

      9) If a client detects an error when reading a block, it reports the bad block and the DataNode it was trying to read from to the NameNode before throwing a ChecksumException.

      10) The NameNode marks the block replica as corrupt so that it does not direct any more clients to it or try to copy this replica to another DataNode.

      11) It schedules a copy of the block to be replicated on another DataNode, so that its replication factor is back at the expected level.

      12) Once this has happened, the corrupt replica is deleted.
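
      The following standalone sketch illustrates points 3 and 6: a CRC-32C checksum is computed per 512-byte chunk when data is written and recomputed on read, and a mismatch signals corruption. It uses plain java.util.zip.CRC32C (Java 9+) on a hypothetical in-memory "block"; it is an illustration of the scheme, not the actual HDFS implementation (whose default chunk size, dfs.bytes-per-checksum, is also 512 bytes).

      import java.util.ArrayList;
      import java.util.List;
      import java.util.zip.CRC32C;

      public class ChunkChecksumSketch {
          static final int BYTES_PER_CHECKSUM = 512;

          // "Write" path: compute one CRC-32C per 512-byte chunk.
          static List<Long> computeChecksums(byte[] data) {
              List<Long> sums = new ArrayList<>();
              CRC32C crc = new CRC32C();
              for (int off = 0; off < data.length; off += BYTES_PER_CHECKSUM) {
                  int len = Math.min(BYTES_PER_CHECKSUM, data.length - off);
                  crc.reset();
                  crc.update(data, off, len);
                  sums.add(crc.getValue());
              }
              return sums;
          }

          // "Read" path: recompute and compare; a mismatch means corruption.
          static boolean verify(byte[] data, List<Long> stored) {
              return computeChecksums(data).equals(stored);
          }

          public static void main(String[] args) {
              byte[] data = new byte[2048];
              for (int i = 0; i < data.length; i++) data[i] = (byte) i;

              List<Long> stored = computeChecksums(data);   // stored alongside the data
              System.out.println("clean read verified: " + verify(data, stored));

              data[700] ^= 1;                               // simulate a flipped bit on disk
              System.out.println("corrupted read verified: " + verify(data, stored));
          }
      }

      Note that, as point 4 says, this only detects corruption; recovery comes from copying a good replica, not from the checksum itself.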
