What is Erasure Coding in Hadoop?

    • #6247
      DataFlair Team
      Spectator

      What is Erasure Coding in Hadoop HDFS?
      Why is Erasure Coding needed in Hadoop?

    • #6248
      DataFlair Team
      Spectator

      In HDFS, the default replication factor is 3, which creates a 200% storage overhead (two extra copies of every block). That extra capacity is often rarely touched. Erasure coding therefore plays an important role: it brings the overhead down to about 50% while the data stays accessible and fault tolerant.

      Replication is very expensive: the same data blocks are stored repeatedly to provide fault tolerance, yet the extra replicas see little or no I/O activity during normal operation.

      Erasure coding saves disk space and so achieves better storage efficiency, while fault tolerance is still maintained.
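
      As a worked example of the overhead arithmetic, here is a minimal Python sketch; the storage_overhead helper and the RS(6,3) layout used below are illustrative assumptions, not HDFS APIs:

      ```python
      def storage_overhead(raw_cells, stored_cells):
          """Extra storage as a percentage of the raw data size."""
          return 100 * (stored_cells - raw_cells) / raw_cells

      # 3-way replication: every block is stored three times.
      print(storage_overhead(raw_cells=1, stored_cells=3))   # 200.0 (%)

      # Reed-Solomon RS(6,3): each stripe of 6 data cells adds 3 parity cells.
      print(storage_overhead(raw_cells=6, stored_cells=9))   # 50.0 (%)
      ```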

      Follow the link for more detail: Erasure coding in Hadoop

    • #6251
      DataFlair Team
      Spectator

      Erasure coding is needed primarily because of the space HDFS uses for replication. By default, HDFS replicates each block three times to provide fault tolerance. Replication also reduces compute and network cost, because tasks can read locally stored replicas of a block. But replication is very expensive: 3x replication adds a 200% overhead in storage space and other resources (network bandwidth, for example). For datasets with low I/O activity, the extra replicas are rarely accessed during normal operation, yet they still consume those resources.

      Therefore, a new feature, erasure coding, can be used in place of replication. It provides the same level of fault tolerance with far less storage, roughly 50% storage overhead. When comparing storage schemes, the important considerations are data durability and storage efficiency: N-way replication tolerates N-1 failures with only 1/N storage efficiency.
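
      A minimal sketch of that trade-off, comparing N-way replication with Reed-Solomon layouts such as RS(6,3) and RS(10,4); the helper functions below are illustrative, not part of any Hadoop API:

      ```python
      def replication(n):
          """N-way replication: tolerates n - 1 failures at 1/n storage efficiency."""
          return {"fault_tolerance": n - 1, "storage_efficiency": 1 / n}

      def erasure_coding(k, m):
          """RS(k, m): tolerates any m lost cells at k/(k+m) storage efficiency."""
          return {"fault_tolerance": m, "storage_efficiency": k / (k + m)}

      print(replication(3))         # 2 failures tolerated, ~33% efficiency
      print(erasure_coding(6, 3))   # 3 failures tolerated, ~67% efficiency
      print(erasure_coding(10, 4))  # 4 failures tolerated, ~71% efficiency
      ```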

      In storage systems, the most notable use of erasure coding is RAID (Redundant Array of Inexpensive Disks). RAID implements EC through striping: logically sequential data is divided into smaller units (cells), and consecutive units are stored on different disks. For each stripe of original data cells, a certain number of parity cells are then calculated and stored; this process is called encoding. An error on any striping cell can be recovered through a decoding calculation based on the surviving data and parity cells. The data cells and parity cells together are called an erasure coding group, or stripe.
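
      A minimal sketch of the striping layout only (parity computation is shown in the XOR and Reed-Solomon examples below); the cell size, stripe width, and the labelling of targets as "disks" are made-up illustration values:

      ```python
      CELL_SIZE = 4      # bytes per cell (toy value; real systems use far larger cells)
      STRIPE_WIDTH = 3   # data cells per stripe, i.e. number of data disks

      def stripe_layout(data: bytes):
          """Divide logically sequential data into cells and place consecutive
          cells on different disks, one stripe (row of cells) at a time."""
          cells = [data[i:i + CELL_SIZE] for i in range(0, len(data), CELL_SIZE)]
          layout = {}
          for index, cell in enumerate(cells):
              stripe, disk = divmod(index, STRIPE_WIDTH)
              layout[(stripe, disk)] = cell
          return layout

      for (stripe, disk), cell in stripe_layout(b"abcdefghijklmnopqr").items():
          print(f"stripe {stripe}, disk {disk}: {cell}")
      ```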

      There are two algorithms available for erasure coding: the XOR algorithm and the Reed-Solomon algorithm.

      XOR algorithm: This is the simplest form of erasure coding. If X, Y and Z are data cells, the parity cell is the XOR of the three: X ⊕ Y ⊕ Z. Only one parity cell is generated, so if any one cell is lost it can be recovered from the remaining data cells and the parity cell. The fault tolerance is therefore limited to 1.
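
      A minimal sketch of XOR parity and recovery, assuming single-byte cells for readability (real cells are much larger, but the byte-wise logic is the same):

      ```python
      # Three data cells and one parity cell, each a single byte here for simplicity.
      x, y, z = 0x12, 0x34, 0x56
      parity = x ^ y ^ z          # encoding: one parity cell for the stripe

      # Suppose cell y is lost: XOR of the survivors and the parity rebuilds it,
      # because a ^ a == 0 cancels the known cells out.
      recovered_y = x ^ z ^ parity
      assert recovered_y == y     # at most one lost cell can be rebuilt
      ```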

      Reed-Solomon algorithm: The limitation of the XOR algorithm is addressed by Reed-Solomon (RS). RS uses two parameters, K and M: K is the number of data cells and M is the number of parity cells. RS works by multiplying the K data cells with a generator matrix to produce an extended codeword of K data cells and M parity cells. The fault tolerance is up to M lost cells.
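
      As a hedged illustration of the K/M idea, here is a toy, non-systematic Reed-Solomon sketch over the small prime field GF(257). HDFS actually uses a systematic code over GF(2^8) (for example RS(6,3)), so the field, the encoding style, and the helper names below are assumptions made purely for readability:

      ```python
      P = 257  # toy prime field; real implementations work over GF(2^8)

      def rs_encode(data, m):
          """Treat the k data cells as polynomial coefficients and evaluate the
          polynomial at k + m distinct points; any k evaluations recover the data."""
          k, n = len(data), len(data) + m
          return [sum(c * pow(x, j, P) for j, c in enumerate(data)) % P
                  for x in range(n)]

      def rs_recover(survivors, k):
          """Rebuild the k data cells from any k surviving (index, value) pairs by
          solving the Vandermonde system mod P with Gauss-Jordan elimination."""
          rows = [[pow(x, j, P) for j in range(k)] + [y] for x, y in survivors[:k]]
          for col in range(k):
              piv = next(r for r in range(col, k) if rows[r][col])
              rows[col], rows[piv] = rows[piv], rows[col]
              inv = pow(rows[col][col], P - 2, P)        # modular inverse
              rows[col] = [v * inv % P for v in rows[col]]
              for r in range(k):
                  if r != col and rows[r][col]:
                      f = rows[r][col]
                      rows[r] = [(a - f * b) % P for a, b in zip(rows[r], rows[col])]
          return [row[k] for row in rows]

      data = [10, 20, 30, 40, 50, 60]            # K = 6 data cells
      codeword = rs_encode(data, m=3)            # 9 coded cells; any 6 of them suffice
      lost = {0, 4, 7}                           # lose any up to M = 3 cells
      survivors = [(i, v) for i, v in enumerate(codeword) if i not in lost]
      assert rs_recover(survivors, k=6) == data  # the stripe is fully recovered
      ```

      With K = 6 and M = 3 this mirrors a common RS(6,3) layout: any three cells of the nine-cell group may be lost and the stripe is still recoverable, at 50% storage overhead.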

      Follow the link for more detail: Erasure coding in Hadoop
