What is Erasure Coding in Hadoop?

    • #6247
      DataFlair Team
      Spectator

      What is Erasure Coding in Hadoop HDFS?
      Why is Erasure Coding needed in Hadoop?

    • #6248
      DataFlair Team
      Spectator

      In HDFS, the default replication factor is 3, which creates a 200% storage overhead (two extra copies of every block). That extra capacity is often rarely touched. Erasure coding therefore plays an important role: it brings the overhead down to about 50% while the data stays accessible and fault tolerant.

      Replication is very expensive: the same data blocks are stored repeatedly to provide fault tolerance, yet the extra replicas see little or no I/O activity during normal operation.

      Erasure coding saves disk space and so achieves better storage efficiency, while fault tolerance is still maintained.
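
      As a worked example of the overhead arithmetic, here is a minimal Python sketch; the storage_overhead helper and the RS(6,3) layout used below are illustrative assumptions, not HDFS APIs:

      ```python
      def storage_overhead(raw_cells, stored_cells):
          """Extra storage as a percentage of the raw data size."""
          return 100 * (stored_cells - raw_cells) / raw_cells

      # 3-way replication: every block is stored three times.
      print(storage_overhead(raw_cells=1, stored_cells=3))   # 200.0 (%)

      # Reed-Solomon RS(6,3): each stripe of 6 data cells adds 3 parity cells.
      print(storage_overhead(raw_cells=6, stored_cells=9))   # 50.0 (%)
      ```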

      Follow the link for more detail: Erasure coding in Hadoop

    • #6251
      DataFlair Team
      Spectator

      Erasure coding is needed primarily because of the space HDFS uses for replication. By default, HDFS replicates each block three times to provide fault tolerance. Replication also reduces compute and network cost, because tasks can read locally stored replicas of a block. But replication is very expensive: 3x replication adds a 200% overhead in storage space and other resources (network bandwidth, for example). For datasets with low I/O activity, the extra replicas are rarely accessed during normal operation, yet they still consume those resources.

      Therefore, a new feature, erasure coding, can be used in place of replication. It provides the same level of fault tolerance with far less storage, roughly 50% storage overhead. When comparing storage schemes, the important considerations are data durability and storage efficiency: N-way replication tolerates N-1 failures with only 1/N storage efficiency.
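
      A minimal sketch of that trade-off, comparing N-way replication with Reed-Solomon layouts such as RS(6,3) and RS(10,4); the helper functions below are illustrative, not part of any Hadoop API:

      ```python
      def replication(n):
          """N-way replication: tolerates n - 1 failures at 1/n storage efficiency."""
          return {"fault_tolerance": n - 1, "storage_efficiency": 1 / n}

      def erasure_coding(k, m):
          """RS(k, m): tolerates any m lost cells at k/(k+m) storage efficiency."""
          return {"fault_tolerance": m, "storage_efficiency": k / (k + m)}

      print(replication(3))         # 2 failures tolerated, ~33% efficiency
      print(erasure_coding(6, 3))   # 3 failures tolerated, ~67% efficiency
      print(erasure_coding(10, 4))  # 4 failures tolerated, ~71% efficiency
      ```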

      In storage systems, the most notable use of erasure coding is RAID (Redundant Array of Inexpensive Disks). RAID implements EC through striping: logically sequential data is divided into smaller units (cells), and consecutive units are stored on different disks. For each stripe of original data cells, a certain number of parity cells are then calculated and stored; this process is called encoding. An error on any striping cell can be recovered through a decoding calculation based on the surviving data and parity cells. The data cells and parity cells together are called an erasure coding group, or stripe.
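
      A minimal sketch of the striping layout only (parity computation is shown in the XOR and Reed-Solomon examples below); the cell size, stripe width, and the labelling of targets as "disks" are made-up illustration values:

      ```python
      CELL_SIZE = 4      # bytes per cell (toy value; real systems use far larger cells)
      STRIPE_WIDTH = 3   # data cells per stripe, i.e. number of data disks

      def stripe_layout(data: bytes):
          """Divide logically sequential data into cells and place consecutive
          cells on different disks, one stripe (row of cells) at a time."""
          cells = [data[i:i + CELL_SIZE] for i in range(0, len(data), CELL_SIZE)]
          layout = {}
          for index, cell in enumerate(cells):
              stripe, disk = divmod(index, STRIPE_WIDTH)
              layout[(stripe, disk)] = cell
          return layout

      for (stripe, disk), cell in stripe_layout(b"abcdefghijklmnopqr").items():
          print(f"stripe {stripe}, disk {disk}: {cell}")
      ```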

      There are two algorithms available for erasure coding: the XOR algorithm and the Reed-Solomon algorithm.

      XOR algorithm: This is the simplest form of erasure coding. If X, Y and Z are data cells, the parity cell is the XOR of the three: X ⊕ Y ⊕ Z. Only one parity cell is generated, so if any one cell is lost it can be recovered from the remaining data cells and the parity cell. The fault tolerance is therefore limited to 1.
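
      A minimal sketch of XOR parity and recovery, assuming single-byte cells for readability (real cells are much larger, but the byte-wise logic is the same):

      ```python
      # Three data cells and one parity cell, each a single byte here for simplicity.
      x, y, z = 0x12, 0x34, 0x56
      parity = x ^ y ^ z          # encoding: one parity cell for the stripe

      # Suppose cell y is lost: XOR of the survivors and the parity rebuilds it,
      # because a ^ a == 0 cancels the known cells out.
      recovered_y = x ^ z ^ parity
      assert recovered_y == y     # at most one lost cell can be rebuilt
      ```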

      Reed-Solomon algorithm: The limitation of the XOR algorithm is addressed by Reed-Solomon (RS). RS uses two parameters, K and M: K is the number of data cells and M is the number of parity cells. RS works by multiplying the K data cells with a generator matrix to produce an extended codeword of K data cells and M parity cells. The fault tolerance is up to M lost cells.
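
      As a hedged illustration of the K/M idea, here is a toy, non-systematic Reed-Solomon sketch over the small prime field GF(257). HDFS actually uses a systematic code over GF(2^8) (for example RS(6,3)), so the field, the encoding style, and the helper names below are assumptions made purely for readability:

      ```python
      P = 257  # toy prime field; real implementations work over GF(2^8)

      def rs_encode(data, m):
          """Treat the k data cells as polynomial coefficients and evaluate the
          polynomial at k + m distinct points; any k evaluations recover the data."""
          k, n = len(data), len(data) + m
          return [sum(c * pow(x, j, P) for j, c in enumerate(data)) % P
                  for x in range(n)]

      def rs_recover(survivors, k):
          """Rebuild the k data cells from any k surviving (index, value) pairs by
          solving the Vandermonde system mod P with Gauss-Jordan elimination."""
          rows = [[pow(x, j, P) for j in range(k)] + [y] for x, y in survivors[:k]]
          for col in range(k):
              piv = next(r for r in range(col, k) if rows[r][col])
              rows[col], rows[piv] = rows[piv], rows[col]
              inv = pow(rows[col][col], P - 2, P)        # modular inverse
              rows[col] = [v * inv % P for v in rows[col]]
              for r in range(k):
                  if r != col and rows[r][col]:
                      f = rows[r][col]
                      rows[r] = [(a - f * b) % P for a, b in zip(rows[r], rows[col])]
          return [row[k] for row in rows]

      data = [10, 20, 30, 40, 50, 60]            # K = 6 data cells
      codeword = rs_encode(data, m=3)            # 9 coded cells; any 6 of them suffice
      lost = {0, 4, 7}                           # lose any up to M = 3 cells
      survivors = [(i, v) for i, v in enumerate(codeword) if i not in lost]
      assert rs_recover(survivors, k=6) == data  # the stripe is fully recovered
      ```

      With K = 6 and M = 3 this mirrors a common RS(6,3) layout: any three cells of the nine-cell group may be lost and the stripe is still recoverable, at 50% storage overhead.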

      Follow the link for more detail: Erasure coding in Hadoop
