Explain Erasure Coding in Hadoop?

  • Author
    Posts
    • #4851
      DataFlair Team
      Spectator

      What is Erasure Coding in Hadoop HDFS?
      Why Erasure Coding is needed in Hadoop?

    • #4852
      DataFlair Team
      Spectator

      Erasure Coding is a technique for reducing storage overhead in Hadoop HDFS. It is under development (in the alpha phase) for Hadoop 3.0.

      To understand what Erasure Coding is, we first need to look at its driving factors:

      In the current version of Hadoop (i.e. 2.x), the default Replication Factor is 3. For each data block there are therefore 2 additional replicated blocks, which means a storage overhead of 200%. This is made worse by the fact that more than 80 percent of the raw data stored in Hadoop is cold data, i.e. data that is rarely, if ever, accessed. In a nutshell, the current HDFS architecture is not at all storage friendly.

      Solution:
      To solve the above problem, the Erasure Coding (EC) technique comes to the rescue in the Hadoop 3.0 HDFS architecture: it provides the same level of fault tolerance with much less storage overhead.

      Let’s take one example:

      Suppose we have a file which consists of 2 Blocks (B1 and B2).

      1) With the current HDFS setup, we will have 2×3 = 6 blocks in total.
      For Block B1 -> B1.1, B1.2, B1.3
      For Block B2 -> B2.1, B2.2, B2.3

      2) With the EC setup, we will have 2×2 + 2/2 = 5 blocks in total.
      For Block B1 -> B1.1, B1.2
      For Block B2 -> B2.1, B2.2
      The 3rd copy of each block is replaced by a single parity block, computed as (B1.1 xor B2.1) -> Bp

      In this setup:
      If B1.1 is corrupted, we can recompute B1.1 = Bp xor B2.1
      If B2.1 is corrupted, we can recompute B2.1 = Bp xor B1.1
      If both B1.1 and B2.1 are corrupted, we still have another copy of each block (B1.2 and B2.2)
      If parity Block Bp is corrupted, then it is again recomputed as B1.1 xor B2.1
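
      To make the XOR recovery above concrete, here is a minimal Python sketch. Single integers stand in for block contents, which is purely illustrative: real HDFS blocks are up to 128 MB, and production HDFS erasure coding uses Reed-Solomon codes rather than plain XOR.

      b1 = 0b10110100   # stands in for block copy B1.1
      b2 = 0b01101001   # stands in for block copy B2.1

      bp = b1 ^ b2      # parity block Bp = B1.1 xor B2.1

      # If B1.1 is lost, recompute it from the parity and the surviving block:
      assert bp ^ b2 == b1

      # Likewise, if B2.1 is lost:
      assert bp ^ b1 == b2

      # And if the parity block itself is lost, it is simply recomputed:
      assert b1 ^ b2 == bp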

      In general, with this EC technique, if the Replication Factor is R, then for N blocks (with one parity block per pair of blocks) the total block count is:

      N*(R-1) + (N/2)
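
      As a quick sanity check of this formula, here is a tiny Python helper (the name total_blocks is mine, not Hadoop's) that reproduces the 5-block count from the two-block example above:

      def total_blocks(n, r):
          # n blocks kept (r - 1) times each, plus one parity block
          # per pair of blocks, i.e. n / 2 parity blocks
          return n * (r - 1) + n // 2

      assert total_blocks(2, 3) == 5   # the B1/B2 example: 5 blocks in total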

      Example Storage Savings (with File Size 100 TB and 128 MB Block size):

      • Hadoop 2.x : Total Blocks = 3 * (100*1024*1024 / 128) = 2457600
      • Hadoop 3.0 : Total Blocks = 2 * (100*1024*1024 / 128) + (100*1024*1024 / 256) = 1638400 + 409600 = 2048000

      That’s a whopping saving of 409600 blocks, i.e. 50 TB of raw storage, for a single file. So imagine how much can be saved in total!
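
      The arithmetic behind these numbers can be checked with a short, self-contained Python snippet (variable names are illustrative):

      FILE_SIZE_MB  = 100 * 1024 * 1024   # 100 TB expressed in MB
      BLOCK_SIZE_MB = 128

      data_blocks = FILE_SIZE_MB // BLOCK_SIZE_MB       # 819200 blocks

      hadoop2 = 3 * data_blocks                         # 2457600 blocks
      hadoop3 = 2 * data_blocks + data_blocks // 2      # 2048000 blocks

      saved_blocks = hadoop2 - hadoop3                  # 409600 blocks
      saved_tb = saved_blocks * BLOCK_SIZE_MB / (1024 * 1024)
      print(saved_blocks, saved_tb)                     # 409600 blocks = 50.0 TB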

