Which compression is used in Hadoop? Which of them is splittable?


  • Author
    Posts
    • #4870
      DataFlair Team
      Spectator

       

      Explain Data Compression in Hadoop?

       

    • #4872
      DataFlair Team
      Spectator

      The four most widely used compression formats in Hadoop are:

      1) GZIP

      • Provides a high compression ratio.
      • Uses significant CPU to compress and decompress data.
      • A good choice for cold data that is accessed infrequently.
      • Compressed data is not splittable, so a single .gz file cannot be processed in parallel by MapReduce.
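      To make GZIP's trade-off concrete outside Hadoop, here is a minimal sketch in plain Python (stdlib gzip, on a made-up repetitive sample standing in for cold log data); it illustrates the high ratio and the whole-stream decoding that makes .gz files non-splittable:

```python
import gzip

# Made-up repetitive sample standing in for "cold" log data.
original = b"hadoop log line: status=OK\n" * 10_000

# The highest compression level trades CPU time for a better ratio.
compressed = gzip.compress(original, compresslevel=9)
ratio = len(original) / len(compressed)

# Round trip restores the exact original bytes.
assert gzip.decompress(compressed) == original

# A .gz stream must be decoded from its start; a map task handed a
# split from the middle of the file has no valid place to begin,
# which is why GZIP output is not splittable.
print(f"original={len(original):,} B, compressed={len(compressed):,} B")
```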

      2) BZIP2

      • Provides an even higher compression ratio than GZIP.
      • Takes a long time to compress and decompress data.
      • A good choice for cold data that is accessed infrequently.
      • Compressed data is splittable.
      • Even though the compressed data is splittable, it is generally not suited for MapReduce jobs because of the high compression/decompression time.
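      The ratio-versus-CPU trade-off between the two formats can be sketched with the stdlib gzip and bz2 modules on a made-up repetitive sample; this is an illustration, not a benchmark:

```python
import bz2
import gzip
import time

# Made-up repetitive sample; real cold data will compress less dramatically.
original = b"hadoop log line: status=OK\n" * 10_000

t0 = time.perf_counter()
gz = gzip.compress(original, compresslevel=9)
t_gz = time.perf_counter() - t0

t0 = time.perf_counter()
bz = bz2.compress(original, compresslevel=9)
t_bz = time.perf_counter() - t0

# Both round-trip losslessly; bzip2 typically yields the smaller
# output while spending more CPU time than gzip.
assert gzip.decompress(gz) == original
assert bz2.decompress(bz) == original
print(f"gzip:  {len(gz):,} B in {t_gz * 1000:.1f} ms")
print(f"bzip2: {len(bz):,} B in {t_bz * 1000:.1f} ms")
```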

      3) LZO

      • Provides a low compression ratio.
      • Very fast at compressing and decompressing data.
      • Compressed data is splittable if the file is indexed in a separate preprocessing step.
      • Well suited for MapReduce jobs because it is fast and, once indexed, splittable.

      4) SNAPPY

      • Provides an average compression ratio.
      • Aimed at very fast compression and decompression.
      • Compressed data is not splittable when Snappy is applied to a plain file such as a .txt file.
      • Generally used with container file formats such as Avro and SequenceFile, because the compressed blocks inside a container file can still be split.
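      One common way to apply Snappy to SequenceFile job output is through job configuration. A minimal sketch, assuming the current `mapreduce.*` property names and the built-in SnappyCodec (older clusters use `mapred.*` equivalents):

```properties
# Enable compressed job output.
mapreduce.output.fileoutputformat.compress=true
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec
# BLOCK compression applies to SequenceFile output; compressing blocks
# rather than the whole file is what keeps the output splittable.
mapreduce.output.fileoutputformat.compress.type=BLOCK
```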