Which compression is used in Hadoop? Which of them is splittable?


  • Author
    Posts
    • #4870
      DataFlair Team
      Spectator

       

      Explain Data Compression in Hadoop?

       

    • #4872
      DataFlair Team
      Spectator

      The four most widely used compression formats in Hadoop are:

      1) GZIP

      • Provides a high compression ratio.
      • Uses significant CPU to compress and decompress data.
      • A good choice for cold data that is accessed infrequently.
      • Compressed data is not splittable, so a single .gz file cannot be processed in parallel by MapReduce.
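      To make GZIP's trade-off concrete outside Hadoop, here is a minimal sketch in plain Python (stdlib gzip, on a made-up repetitive sample standing in for cold log data); it illustrates the high ratio and the whole-stream decoding that makes .gz files non-splittable:

```python
import gzip

# Made-up repetitive sample standing in for "cold" log data.
original = b"hadoop log line: status=OK\n" * 10_000

# The highest compression level trades CPU time for a better ratio.
compressed = gzip.compress(original, compresslevel=9)
ratio = len(original) / len(compressed)

# Round trip restores the exact original bytes.
assert gzip.decompress(compressed) == original

# A .gz stream must be decoded from its start; a map task handed a
# split from the middle of the file has no valid place to begin,
# which is why GZIP output is not splittable.
print(f"original={len(original):,} B, compressed={len(compressed):,} B")
```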

      2) BZIP2

      • Provides an even higher compression ratio than GZIP.
      • Takes a long time to compress and decompress data.
      • A good choice for cold data that is accessed infrequently.
      • Compressed data is splittable.
      • Even though the compressed data is splittable, it is generally not suited for MapReduce jobs because of the high compression/decompression time.
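      The ratio-versus-CPU trade-off between the two formats can be sketched with the stdlib gzip and bz2 modules on a made-up repetitive sample; this is an illustration, not a benchmark:

```python
import bz2
import gzip
import time

# Made-up repetitive sample; real cold data will compress less dramatically.
original = b"hadoop log line: status=OK\n" * 10_000

t0 = time.perf_counter()
gz = gzip.compress(original, compresslevel=9)
t_gz = time.perf_counter() - t0

t0 = time.perf_counter()
bz = bz2.compress(original, compresslevel=9)
t_bz = time.perf_counter() - t0

# Both round-trip losslessly; bzip2 typically yields the smaller
# output while spending more CPU time than gzip.
assert gzip.decompress(gz) == original
assert bz2.decompress(bz) == original
print(f"gzip:  {len(gz):,} B in {t_gz * 1000:.1f} ms")
print(f"bzip2: {len(bz):,} B in {t_bz * 1000:.1f} ms")
```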

      3) LZO

      • Provides a low compression ratio.
      • Very fast at compressing and decompressing data.
      • Compressed data is splittable if the file is indexed in a separate preprocessing step.
      • Well suited for MapReduce jobs because it is fast and, once indexed, splittable.

      4) SNAPPY

      • Provides an average compression ratio.
      • Aimed at very fast compression and decompression.
      • Compressed data is not splittable when Snappy is applied to a plain file such as a .txt file.
      • Generally used with container file formats such as Avro and SequenceFile, because the compressed blocks inside a container file can still be split.
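      One common way to apply Snappy to SequenceFile job output is through job configuration. A minimal sketch, assuming the current `mapreduce.*` property names and the built-in SnappyCodec (older clusters use `mapred.*` equivalents):

```properties
# Enable compressed job output.
mapreduce.output.fileoutputformat.compress=true
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec
# BLOCK compression applies to SequenceFile output; compressing blocks
# rather than the whole file is what keeps the output splittable.
mapreduce.output.fileoutputformat.compress.type=BLOCK
```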