Which compression is used in Hadoop? Which of them is splittable?
September 20, 2018 at 12:33 pm · #4870 · DataFlair Team
Explain data compression in Hadoop. Which formats are used, and which of them are splittable?
September 20, 2018 at 12:33 pm · #4872 · DataFlair Team
The four most widely used compression formats in Hadoop are:
1) GZIP
- Provides a high compression ratio.
- Uses significant CPU to compress and decompress data.
- A good choice for cold data that is accessed infrequently.
- Compressed files are not splittable, so an entire file must be processed by a single mapper; hence it is not well suited for MapReduce jobs over large files.
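The non-splittable point can be demonstrated outside Hadoop with a small Python sketch (an illustration only; Hadoop itself uses GzipCodec, not these stdlib calls). A reader that starts at byte 0 succeeds, but a reader handed the middle of the file, as a MapReduce input split would be, cannot decode it:

```python
import gzip
import zlib

data = b"line of sample text\n" * 5000
compressed = gzip.compress(data)

# Reading from the start of the stream works as expected.
assert gzip.decompress(compressed) == data

# Reading from an arbitrary mid-file offset fails: a gzip/DEFLATE stream
# has no markers that let a reader resynchronize partway through, which is
# exactly why a gzip file cannot be divided into independent input splits.
try:
    zlib.decompressobj(wbits=31).decompress(compressed[len(compressed) // 2:])
    mid_file_readable = True
except zlib.error:
    mid_file_readable = False

print("mid-file readable:", mid_file_readable)
```

This is why a single large .gz file in HDFS forces one mapper to read the whole thing, losing data locality for all blocks after the first.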
2) BZIP2
- Provides a high compression ratio (even higher than GZIP).
- Takes a long time to compress and decompress data.
- A good choice for cold data that is accessed infrequently.
- Compressed data is splittable.
- Even though the compressed data is splittable, it is generally not well suited for MapReduce jobs because of the high compression/decompression CPU cost.
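The ratio-versus-CPU trade-off can be sketched with the Python stdlib (illustration only; absolute numbers depend entirely on the data, and here repetitive log-style text stands in for typical HDFS cold data):

```python
import bz2
import gzip
import time

data = b"2018-09-20 12:33:01 INFO mapreduce.Job: map 50% reduce 0%\n" * 5000

t0 = time.perf_counter()
gz = gzip.compress(data)
gzip_secs = time.perf_counter() - t0

t0 = time.perf_counter()
bz = bz2.compress(data)
bzip2_secs = time.perf_counter() - t0

# Both shrink the input; bzip2 typically (though not always) achieves the
# higher ratio, and usually at a noticeably higher CPU cost.
print(f"original={len(data)}  gzip={len(gz)} ({gzip_secs:.3f}s)  "
      f"bzip2={len(bz)} ({bzip2_secs:.3f}s)")
```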
3) LZO
- Provides a low compression ratio.
- Very fast at compressing and decompressing data.
- Compressed data is splittable if the files are indexed first with an appropriate indexing tool.
- Well suited for MapReduce jobs because it is fast and, once indexed, splittable.
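As a sketch of how LZO is typically wired up (assuming the third-party hadoop-lzo library; the com.hadoop.compression.lzo class names below come from that project, not core Hadoop, and the library must be on the cluster classpath):

```xml
<!-- core-site.xml: register the LZO codecs (assumes hadoop-lzo is installed) -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```

After configuration, each .lzo file must still be indexed (hadoop-lzo ships an indexer for this) so that input splits can be aligned with LZO block boundaries; without the index, the file falls back to a single split.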
4) SNAPPY
- Provides an average compression ratio.
- Aims for very fast compression and decompression.
- Compressed data is not splittable when Snappy is applied to a plain file such as a .txt file.
- Generally used with container file formats such as Avro and SequenceFile, because the blocks inside a compressed container file can still be split.
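A common use of Snappy is compressing intermediate map output, where speed matters far more than ratio and splittability is irrelevant. A minimal sketch using the standard MapReduce property names (assumes the native Snappy libraries are installed on the cluster):

```xml
<!-- mapred-site.xml: compress intermediate map output with Snappy -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```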