How to compress the data in HDFS

    • #6062
      DataFlair Team
      Spectator

      How do we compress data in HDFS, and what are the benefits of compression? Can we process compressed data using MapReduce, or do we need to uncompress it before processing? Which is the best compression codec?

    • #6065
      DataFlair Team
      Spectator

      How to compress the data in HDFS and the benefits of compression
      In Hadoop we store large files, and compressing the data reduces the file size, hence lower disk usage.
      Although Hadoop can run on commodity hardware, disk space is still not free.
      Another advantage is that compression speeds up data transfer/movement within the cluster.
      Both of these savings are significant, but one needs to decide carefully whether compression suits the dataset at hand. For example, if the data is already compressed (say, a JPEG image file), the "compressed" output will be larger than the original; in such a scenario compression is not recommended.

      There are various compression schemes, called codecs, available in Hadoop. Below is the list (a short usage sketch follows it):

      Gzip: Creates a file with the .gz extension; the gunzip command is used to decompress it.
      Bzip2: Better compression than gzip but very slow. Of all the codecs available in Hadoop, bzip2 is the slowest. Use it only for archives that will rarely be read and where disk space is a concern.
      Snappy: Fastest decompression speed. Recommended for data that will be queried often.
      LZO: Similar to Snappy, with a modest compression ratio but fast compression and decompression speeds. It supports splittable compression, which enables parallel processing of compressed text file splits by MapReduce jobs.
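
      As an illustration, here is a minimal sketch of compressing a local file into HDFS with GzipCodec; the default Configuration and the two path arguments are assumptions for the example, not part of the answer above.

      import java.io.InputStream;
      import java.io.OutputStream;
      import java.nio.file.Files;
      import java.nio.file.Paths;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IOUtils;
      import org.apache.hadoop.io.compress.CompressionCodec;
      import org.apache.hadoop.io.compress.GzipCodec;
      import org.apache.hadoop.util.ReflectionUtils;

      public class HdfsCompressWrite {
          public static void main(String[] args) throws Exception {
              String localInput = args[0];   // e.g. a local log file (hypothetical)
              String hdfsOutput = args[1];   // e.g. /data/events.log.gz (hypothetical)

              Configuration conf = new Configuration();
              FileSystem fs = FileSystem.get(conf);

              // GzipCodec wraps the HDFS output stream and compresses on the fly
              CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

              try (InputStream in = Files.newInputStream(Paths.get(localInput));
                   OutputStream out = codec.createOutputStream(fs.create(new Path(hdfsOutput)))) {
                  IOUtils.copyBytes(in, out, 4096, false);  // copy and compress
              }
          }
      }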
      Can we process compressed data using MapReduce, or do we need to uncompress it before processing?
      Yes, we can process compressed data using MapReduce. MapReduce automatically decompresses the input while reading it, using the file name extension to determine which codec to use.
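
      In the other direction, a job can also be asked to compress what it writes. A minimal sketch, assuming a trivial identity mapper and Snappy for the output (the job name and paths are illustrative only):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.io.compress.SnappyCodec;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class CompressedOutputJob {
          public static void main(String[] args) throws Exception {
              Job job = Job.getInstance(new Configuration(), "compressed-output");
              job.setJarByClass(CompressedOutputJob.class);
              job.setMapperClass(Mapper.class);            // identity mapper, for illustration
              job.setOutputKeyClass(LongWritable.class);
              job.setOutputValueClass(Text.class);

              // Compressed input (e.g. .gz files) needs no special handling here:
              // the framework decompresses it transparently based on the extension.
              FileInputFormat.addInputPath(job, new Path(args[0]));
              FileOutputFormat.setOutputPath(job, new Path(args[1]));

              // Compress the job output with Snappy (requires the Hadoop native libraries)
              FileOutputFormat.setCompressOutput(job, true);
              FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);

              System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
      }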

      Which is the best compression codec?
      Choosing a data compression format depends on the use case (cluster design, data). For example, if processing speed is important, we should use a codec with a high decompression speed.

    • #6067
      DataFlair Team
      Spectator

      File compression helps to reduce the size of the file. Hence, in Hadoop, it reduces the space needed for storing large files.
      Another advantage of compression is that it speeds up data transfer across the network in the cluster.
      While dealing with large volumes of files, these savings can make a significant difference.
      We have to decide whether compression is required or not depending on the data available. For example, if the data is already compressed, compressing it again will make it even larger than it was; in such cases, compression is not recommended.
      Hadoop provides various compression tools called ‘codecs’. A codec is the implementation of a compression-decompression algorithm.
      Below are the codec specifications:

      GZIP: Compresses and creates a file with the .gz extension; the gunzip command is used to decompress it. It provides a higher compression ratio (in terms of size).
      BZip2: It provides a higher compression ratio than GZIP, but it is very slow; in fact it is the slowest of all the codecs in Hadoop. Use it only for archival data that is rarely accessed and where disk space is a concern.
      Snappy: It provides a fast compression/decompression rate, and hence is a better choice for hot data that is accessed frequently (see the configuration sketch after this list).
      LZO: Similar to Snappy, with a medium compression ratio but a fast compression/decompression rate. In addition, it supports splittable compression once the files are indexed, which enables parallel processing of compressed text file splits via MapReduce jobs.
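
      Because of that speed, a fast codec such as Snappy is also commonly enabled for intermediate (map output) data. This is only a hedged sketch using the standard Hadoop 2+ property names, not something the answer above prescribes:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.io.compress.CompressionCodec;
      import org.apache.hadoop.io.compress.SnappyCodec;

      public class MapOutputCompression {
          public static Configuration withSnappyMapOutput() {
              Configuration conf = new Configuration();
              // Compress the intermediate map output that is shuffled to the reducers
              conf.setBoolean("mapreduce.map.output.compress", true);
              conf.setClass("mapreduce.map.output.compress.codec",
                      SnappyCodec.class, CompressionCodec.class);
              return conf;
          }
      }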

      Can we process compressed data using MapReduce, or do we need to uncompress it before processing?
      Yes, we can process compressed data using MapReduce.
      If the input files are compressed, they will be decompressed automatically as they are read by MapReduce, using the filename extension to determine which codec to use. For example, a file ending in .gz is identified as a gzip-compressed file and is therefore read with GzipCodec.
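
      The same extension-to-codec mapping is exposed through CompressionCodecFactory, so the lookup can also be done by hand when reading a file outside of MapReduce. A minimal sketch (the URI argument is illustrative):

      import java.io.InputStream;
      import java.net.URI;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IOUtils;
      import org.apache.hadoop.io.compress.CompressionCodec;
      import org.apache.hadoop.io.compress.CompressionCodecFactory;

      public class ReadByExtension {
          public static void main(String[] args) throws Exception {
              String uri = args[0];   // e.g. hdfs://namenode/data/events.log.gz (hypothetical)
              Configuration conf = new Configuration();
              FileSystem fs = FileSystem.get(URI.create(uri), conf);
              Path path = new Path(uri);

              // The factory maps .gz -> GzipCodec, .bz2 -> BZip2Codec, and so on
              CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path);

              try (InputStream in = (codec == null)
                      ? fs.open(path)                             // not compressed
                      : codec.createInputStream(fs.open(path))) { // decompress on read
                  IOUtils.copyBytes(in, System.out, 4096, false);
              }
          }
      }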

      Which is the best compression codec?
      Choosing the best compression codec depends on our needs/use case, such as the data, the available space, or the required speed.
      For example, if size is the main concern, we can go with codecs that provide a higher compression ratio.
