How to compress the data in HDFS

    • #6062
      DataFlair Team
      Spectator

      How do we compress data in HDFS, and what are the benefits of compression? Can we process compressed data using MapReduce, or do we need to uncompress it before processing? Which is the best compression codec?

    • #6065
      DataFlair Team
      Spectator

      How to compress the data in HDFS and the benefits of compression
      In Hadoop we store large files, and compressing the data reduces the file size, hence lower disk usage.
      Although Hadoop can run on commodity hardware, disk space is still not free.
      Another advantage is that compression speeds up data transfer/movement within the cluster.
      Both of these savings are significant, but one needs to decide carefully whether compression suits the dataset at hand. For example, if the data is already compressed (say, a JPEG image file), the "compressed" output will be larger than the original; in such a scenario compression is not recommended.

      There are various compression schemes, called codecs, available in Hadoop. Below is the list (a short usage sketch follows it):

      Gzip: Creates a file with the .gz extension; the gunzip command is used to decompress it.
      Bzip2: Better compression than gzip but very slow. Of all the codecs available in Hadoop, bzip2 is the slowest. Use it only for archives that will rarely be read and where disk space is a concern.
      Snappy: Fastest decompression speed. Recommended for data that will be queried often.
      LZO: Similar to Snappy, with a modest compression ratio but fast compression and decompression speeds. It supports splittable compression, which enables parallel processing of compressed text file splits by MapReduce jobs.
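
      As an illustration, here is a minimal sketch of compressing a local file into HDFS with GzipCodec; the default Configuration and the two path arguments are assumptions for the example, not part of the answer above.

      import java.io.InputStream;
      import java.io.OutputStream;
      import java.nio.file.Files;
      import java.nio.file.Paths;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IOUtils;
      import org.apache.hadoop.io.compress.CompressionCodec;
      import org.apache.hadoop.io.compress.GzipCodec;
      import org.apache.hadoop.util.ReflectionUtils;

      public class HdfsCompressWrite {
          public static void main(String[] args) throws Exception {
              String localInput = args[0];   // e.g. a local log file (hypothetical)
              String hdfsOutput = args[1];   // e.g. /data/events.log.gz (hypothetical)

              Configuration conf = new Configuration();
              FileSystem fs = FileSystem.get(conf);

              // GzipCodec wraps the HDFS output stream and compresses on the fly
              CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

              try (InputStream in = Files.newInputStream(Paths.get(localInput));
                   OutputStream out = codec.createOutputStream(fs.create(new Path(hdfsOutput)))) {
                  IOUtils.copyBytes(in, out, 4096, false);  // copy and compress
              }
          }
      }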
      Can we process compressed data using MapReduce, or do we need to uncompress it before processing?
      Yes, we can process compressed data using MapReduce. MapReduce automatically decompresses the input while reading it, using the file name extension to determine which codec to use.
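
      In the other direction, a job can also be asked to compress what it writes. A minimal sketch, assuming a trivial identity mapper and Snappy for the output (the job name and paths are illustrative only):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.io.compress.SnappyCodec;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class CompressedOutputJob {
          public static void main(String[] args) throws Exception {
              Job job = Job.getInstance(new Configuration(), "compressed-output");
              job.setJarByClass(CompressedOutputJob.class);
              job.setMapperClass(Mapper.class);            // identity mapper, for illustration
              job.setOutputKeyClass(LongWritable.class);
              job.setOutputValueClass(Text.class);

              // Compressed input (e.g. .gz files) needs no special handling here:
              // the framework decompresses it transparently based on the extension.
              FileInputFormat.addInputPath(job, new Path(args[0]));
              FileOutputFormat.setOutputPath(job, new Path(args[1]));

              // Compress the job output with Snappy (requires the Hadoop native libraries)
              FileOutputFormat.setCompressOutput(job, true);
              FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);

              System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
      }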

      Which is the best compression codec?
      Choosing a data compression format depends on the use case (cluster design, data). For example, if processing speed is important, we should use a codec with a high decompression speed.

    • #6067
      DataFlair Team
      Spectator

      File compression helps to reduce the size of the file. Hence, in Hadoop, it reduces the space needed for storing large files.
      Another advantage of compression is that it speeds up data transfer across the network in the cluster.
      While dealing with large volumes of files, these savings can make a significant difference.
      We have to decide whether compression is required or not depending on the data available. For example, if the data is already compressed, compressing it again will make it even larger than it was; in such cases, compression is not recommended.
      Hadoop provides various compression tools called ‘codecs’. A codec is the implementation of a compression-decompression algorithm.
      Below are the codec specifications:

      GZIP: Compresses and creates a file with the .gz extension; the gunzip command is used to decompress it. It provides a higher compression ratio (in terms of size).
      BZip2: It provides a higher compression ratio than GZIP, but it is very slow; in fact it is the slowest of all the codecs in Hadoop. Use it only for archival data that is rarely accessed and where disk space is a concern.
      Snappy: It provides a fast compression/decompression rate, and hence is a better choice for hot data that is accessed frequently (see the configuration sketch after this list).
      LZO: Similar to Snappy, with a medium compression ratio but a fast compression/decompression rate. In addition, it supports splittable compression once the files are indexed, which enables parallel processing of compressed text file splits via MapReduce jobs.
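
      Because of that speed, a fast codec such as Snappy is also commonly enabled for intermediate (map output) data. This is only a hedged sketch using the standard Hadoop 2+ property names, not something the answer above prescribes:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.io.compress.CompressionCodec;
      import org.apache.hadoop.io.compress.SnappyCodec;

      public class MapOutputCompression {
          public static Configuration withSnappyMapOutput() {
              Configuration conf = new Configuration();
              // Compress the intermediate map output that is shuffled to the reducers
              conf.setBoolean("mapreduce.map.output.compress", true);
              conf.setClass("mapreduce.map.output.compress.codec",
                      SnappyCodec.class, CompressionCodec.class);
              return conf;
          }
      }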

      Can we process compressed data using MapReduce, or do we need to uncompress it before processing?
      Yes, we can process compressed data using MapReduce.
      If the input files are compressed, they will be decompressed automatically as they are read by MapReduce, using the filename extension to determine which codec to use. For example, a file ending in .gz is identified as a gzip-compressed file and is therefore read with GzipCodec.
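
      The same extension-to-codec mapping is exposed through CompressionCodecFactory, so the lookup can also be done by hand when reading a file outside of MapReduce. A minimal sketch (the URI argument is illustrative):

      import java.io.InputStream;
      import java.net.URI;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IOUtils;
      import org.apache.hadoop.io.compress.CompressionCodec;
      import org.apache.hadoop.io.compress.CompressionCodecFactory;

      public class ReadByExtension {
          public static void main(String[] args) throws Exception {
              String uri = args[0];   // e.g. hdfs://namenode/data/events.log.gz (hypothetical)
              Configuration conf = new Configuration();
              FileSystem fs = FileSystem.get(URI.create(uri), conf);
              Path path = new Path(uri);

              // The factory maps .gz -> GzipCodec, .bz2 -> BZip2Codec, and so on
              CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path);

              try (InputStream in = (codec == null)
                      ? fs.open(path)                             // not compressed
                      : codec.createInputStream(fs.open(path))) { // decompress on read
                  IOUtils.copyBytes(in, System.out, 4096, false);
              }
          }
      }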

      Which is the best compression codec?
      Choosing the best compression codec depends on our needs/use case, such as the data, the available space, or the required speed.
      For example, if size is the main concern, we can go with codecs that provide a higher compression ratio.
