How to enable/configure the compression of map output data in hadoop?


  • Author
    Posts
    • #5110
      DataFlair Team
      Spectator

      In MapReduce, the output of each mapper is shuffled to the reducer nodes. Shuffling is the physical movement of data over the network, which is costly, so the speed of a MapReduce job depends heavily on network bandwidth. One way to optimize a job is to compress the intermediate (map) output so that less data has to be shuffled. How do we configure this compression?

    • #5113
      DataFlair Team
      Spectator

      Hadoop handles compression through codecs (a codec is a compressor/decompressor pair). Four commonly used ones are:

      1) LZO – Very fast decompression and reasonable compression

      2) GZIP – Reasonable decompression and reasonable compression

      3) Snappy – Fast compression and fast decompression, at the cost of a lower compression ratio

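      4) BZIP2

      GZIP and Snappy produce ordinary, non-splittable files, and LZO files are splittable only after they have been indexed, whereas BZIP2 produces a splittable format, i.e. a single compressed file can be divided among a number of map tasks. Splittability matters for files that later serve as job input; for intermediate map output, codec speed is usually the deciding factor.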

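      To use one of these codecs:
      1) Set the “mapreduce.map.output.compress” property to true (the older, deprecated name is “mapred.compress.map.output”). Note that “mapred.output.compress” controls compression of the final job output, not of the map output.

      hadoop-2.5.0-cdh5.3.2/etc/hadoop/mapred-site.xml
      <property>
        <name>mapreduce.map.output.compress</name>
        <value>true</value>
      </property>

      The codec is selected by the “mapreduce.map.output.compress.codec” property (deprecated name “mapred.map.output.compression.codec”). It defaults to “org.apache.hadoop.io.compress.DefaultCodec”, so that codec is used as soon as you complete step 1. You can replace “DefaultCodec” with the codec of your choice, e.g. SnappyCodec or an LZO codec.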

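      2) We can also write our own codec by implementing the “org.apache.hadoop.io.compress.CompressionCodec” interface and naming that class in the codec property.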
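
      As a rough sketch of step 1 for the intermediate (map) output specifically, the fragment below enables compression and selects a codec in mapred-site.xml. It assumes the Hadoop 2.x property names “mapreduce.map.output.compress” and “mapreduce.map.output.compress.codec”, and it picks Snappy purely as an illustration; that choice assumes the native Snappy libraries are available on every node, which the original post does not state.

      ```xml
      <!-- mapred-site.xml (sketch; Hadoop 2.x property names assumed) -->
      <configuration>
        <!-- Compress the mapper output before it is shuffled to the reducers -->
        <property>
          <name>mapreduce.map.output.compress</name>
          <value>true</value>
        </property>
        <!-- Codec for the map output; Snappy is an illustrative choice,
             omit this property to keep the DefaultCodec -->
        <property>
          <name>mapreduce.map.output.compress.codec</name>
          <value>org.apache.hadoop.io.compress.SnappyCodec</value>
        </property>
      </configuration>
      ```

      The same two properties can also be set per job rather than cluster-wide, which is often preferable when only some jobs are shuffle-heavy.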
