How to reduce the data volume during shuffling between Mapper and Reducer Node


Viewing 2 reply threads
  • Author
    Posts
    • #5580
      DataFlair Team
      Spectator

      In MapReduce, the output of the mapper is shuffled to the reducer node. Shuffling is the physical movement of data over the network, which is very costly; hence MapReduce speed depends on network bandwidth. When optimizing a MapReduce job for efficiency, we can compress the intermediate output so that the time required to shuffle the data is minimized. How do we configure this compression?

    • #5583
      DataFlair Team
      Spectator

      With the introduction of MapReduce 2, the mapper output (intermediate output) can be compressed by setting the property below in the driver class:

      conf.set("mapreduce.map.output.compress", "true")

      Note that Configuration.set takes string arguments, so the value is passed as the string "true"; conf.setBoolean("mapreduce.map.output.compress", true) also works. Judging by most references on the internet, the system then compresses the data (in this case the intermediate/mapper output) with a compression codec such as gzip, Snappy, or bzip2. The codec is selected by the property

      mapreduce.map.output.compress.codec

      (the deprecated name is mapred.map.output.compression.codec). Many worry about whether the chosen codec produces a splittable format, but that is not a concern here, because the intermediate output is not stored in or read by HDFS directly. There is, in turn, another property,

      mapreduce.output.fileoutputformat.compress.type (formerly mapred.output.compression.type),

      which is advised to be set to BLOCK when the job output is a SequenceFile: records are then compressed in groups, so the output remains splittable as input to a subsequent job, regardless of which codec was used.
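      Putting these settings together, a driver class might look roughly like the sketch below. This is an illustration, not a definitive recipe: the class name CompressedShuffleDriver is hypothetical, and Snappy is just one example codec choice (it favours speed over compression ratio).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

public class CompressedShuffleDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Enable compression of the intermediate (map) output.
        conf.setBoolean("mapreduce.map.output.compress", true);
        // Pick the codec; SnappyCodec is an example choice.
        conf.setClass("mapreduce.map.output.compress.codec",
                      SnappyCodec.class, CompressionCodec.class);
        Job job = Job.getInstance(conf, "compressed-shuffle-example");
        // ... set mapper, reducer, input and output paths as usual ...
    }
}
```

      The same properties can also be passed on the command line with -D flags (e.g. -Dmapreduce.map.output.compress=true) instead of being hard-coded in the driver.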

    • #5584
      DataFlair Team
      Spectator

      Once the map task is completed on a Mapper node, the node starts transferring the sorted map output over the network to the reducer node where the reduce task will be running. At the same time, the mapper node might be running other map tasks as well. The process of transferring the data over the network from the mapper node to reducer node as input is known as shuffling.

      The output produced by the mapper is not written straight to disk; it is first buffered in memory and processed further (partitioned and sorted) to improve efficiency. It is normally a good idea to compress the map output while writing it onto the disk, as this saves disk space and improves performance by reducing the data volume transferred to the reducer. By default, the output of the mapper is not compressed; however, it can be enabled with the following setting

      mapreduce.map.output.compress = true (the deprecated name is mapred.compress.map.output)
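      To see why this setting pays off, note that typical intermediate output is highly repetitive key/value text, which stream compression shrinks dramatically. A minimal, self-contained sketch (using java.util.zip from the JDK rather than Hadoop's codec classes, purely for illustration):

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPOutputStream;

public class ShuffleCompressionDemo {
    public static void main(String[] args) throws Exception {
        // Fake intermediate output: repetitive "word\t1" records,
        // much like a word-count mapper would emit.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10_000; i++) {
            sb.append("hadoop\t1\n").append("shuffle\t1\n");
        }
        byte[] raw = sb.toString().getBytes("UTF-8");

        // Gzip the buffer, standing in for the map-output codec.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(raw);
        }
        byte[] compressed = buf.toByteArray();

        System.out.println("raw bytes:        " + raw.length);
        System.out.println("compressed bytes: " + compressed.length);
        // Far fewer bytes would cross the network during the shuffle.
    }
}
```

      The exact ratio depends on the data, but repetitive map output routinely compresses by an order of magnitude or more, which is exactly the traffic the shuffle phase has to move.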

      A second option to improve the performance of data transfer between mapper and reducer is to use a Combiner function. A combiner works as a mini-reducer that operates on the data generated by the mapper, purely for the purpose of optimization.
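      As a sketch of how a combiner is wired in: in the standard Hadoop word-count example, the reducer itself can double as the combiner, because summing counts is associative and commutative. The line below assumes a driver with an existing Job object and the IntSumReducer class from that example.

```java
// Inside the driver, after setting the mapper and reducer:
// the combiner pre-aggregates (word, 1) pairs on the mapper node,
// so far fewer records are shuffled across the network.
job.setCombinerClass(IntSumReducer.class);
```

      Note that Hadoop may run the combiner zero, one, or many times per map task, so it must not change the final result; only functions like sums, counts, and maxima are safe to use this way.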
