How to reduce the data volume during shuffling between Mapper and Reducer Node


Viewing 2 reply threads
  • Author
    Posts
    • #5580
      DataFlair Team
      Spectator

      In MapReduce, the output of the mapper is shuffled to the reducer node. Shuffling is the physical movement of data over the network, which is very costly; hence MapReduce speed depends on network bandwidth. When optimizing a MapReduce job for efficiency, we can compress the intermediate output so that the time required to shuffle the data is minimized. How do we configure this compression?

    • #5583
      DataFlair Team
      Spectator

      With the introduction of MapReduce 2, the mapper output (intermediate output) can be compressed by setting the property below in the driver class:

      conf.set("mapreduce.map.output.compress", "true")

      Note that Configuration.set takes string arguments, so the value is passed as the string "true"; conf.setBoolean("mapreduce.map.output.compress", true) also works. Judging by most references on the internet, the system then compresses the data (in this case the intermediate/mapper output) with a compression codec such as gzip, Snappy, or bzip2. The codec is selected by the property

      mapreduce.map.output.compress.codec

      (the deprecated name is mapred.map.output.compression.codec). Many worry about whether the chosen codec produces a splittable format, but that is not a concern here, because the intermediate output is not stored in or read by HDFS directly. There is, in turn, another property,

      mapreduce.output.fileoutputformat.compress.type (formerly mapred.output.compression.type),

      which is advised to be set to BLOCK when the job output is a SequenceFile: records are then compressed in groups, so the output remains splittable as input to a subsequent job, regardless of which codec was used.
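      Putting these settings together, a driver class might look roughly like the sketch below. This is an illustration, not a definitive recipe: the class name CompressedShuffleDriver is hypothetical, and Snappy is just one example codec choice (it favours speed over compression ratio).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

public class CompressedShuffleDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Enable compression of the intermediate (map) output.
        conf.setBoolean("mapreduce.map.output.compress", true);
        // Pick the codec; SnappyCodec is an example choice.
        conf.setClass("mapreduce.map.output.compress.codec",
                      SnappyCodec.class, CompressionCodec.class);
        Job job = Job.getInstance(conf, "compressed-shuffle-example");
        // ... set mapper, reducer, input and output paths as usual ...
    }
}
```

      The same properties can also be passed on the command line with -D flags (e.g. -Dmapreduce.map.output.compress=true) instead of being hard-coded in the driver.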

    • #5584
      DataFlair Team
      Spectator

      Once the map task is completed on a Mapper node, the node starts transferring the sorted map output over the network to the reducer node where the reduce task will be running. At the same time, the mapper node might be running other map tasks as well. The process of transferring the data over the network from the mapper node to reducer node as input is known as shuffling.

      The output produced by the mapper is not written straight to disk; it is first buffered in memory and processed further (partitioned and sorted) to improve efficiency. It is normally a good idea to compress the map output while writing it onto the disk, as this saves disk space and improves performance by reducing the data volume transferred to the reducer. By default, the output of the mapper is not compressed; however, it can be enabled with the following setting

      mapreduce.map.output.compress = true (the deprecated name is mapred.compress.map.output)
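      To see why this setting pays off, note that typical intermediate output is highly repetitive key/value text, which stream compression shrinks dramatically. A minimal, self-contained sketch (using java.util.zip from the JDK rather than Hadoop's codec classes, purely for illustration):

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPOutputStream;

public class ShuffleCompressionDemo {
    public static void main(String[] args) throws Exception {
        // Fake intermediate output: repetitive "word\t1" records,
        // much like a word-count mapper would emit.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10_000; i++) {
            sb.append("hadoop\t1\n").append("shuffle\t1\n");
        }
        byte[] raw = sb.toString().getBytes("UTF-8");

        // Gzip the buffer, standing in for the map-output codec.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(raw);
        }
        byte[] compressed = buf.toByteArray();

        System.out.println("raw bytes:        " + raw.length);
        System.out.println("compressed bytes: " + compressed.length);
        // Far fewer bytes would cross the network during the shuffle.
    }
}
```

      The exact ratio depends on the data, but repetitive map output routinely compresses by an order of magnitude or more, which is exactly the traffic the shuffle phase has to move.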

      A second option to improve the performance of data transfer between mapper and reducer is to use a Combiner function. A combiner works as a mini-reducer that operates on the data generated by the mapper, purely for the purpose of optimization.
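      As a sketch of how a combiner is wired in: in the standard Hadoop word-count example, the reducer itself can double as the combiner, because summing counts is associative and commutative. The line below assumes a driver with an existing Job object and the IntSumReducer class from that example.

```java
// Inside the driver, after setting the mapper and reducer:
// the combiner pre-aggregates (word, 1) pairs on the mapper node,
// so far fewer records are shuffled across the network.
job.setCombinerClass(IntSumReducer.class);
```

      Note that Hadoop may run the combiner zero, one, or many times per map task, so it must not change the final result; only functions like sums, counts, and maxima are safe to use this way.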
