Free Online Certification Courses – Learn Today. Lead Tomorrow. › Forums › Apache Hadoop › How to reduce the data volume during shuffling between Mapper and Reducer Node
This topic has 2 replies, 1 voice, and was last updated 5 years, 7 months ago by DataFlair Team.
September 20, 2018 at 3:42 pm · #5580 · DataFlair Team (Spectator)
In MapReduce, the output of each mapper is shuffled to the reducer nodes. Shuffling is the physical movement of data over the network, and it is very costly, so MapReduce speed depends heavily on network bandwidth. When it comes to optimizing a MapReduce job for efficiency, one option is to compress the intermediate output so that the time required to shuffle the data is minimized. How do we configure this compression?
September 20, 2018 at 3:42 pm · #5583 · DataFlair Team (Spectator)
Since MapReduce 2, the mapper (intermediate) output can be compressed by setting the following property in the driver class (Configuration.set takes two Strings):
conf.set("mapreduce.map.output.compress", "true");
According to most references, the framework uses a compression codec such as gzip, Snappy, or bzip2 to compress the data (in this case the intermediate/mapper output). The codec is defined by the property
mapred.map.output.compression.codec
Many people worry about whether the chosen codec stores data in a splittable format, but for intermediate output this is not a concern, since it is not stored in or used directly by HDFS. There is a related property, mapred.output.compression.type, which is commonly advised to be set to BLOCK so that the data is compressed in blocks and can be processed as splittable input to the reducers, regardless of which codec was used.
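As a sketch, the same settings can also be placed in the job configuration file. The fragment below is hypothetical; it uses the newer mapreduce.* property names (mapreduce.map.output.compress and mapreduce.map.output.compress.codec), which replace the deprecated mapred.* names mentioned above:

```xml
<!-- Hypothetical mapred-site.xml fragment: enable intermediate (map)
     output compression and select Snappy as the codec. -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```

Snappy is a common choice for intermediate output because it favors speed over compression ratio, and splittability does not matter for map output.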
September 20, 2018 at 3:42 pm · #5584 · DataFlair Team (Spectator)
Once the map task is completed on a Mapper node, the node starts transferring the sorted map output over the network to the reducer node where the reduce task will be running. At the same time, the mapper node might be running other map tasks as well. The process of transferring the data over the network from the mapper node to reducer node as input is known as shuffling.
The output produced by the mapper is not written directly to disk; it is first buffered in memory and processed further (partitioned and sorted) to improve efficiency. It is normally a good idea to compress the map output while writing it to disk, as this saves disk space and improves performance by reducing the volume of data transferred to the reducers. By default the mapper output is not compressed, but compression can be enabled by setting
mapred.compress.map.output to true
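To illustrate why compressing intermediate output pays off, here is a minimal, Hadoop-free Java sketch (the class and method names are made up for illustration) that gzips a buffer of repetitive key/value records, which is the shape map output typically has:

```java
import java.io.ByteArrayOutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class CompressDemo {
    // Returns the size in bytes of the gzip-compressed copy of data.
    static int gzipSize(byte[] data) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(data);
            }
            return bos.size();
        } catch (java.io.IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Intermediate map output is often highly repetitive (many
        // repeated keys), so it compresses very well.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10_000; i++) {
            sb.append("word\t1\n");
        }
        byte[] raw = sb.toString().getBytes(StandardCharsets.UTF_8);
        System.out.println("raw bytes:  " + raw.length);
        System.out.println("gzip bytes: " + gzipSize(raw));
    }
}
```

On repetitive records like these, the compressed size is a small fraction of the raw size, which is exactly the data volume saved during the shuffle.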
A second option to improve the performance of data transfer between mapper and reducer is the Combiner function. A Combiner works as a mini-reducer that operates on the data generated by the mappers, and it is used purely as an optimization.
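The effect of a combiner can be shown with a minimal, Hadoop-free Java sketch (class and method names here are made up for illustration): the mapper emits one (word, 1) pair per token, and the combiner sums counts per key locally, so only one record per distinct word leaves the mapper node.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombinerSketch {
    // Simulated mapper: emits one (token, 1) pair per whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.split("\\s+")) {
            out.add(new AbstractMap.SimpleEntry<>(token, 1));
        }
        return out;
    }

    // Simulated combiner: sums counts per key locally, before the shuffle,
    // so repeated keys collapse into a single record.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            combined.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> raw = map("to be or not to be");
        Map<String, Integer> combined = combine(raw);
        System.out.println("records without combiner: " + raw.size());      // 6
        System.out.println("records with combiner:    " + combined.size()); // 4
    }
}
```

Note that a real Combiner must be commutative and associative (like summation), because Hadoop may run it zero, one, or several times per map task.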