In Hadoop, when a MapReduce job runs on a large dataset, the mappers generate large volumes of intermediate data, and all of it must be transferred to the reducers for further processing, which can cause enormous network congestion. The MapReduce framework provides a function known as the Combiner that plays a key role in reducing this congestion.
The Combiner is also known as a mini-reducer. It performs local aggregation on each mapper's output, which minimizes the data transferred between the mappers and the reducers and thus increases the efficiency of a MapReduce program.
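The effect of that local aggregation can be sketched without a cluster. The pure-Java simulation below (the class and method names are illustrative, not Hadoop APIs) shows a word-count mapper emitting one (word, 1) pair per word, and a combiner summing those pairs on the mapper's node before the shuffle:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CombinerSketch {
    // Hypothetical mapper output for one node: a (word, 1) pair per word.
    static List<Map.Entry<String, Integer>> mapperOutput(String[] words) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : words) out.add(Map.entry(w, 1));
        return out;
    }

    // The combiner: local aggregation over a single mapper's output,
    // applying the same summing logic the reducer would apply.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            sums.merge(p.getKey(), p.getValue(), Integer::sum);
        return sums;
    }

    public static void main(String[] args) {
        String[] words = {"hadoop", "map", "hadoop", "reduce", "hadoop", "map"};
        List<Map.Entry<String, Integer>> raw = mapperOutput(words);
        Map<String, Integer> combined = combine(raw);
        // Six records would cross the network without the combiner; only
        // three (one per distinct word) cross with it.
        System.out.println("records without combiner: " + raw.size());  // 6
        System.out.println("records with combiner:    " + combined.size());  // 3
    }
}
```

The reducer then merges these pre-summed partial counts exactly as it would merge raw (word, 1) pairs, so the final result is unchanged.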
The execution of the combiner is not guaranteed: Hadoop may run it zero, one, or many times for a given map task. Hence, your MapReduce jobs must not depend on the combiner executing.

Number of Combiners
Unlike the number of reducers, the number of combiner invocations cannot be specified for a MapReduce job; the framework alone decides when and how often the combiner runs.
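This contrast is visible in the job driver API. The sketch below assumes the TokenizerMapper and IntSumReducer classes from the stock Hadoop WordCount example and needs a Hadoop classpath to compile: setNumReduceTasks fixes the reducer count directly, while setCombinerClass only nominates a class the framework may choose to invoke.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);  // mapper from the WordCount example
        job.setReducerClass(IntSumReducer.class);   // reducer from the WordCount example

        // Reducers: we can dictate exactly how many run.
        job.setNumReduceTasks(2);

        // Combiner: we can only nominate the class; the framework decides
        // whether it runs zero, one, or many times per map task.
        job.setCombinerClass(IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Reusing the reducer class as the combiner, as here, works only because summing is safe to apply repeatedly.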
The combiner runs on each mapper node. The first rule of MapReduce combiners is: do not assume that the combiner will run; treat it purely as an optimization.
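Because the combiner may run any number of times, its operation must give the same final answer whether it is applied zero, one, or many times; in practice this means the operation should be associative and commutative. The self-contained sketch below (plain Java, not Hadoop code) shows that summing satisfies this while averaging does not:

```java
import java.util.Arrays;
import java.util.List;

public class CombinerCorrectness {
    // Sum is associative and commutative, so combining partial results
    // any number of times cannot change the reducer's final answer.
    static int sum(List<Integer> xs) {
        return xs.stream().mapToInt(Integer::intValue).sum();
    }

    // Averaging is NOT a safe combiner operation: an average of partial
    // averages is generally not the overall average.
    static double avg(List<Integer> xs) {
        return xs.stream().mapToInt(Integer::intValue).average().orElse(0);
    }

    public static void main(String[] args) {
        List<Integer> partA = Arrays.asList(1, 2, 3); // one mapper's values
        List<Integer> partB = Arrays.asList(10);      // another mapper's values

        // Sum: reducing the combined partials equals reducing the raw values.
        int direct = sum(Arrays.asList(1, 2, 3, 10));              // 16
        int viaCombiner = sum(Arrays.asList(sum(partA), sum(partB))); // 6 + 10 = 16
        System.out.println(direct == viaCombiner);                 // true

        // Average: combining first changes the result (4.0 vs 6.0).
        double trueAvg = avg(Arrays.asList(1, 2, 3, 10));          // 4.0
        double badAvg = avg(Arrays.asList((int) avg(partA), (int) avg(partB))); // avg(2, 10) = 6.0
        System.out.println(trueAvg + " vs " + badAvg);
    }
}
```

To average safely with a combiner, the usual approach is to have the combiner emit (sum, count) pairs and let the reducer divide at the end.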
Each map task writes its output to an in-memory buffer, 100 MB by default. When the contents of the buffer reach a threshold, 80% by default, a background thread starts to spill the contents to disk. When the data does not need to be spilled to disk, MapReduce may skip running the combiner entirely. Note also that the combiner may be run multiple times over subsets of the data without affecting the final result.
By default, the combiner also executes during the merge of spill files if there are at least 3 of them. This threshold can be changed through the min.num.spills.for.combine property.
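The spill-and-combine behavior described above can be sketched as a toy model. The simulation below is illustrative plain Java, not Hadoop internals (real spill counts also depend on record metadata and when the task finishes); it only models the rule that each spill holds roughly buffer-size times the threshold, and that the merge-time combiner fires once the spill count reaches the minimum:

```java
public class SpillModel {
    // Defaults described in the text above; sizes in MB.
    static final int BUFFER_MB = 100;            // default map-side buffer
    static final double SPILL_THRESHOLD = 0.8;   // spill at 80% full
    static final int MIN_SPILLS_FOR_COMBINE = 3; // min.num.spills.for.combine default

    // Toy estimate: number of spill files for a given map output size,
    // assuming each spill holds BUFFER_MB * SPILL_THRESHOLD = 80 MB.
    static int spillCount(int outputMb) {
        int spillSize = (int) (BUFFER_MB * SPILL_THRESHOLD);
        return (outputMb + spillSize - 1) / spillSize; // ceiling division
    }

    // The combiner runs again during the final merge only if enough
    // spill files were produced.
    static boolean combinerRunsAtMerge(int outputMb) {
        return spillCount(outputMb) >= MIN_SPILLS_FOR_COMBINE;
    }

    public static void main(String[] args) {
        // 50 MB of output -> 1 spill -> no combiner at merge time.
        System.out.println(spillCount(50) + " spill(s), combiner at merge: "
                + combinerRunsAtMerge(50));
        // 400 MB of output -> 5 spills -> combiner runs at merge time.
        System.out.println(spillCount(400) + " spill(s), combiner at merge: "
                + combinerRunsAtMerge(400));
    }
}
```

The intuition the model captures: merging many spill files is exactly the situation where re-running the combiner pays off, since it shrinks the data before the merged output is shuffled to the reducers.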