Forums › Hadoop › How many times is the combiner called on a mapper node?
September 20, 2018 at 4:11 pm #5780
How many times is a combiner called on a mapper node for a given MapReduce job?
Can we control the execution of a combiner? Can we specify how many combiners should run?
September 20, 2018 at 4:12 pm #5782
When we run a MapReduce job on a large dataset, the mapper generates large chunks of intermediate data, and the framework passes this intermediate data to the reducer for further processing. This transfer can cause enormous network congestion.
The MapReduce framework provides a function known as the Combiner that plays a vital role in reducing network congestion; it is also called a mini-reducer. In a MapReduce job, the combiner performs local aggregation on the mapper output. This minimizes the data transferred between mapper and reducer and therefore increases the efficiency of the MapReduce program.
The execution of the combiner is not guaranteed: Hadoop may or may not run it, and it may run it more than once. A MapReduce job should therefore never depend on combiner execution; since the combiner can run zero, one, or many times, the job must produce the same results regardless. Whether the combiner runs also depends on the size of the intermediate map output and on the value of the min.num.spills.for.combine property: by default, a mapper must produce at least 3 spill files for the combiner to run, if a combiner is specified.
September 20, 2018 at 4:12 pm #5783
The combiner runs at merge time if the number of spills is at least minSpillsForCombine, which is driven by the property "mapreduce.map.combine.minspills", whose default value is 3.
With the default value of 3, the combiner runs at merge time only when a mapper has written at least 3 spill files to disk.
Recall that combiners may be run repeatedly over the input without affecting the final result. If there are only one or two spills, the potential reduction in map output size is not worth the overhead of invoking the combiner, so it is not run again for this map output.
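The claim that repeated combiner runs do not affect the final result holds because the aggregation must be associative and commutative (as a sum is). A minimal Python simulation of that property:

```python
from collections import defaultdict

def combine(pairs):
    """Local per-key sum; because sum is associative and commutative,
    this can safely run zero, one, or many times."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())

def reduce_phase(pairs):
    """Final reduce: same summation, producing the result dictionary."""
    return dict(combine(pairs))

mapper_output = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]

never = mapper_output                      # combiner ran zero times
once = combine(mapper_output)              # combiner ran once
twice = combine(combine(mapper_output))    # combiner ran twice

# The reducer sees different record counts but computes identical totals.
assert reduce_phase(never) == reduce_phase(once) == reduce_phase(twice)
```

This is why Hadoop is free to skip the combiner for one or two spills: correctness never depends on how many times it ran.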
A combiner is never run for map-only jobs.
We cannot hard-code or direct the framework on how many times the combiner should run for a given MapReduce job.
September 20, 2018 at 4:12 pm #5785
Whether the combiner function is invoked depends largely on the size of the input: the larger the input data, the larger the intermediate output from the mapper.
If this entire output were sent directly to the reducer, processing that large amount of data would take more time.
So, basically, the combiner is invoked to shrink the intermediate output from the mapper, so that the reducer has less data to process and can produce the final output in less time.
The number of combiner invocations is not predefined; it can be zero or many, depending on the size of the data.
September 20, 2018 at 4:12 pm #5787
The combiner, also termed a mini-reducer, processes the intermediate output of the mapper before passing it to the reducer.
When a MapReduce job runs on a large data set, the mapper generates a large chunk of intermediate output that is transferred to the reducer over the network, causing network congestion. The combiner helps reduce this congestion by reducing the amount of data transferred to the reducer: it takes input from the mapper, summarizes the values for each key, and passes the resulting key-value pairs to the reducer.
A few points worth noting about the combiner:
1. The combiner is a class that does not have its own interface; it extends the Reducer class and overrides the reduce method.
2. The combiner's output key-value types must be the same as the reducer's input types (i.e., the mapper's output types).
3. The execution of the combiner is not guaranteed, so it cannot be predicted how many times a combiner will be invoked.
4. Because there is no shuffle phase whose traffic needs reducing, the combiner is not run in map-only jobs.
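Point 2 is what allows Hadoop's own WordCount example to register the same reduce logic as both combiner and reducer (via `Job.setCombinerClass`): the reduce function's input and output key-value types match. A Python sketch of that reuse (the function names here are illustrative, not the Hadoop API):

```python
from itertools import groupby
from operator import itemgetter

def sum_reduce(key, values):
    """Reduce logic whose input types (str, iterable of int) match its
    output types (str, int), so it can double as a combiner."""
    return key, sum(values)

def run(pairs, fn):
    """Group sorted key-value pairs by key and apply the reduce function."""
    out = []
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        out.append(fn(key, (v for _, v in group)))
    return out

mapped = [("hadoop", 1), ("spark", 1), ("hadoop", 1)]
combined = run(mapped, sum_reduce)   # same logic run as the combiner
final = run(combined, sum_reduce)    # same logic run as the reducer
print(final)                         # [('hadoop', 2), ('spark', 1)]
```

If the reduce logic changed the value type (say, int counts in and formatted strings out), it could no longer serve as a combiner, because the reducer would then receive input of the wrong type.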