What is combiner in Hadoop?

Viewing 2 reply threads
    • #5899
      DataFlair Team
      Spectator

      What is the role of Combiner in Hadoop MapReduce?
      What is the need of combiner in Hadoop?
      What is Combiner in Hadoop MapReduce?

    • #5902
      DataFlair Team
      Spectator

      In Hadoop MapReduce there is an optional class between the Mapper and the Reducer, called the Combiner.
      When a MapReduce (MR) job runs on a large dataset, the Map task generates huge chunks of intermediate data, which are passed on to the Reduce task. During this phase, the Mapper output has to travel over the network to the node where the Reducer is running. If the data is huge, this movement can cause network congestion.

      To reduce this network congestion, the MR framework provides a function called the ‘Combiner’, also known as a ‘Mini-Reducer’.
      The role of the Combiner is to take the output of the Mapper as its input, process it, and send its output to the Reducer. The Combiner reads each key-value pair, combines all the values for the same key, and sends the result as input to the Reducer, which reduces data movement in the network. The Combiner often uses the same class as the Reducer.

      Example:
      Output from Mapper:

      <What,1> <do,1> <you,1> <mean,1> <by,1> <Object,1>
      <What,1> <do,1> <you,1> <know,1> <about,1> <Java,1>
      <What,1> <is,1> <Java,1> <Virtual,1> <Machine,1>
      <How,1> <Java,1> <enabled,1> <High,1> <Performance,1>

      The above key-value pairs are taken as input by the Combiner, which produces the output below:

      <What,3> <do,2> <you,2> <mean,1> <by,1> <Object,1>
      <know,1> <about,1> <Java,3>
      <is,1> <Virtual,1> <Machine,1>
      <How,1> <enabled,1> <High,1> <Performance,1>

      The above output from Combiner is sent to the Reducer as its input.
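
      To make the combining step above concrete, here is a small self-contained sketch in plain Java (a simulation of the summing logic, not actual Hadoop API code) showing how a word-count combiner collapses the (word, 1) pairs from a single map task:

      ```java
      import java.util.LinkedHashMap;
      import java.util.List;
      import java.util.Map;

      public class CombinerSketch {
          // Simulates a word-count combiner: sums the values of all
          // (word, 1) pairs emitted by one map task, per key.
          public static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
              Map<String, Integer> combined = new LinkedHashMap<>();
              for (Map.Entry<String, Integer> kv : mapOutput) {
                  combined.merge(kv.getKey(), kv.getValue(), Integer::sum);
              }
              return combined;
          }

          public static void main(String[] args) {
              // The mapper output from the example above, flattened into pairs
              List<Map.Entry<String, Integer>> mapOutput = List.of(
                  Map.entry("What", 1), Map.entry("do", 1), Map.entry("you", 1),
                  Map.entry("mean", 1), Map.entry("by", 1), Map.entry("Object", 1),
                  Map.entry("What", 1), Map.entry("do", 1), Map.entry("you", 1),
                  Map.entry("know", 1), Map.entry("about", 1), Map.entry("Java", 1),
                  Map.entry("What", 1), Map.entry("is", 1), Map.entry("Java", 1),
                  Map.entry("Virtual", 1), Map.entry("Machine", 1),
                  Map.entry("How", 1), Map.entry("Java", 1), Map.entry("enabled", 1),
                  Map.entry("High", 1), Map.entry("Performance", 1));

              Map<String, Integer> combined = combine(mapOutput);
              System.out.println(combined.get("What")); // 3
              System.out.println(combined.get("Java")); // 3
          }
      }
      ```

      The 22 pairs emitted by the mapper shrink to 15 combined records, which is the saving the combiner provides before the shuffle.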


    • #5903
      DataFlair Team
      Spectator

      The Combiner is also called a semi-reducer in MapReduce. Combiners are optional and can be specified in the MapReduce driver class to process the output of the map tasks before it is submitted to the reduce tasks.

      In the MapReduce framework, the output of the map tasks is usually large, so the data transfer between map and reduce tasks is high. Transferring this intermediate data (the map output) to the reducers across the network is expensive.

      Combiner functions summarize the map output records that share the same key, and the combiner's output is then sent over the network to the actual reduce tasks as input.

      When we write a mapper or reducer, we implement its interface, i.e. the Mapper interface or the Reducer interface. The combiner, however, does not have its own interface: it implements the Reducer interface, and its reduce() method is called on each map output key. The combiner class's reduce() method must have the same input and output key-value types as the reducer class.
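
      As a rough self-contained illustration of why those types must line up (plain Java simulating the stages, not the actual Hadoop Reducer API): because the reduce logic's input and output key-value types are identical, the very same function can be applied first as a combiner on each map task's output and then again as the reducer on the merged result.

      ```java
      import java.util.ArrayList;
      import java.util.LinkedHashMap;
      import java.util.List;
      import java.util.Map;

      public class SameClassSketch {
          // The same "reduce" logic, with identical input and output
          // key-value types (String -> Integer), is used at both stages.
          public static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
              Map<String, Integer> out = new LinkedHashMap<>();
              for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
                  int sum = 0;
                  for (int v : e.getValue()) sum += v;
                  out.put(e.getKey(), sum);
              }
              return out;
          }

          // Groups (key, value) pairs by key, as the framework does.
          public static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
              Map<String, List<Integer>> g = new LinkedHashMap<>();
              for (Map.Entry<String, Integer> kv : pairs) {
                  g.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
              }
              return g;
          }

          public static void main(String[] args) {
              // Two map tasks emit (word, 1) pairs
              List<Map.Entry<String, Integer>> map1 = List.of(
                  Map.entry("Java", 1), Map.entry("Java", 1), Map.entry("is", 1));
              List<Map.Entry<String, Integer>> map2 = List.of(
                  Map.entry("Java", 1), Map.entry("fast", 1));

              // Combiner stage: reduce each map task's output locally
              Map<String, Integer> c1 = reduce(group(map1)); // {Java=2, is=1}
              Map<String, Integer> c2 = reduce(group(map2)); // {Java=1, fast=1}

              // Reducer stage: the same function consumes the combiner output
              List<Map.Entry<String, Integer>> merged = new ArrayList<>();
              c1.forEach((k, v) -> merged.add(Map.entry(k, v)));
              c2.forEach((k, v) -> merged.add(Map.entry(k, v)));
              System.out.println(reduce(group(merged))); // {Java=3, is=1, fast=1}
          }
      }
      ```

      In real Hadoop code this is why a single reducer class can be registered for both roles in the driver.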

      Combiner functions are suitable for producing summary information from a large dataset, because the combiner replaces the original set of map outputs, ideally with fewer records.

      Hadoop makes no guarantee about how many times the combiner function will be called for each map output key. Sometimes it is not executed at all, while at other times it may run once or several times, depending on the size and number of spill files generated by the map task for each reducer.

      In common practice, the same reducer class is often reused as the combiner class. In some cases this leads to incorrect results: the combiner function must only aggregate values, and the aggregation must be commutative and associative (a sum or a maximum qualifies; an average does not). It is also important that the combiner class has no side effects, and that the actual reducer can process the combiner's results.

      Let’s understand the use of combiner with an example:

      Suppose we have the following weather dataset, in which each record contains a year and a temperature:

      <year, temp>

      First, let's see the normal working of MapReduce without a combiner:

      Consider two maps:
      1)
      Map1 produces the following output:
      (1950, 0)
      (1950, 20)
      (1950, 10)

      Map2 produces the following output:
      (1950, 25)
      (1950, 15)

      2) We need to find the maximum temperature.
      The input to the reducer will be:
      (1950, [0, 20, 10, 25, 15])

      3) Reducer Output:
      (1950, 25)

      Now let's do the same with a combiner:

      The combiner finds the maximum temperature in each mapper's output:

      (1950, 20) from map1
      (1950, 25) from map2.

      So the reducer now receives the following values:
      (1950, [20, 25])

      Comparing the number of values, far fewer are transferred over the network when the combiner is used, while the final result, (1950, 25), is unchanged.
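
      The same numbers also show why the reduce operation must be safe to apply locally first: max is commutative and associative, so pre-combining per map task does not change the answer, whereas a mean would be distorted. A self-contained sketch in plain Java (with hypothetical helper names, not Hadoop API code):

      ```java
      import java.util.List;

      public class CombinerCorrectness {
          public static double mean(List<Double> vs) {
              return vs.stream().mapToDouble(Double::doubleValue).average().orElse(0);
          }

          public static double max(List<Double> vs) {
              return vs.stream().mapToDouble(Double::doubleValue).max().orElse(Double.NEGATIVE_INFINITY);
          }

          public static void main(String[] args) {
              List<Double> map1 = List.of(0.0, 20.0, 10.0); // one map task's values
              List<Double> map2 = List.of(25.0, 15.0);      // another map task's values

              // max: combining per map task first leaves the result unchanged
              double maxDirect = max(List.of(0.0, 20.0, 10.0, 25.0, 15.0)); // 25.0
              double maxCombined = max(List.of(max(map1), max(map2)));      // 25.0

              // mean: pre-aggregating per map task changes the answer
              double meanDirect = mean(List.of(0.0, 20.0, 10.0, 25.0, 15.0)); // 70/5 = 14.0
              double meanCombined = mean(List.of(mean(map1), mean(map2)));    // mean(10, 20) = 15.0

              System.out.println(maxDirect + " " + maxCombined);   // 25.0 25.0
              System.out.println(meanDirect + " " + meanCombined); // 14.0 15.0
          }
      }
      ```

      This is why reusing a max- or sum-style reducer as a combiner is safe, while an averaging reducer needs a different combiner strategy (for example, carrying (sum, count) pairs instead of the mean itself).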

