Why can't aggregation be done in the Hadoop Mapper?

    • #5628
      DataFlair Team
      Spectator

      Why do we need the Reducer to perform aggregation in MapReduce?
      Why can we not perform aggregation in the Mapper?

    • #5630
      DataFlair Team
      Spectator

      The final aggregation cannot be completed on the Mapper side. The reasons are:
      1. Aggregation needs all the values for a key grouped together. That grouping comes from the shuffle and sort, which is only complete on the Reducer side.
      2. Aggregation needs the output of every mapper. That is not available during the map phase, because map tasks run independently on different nodes, where their data blocks are stored.
      3. A Mapper is instantiated per InputSplit. Once the InputSplit is processed, the map task writes its intermediate output to the local disk and finishes, so no mapper holds the data of other splits for aggregation.
      4. Aggregating in a mapper would mean moving the outputs of all the other mappers, running on different machines, to that one task, which increases network congestion.
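
The reasons above can be illustrated with a small plain-Python simulation. It is only a sketch and does not use the Hadoop API: the split contents and the names `map_split`, `local_counts` and `reduce_counts` are made up for illustration. Each simulated map task sees only its own split, so any count it computes is partial.

```python
from collections import defaultdict

# Two hypothetical input splits, each processed by a separate map task.
SPLITS = [
    ["apple banana", "apple"],   # split 0
    ["banana apple", "cherry"],  # split 1
]

def map_split(lines):
    """Simulate one map task: emit (word, 1) for every word in its own split."""
    return [(word, 1) for line in lines for word in line.split()]

def local_counts(pairs):
    """What a single map task could compute alone: counts for its split only."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

def reduce_counts(all_pairs):
    """Simulate the reduce side: sum the values of each key across all map outputs."""
    totals = defaultdict(int)
    for word, n in all_pairs:
        totals[word] += n
    return dict(totals)

map_outputs = [map_split(split) for split in SPLITS]
per_split = [local_counts(pairs) for pairs in map_outputs]
totals = reduce_counts(pair for pairs in map_outputs for pair in pairs)

print(per_split)  # partial counts, one dict per split
print(totals)     # complete counts, available only after combining every map output
```

Neither entry of `per_split` gives the right count for `apple`; the correct total appears only once the outputs of both map tasks are brought together.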

    • #5631
      DataFlair Team
      Spectator

      Aggregation produces the final result of a MapReduce job by combining the outputs of all the Mappers. Before the aggregation can run, the intermediate output from the mappers must be shuffled and sorted. Shuffling and sorting ensure that all the values for the same key go to the same Reducer.
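
The shuffle and sort step can be sketched in plain Python as follows. This is a simplified simulation, not Hadoop code: `NUM_REDUCERS`, `partition` and `shuffle_and_sort` are hypothetical names, and `zlib.crc32` merely stands in for whatever hash function a real partitioner uses. The point is that the reducer is chosen from the key alone, so equal keys always end up in the same place.

```python
import zlib
from itertools import groupby

NUM_REDUCERS = 2  # hypothetical number of reduce tasks

def partition(key, num_reducers=NUM_REDUCERS):
    """Pick a reducer from the key alone, so equal keys always map to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle_and_sort(map_outputs, num_reducers=NUM_REDUCERS):
    """Route every (key, value) pair to its reducer, then sort and group by key."""
    buckets = [[] for _ in range(num_reducers)]
    for pairs in map_outputs:
        for key, value in pairs:
            buckets[partition(key, num_reducers)].append((key, value))
    grouped = []
    for bucket in buckets:
        bucket.sort(key=lambda kv: kv[0])  # sorting puts equal keys next to each other
        grouped.append(
            {k: [v for _, v in kvs] for k, kvs in groupby(bucket, key=lambda kv: kv[0])}
        )
    return grouped

# Hypothetical intermediate output of two map tasks.
map_outputs = [
    [("apple", 1), ("banana", 1), ("apple", 1)],
    [("banana", 1), ("apple", 1), ("cherry", 1)],
]

reducer_inputs = shuffle_and_sort(map_outputs)
# Each reducer sums the value list of every key it received.
results = {key: sum(values) for groups in reducer_inputs for key, values in groups.items()}
print(reducer_inputs)
print(results)
```

Every key shows up in exactly one reducer's input, together with all of its values, which is what makes a correct sum possible at that point.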

      We cannot do the aggregation (for example, addition) in a mapper because a mapper never sees the complete, grouped set of values for a key. The sort that brings together the values from all mappers is finished only on the reducer side. A Mapper instance is created for each input split, and its map method is called once for each row (record) of that split. Whatever a mapper accumulates is limited to its own split, so it has no track of the values processed by other mappers.
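
A minimal sketch of that lifecycle, assuming the usual arrangement in which a map task creates one mapper object for its split and calls its map method once per record. This is plain Python rather than the Hadoop API; `SumMapper`, `run_map_task` and the record values are hypothetical.

```python
class SumMapper:
    """Hypothetical mapper: one instance per input split, map() called once per record."""

    def __init__(self):
        self.running_total = 0  # state lives only inside this instance

    def map(self, record):
        self.running_total += int(record)

    def cleanup(self):
        """Emit what this instance saw: a partial sum for its own split."""
        return ("total", self.running_total)

def run_map_task(split):
    mapper = SumMapper()      # a fresh instance for every split
    for record in split:      # the same instance handles every record of the split
        mapper.map(record)
    return mapper.cleanup()

splits = [["1", "2", "3"], ["10", "20"]]  # made-up records
partials = [run_map_task(s) for s in splits]
final = sum(value for _, value in partials)  # the step a reducer performs

print(partials)  # [('total', 6), ('total', 30)]
print(final)     # 36
```

Each mapper can keep a running value across the records of its own split, but the two partial sums still have to meet somewhere to give the overall total.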
