Why can't aggregation be done in the Mapper?

    • #5712
      DataFlair Team
      Spectator

      Why do we need the Reducer to perform aggregation in MapReduce?
      Why can't we perform aggregation in the Mapper?

    • #5713
      DataFlair Team
      Spectator

      A mapper task processes each input record (from the RecordReader) and generates intermediate key-value pairs. The mapper stores this intermediate output on the local disk.
      We cannot perform the complete aggregation in the mapper because:

      Each mapper sorts only its own output. The full sorted group of values for a key exists only on the reducer side, after the shuffle has merged the output of all mappers, and without that grouping a complete aggregation is not possible.
      To perform aggregation, we need the output of all the mappers for a given key. This cannot be collected in the map phase, because the mappers run on different machines, wherever their data blocks are stored, and each mapper sees only its own input split. A combiner can pre-aggregate the output of one mapper, but the result is still partial.
      Aggregating at the mapper would require communication between all the mappers, which may be running on different machines. MapReduce gives mappers no way to talk to each other, and doing so would consume a lot of network bandwidth and could cause a network bottleneck.
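
      The points above can be sketched with a small word count in plain Python. This is only a simulation, not Hadoop code: the input splits and the names run_mapper, shuffle and reduce_key are made up for illustration. It shows that a total computed inside one mapper covers only that mapper's split, while the total computed after grouping the output of all mappers by key is complete.

```python
from collections import defaultdict

# Hypothetical input: two splits, each processed by a separate mapper task.
splits = [
    ["apple banana apple"],
    ["banana apple cherry"],
]

def map_record(line):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word, 1) for word in line.split()]

def run_mapper(split):
    """Run one mapper over its own split only; it never sees other splits."""
    pairs = []
    for record in split:
        pairs.extend(map_record(record))
    return pairs

def shuffle(all_mapper_outputs):
    """Group values by key across every mapper's output, keys in sorted order."""
    grouped = defaultdict(list)
    for pairs in all_mapper_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reduce_key(key, values):
    """Aggregate all values that belong to one key."""
    return key, sum(values)

mapper_outputs = [run_mapper(split) for split in splits]

# Totals computed inside each mapper: correct for the split, partial for the job.
local_totals = []
for pairs in mapper_outputs:
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    local_totals.append(dict(totals))

# Totals computed after the shuffle: every value for a key is present.
final_totals = dict(
    reduce_key(key, values) for key, values in shuffle(mapper_outputs).items()
)

print(local_totals)
print(final_totals)
```

      Neither mapper on its own knows the full count for "apple"; only the step that sees the grouped output of both mappers does.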

    • #5714
      DataFlair Team
      Spectator

      Aggregation cannot be done in the map phase because it needs all the values for a key to be brought together and sorted, and each mapper runs on a single input split (typically one data block). A mapper therefore never sees the records of any other split. One mapper task is started per input split, and its map function is called once for each record in that split. The mapper's output is written to the local disk, then partitioned, sorted and shuffled to the reducers before the reduce phase.
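
      To make the per-split scope concrete, here is a small plain-Python sketch. It is not Hadoop code, and SplitMapper and its methods are hypothetical names. It assumes the usual Hadoop arrangement of one map task per input split, with the map function called once per record of that split.

```python
class SplitMapper:
    """Hypothetical stand-in for one map task bound to one input split."""

    instances_created = 0

    def __init__(self, split_id):
        SplitMapper.instances_created += 1
        self.split_id = split_id
        self.map_calls = 0

    def map(self, record):
        """Called once per record; emits (word, 1) pairs."""
        self.map_calls += 1
        return [(word, 1) for word in record.split()]

    def run(self, split):
        """Feed every record of this mapper's split through map()."""
        output = []
        for record in split:
            output.extend(self.map(record))
        return output

# Hypothetical splits: two records each.
splits = {
    "split-0": ["apple banana", "apple"],
    "split-1": ["banana apple", "cherry"],
}

mappers = {split_id: SplitMapper(split_id) for split_id in splits}
outputs = {
    split_id: mappers[split_id].run(records)
    for split_id, records in splits.items()
}

print(SplitMapper.instances_created)   # one mapper per split
print(mappers["split-0"].map_calls)    # one map() call per record
print(outputs["split-0"])
```

      The mapper for split-0 never produces a pair for "cherry", because that record lives in split-1.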

    • #5715
      DataFlair Team
      Spectator

      We cannot perform aggregation in the Mapper because:

      1) A MapReduce job runs many mappers, so aggregating in the map phase would require them to communicate with each other, which would increase network congestion.

      2) The data must be sorted before it can be aggregated.

      On the reducer side the data is already partitioned and sorted, so it is best to perform the aggregation there.
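
      Why partitioned and sorted data makes aggregation easy can be sketched as follows. This is a plain-Python illustration, not Hadoop code: NUM_REDUCERS, partition_for and run_reducer are hypothetical names, and zlib.crc32 stands in for a key hash so that the result is the same on every run. Partitioning sends every pair with the same key to the same reducer, and sorting puts equal keys next to each other, so each reducer can aggregate in a single pass.

```python
import zlib
from itertools import groupby
from operator import itemgetter

NUM_REDUCERS = 2  # hypothetical number of reduce tasks

def partition_for(key, num_reducers=NUM_REDUCERS):
    """Pick a reducer from a hash of the key (crc32 is an assumption here)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# Hypothetical output of two mappers.
mapper_outputs = [
    [("apple", 1), ("banana", 1), ("apple", 1)],
    [("banana", 1), ("apple", 1), ("cherry", 1)],
]

# Shuffle: route every pair to the partition chosen for its key.
partitions = [[] for _ in range(NUM_REDUCERS)]
for pairs in mapper_outputs:
    for key, value in pairs:
        partitions[partition_for(key)].append((key, value))

def run_reducer(pairs):
    """Sort by key, then sum each run of equal keys in one pass."""
    result = {}
    sorted_pairs = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        result[key] = sum(value for _, value in group)
    return result

reducer_results = [run_reducer(pairs) for pairs in partitions]
print(reducer_results)
```

      Which reducer receives which key depends on the hash, but no key is ever split across two reducers, so each reducer's totals are final.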
