This topic contains 3 replies, has 1 voice, and was last updated by  dfbdteam3 1 year, 6 months ago.

Viewing 4 posts - 1 through 4 (of 4 total)
  • Author
    Posts
  • #5712

    dfbdteam3
    Moderator

    Why do we need the Reducer to perform aggregation in MapReduce?
    Why can't we perform aggregation in the Mapper?

    #5713

    dfbdteam3
    Moderator

    A Mapper task processes each input record (from the RecordReader) and emits key-value pairs. The Mapper stores its intermediate output on the local disk of the node it runs on.
    We cannot perform the final aggregation in the Mapper because:

    All values for a key are brought together, in sorted order, only on the Reducer side. Map output is sorted within each map task, but the merged view of a key across all map tasks exists only after the shuffle. Without that grouping, a complete aggregate is not possible.
    To perform aggregation, we need the output of all the mappers. That output cannot be collected during the map phase, because the mappers run on different machines, wherever the data blocks are stored.
    If we tried to aggregate the data at the mapper, every mapper would have to communicate with all the others, which may be running on different machines. This would consume a lot of network bandwidth and could create a network bottleneck.
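
The points above can be illustrated with a small sketch in plain Python (not Hadoop code). The two splits and the function names `aggregate_in_mapper` and `reduce_counts` are made up for this example. It shows that a mapper aggregating only its own split produces partial counts, and that a correct total appears only once the partial results from all mappers are merged. This local pre-aggregation is roughly what a Combiner does in Hadoop; the final step still belongs to the Reducer.

```python
from collections import Counter

# Hypothetical input: two splits, each processed by a different mapper.
splits = [
    ["apple", "banana", "apple"],
    ["banana", "apple", "cherry"],
]


def aggregate_in_mapper(split):
    """Count words using only the records of one split."""
    return Counter(split)


def reduce_counts(partials):
    """Merge the partial counts from all mappers into final totals."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds counts key by key
    return total


partials = [aggregate_in_mapper(s) for s in splits]
final = reduce_counts(partials)

print(dict(partials[0]))  # {'apple': 2, 'banana': 1}  (partial, first split only)
print(dict(final))        # {'apple': 3, 'banana': 2, 'cherry': 1}
```

Neither mapper alone knows that "apple" occurs three times; only the merge step does.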

    #5714

    dfbdteam3
    Moderator

    Aggregation cannot be completed in the map phase because it requires all values for a key to be sorted and grouped, and each mapper executes on a single input split (typically one data block). A mapper sees only the records of its own split and has no access to the data of any other split. One mapper task is started per split, and its map() method is called once for each record in that split. The mapper output is written to local disk, then shuffled and sorted before the reduce phase.
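
The flow described in this post can be sketched as a toy pipeline in plain Python (not the Hadoop API). All names here (`map_record`, `run_mapper`, `shuffle_and_sort`, `run_reducer`) and the sample splits are assumptions for illustration. Each mapper handles one split and calls the map function once per record; the outputs are then merged and sorted by key so the reducer can group and sum them.

```python
from itertools import groupby
from operator import itemgetter


def map_record(line):
    """Called once per input record; emits (word, 1) pairs."""
    return [(word, 1) for word in line.split()]


def run_mapper(split):
    """One mapper task handles one split, calling map_record per record."""
    output = []
    for line in split:
        output.extend(map_record(line))
    return output


def shuffle_and_sort(mapper_outputs):
    """Merge the output of all mappers and sort it by key."""
    merged = [pair for out in mapper_outputs for pair in out]
    return sorted(merged, key=itemgetter(0))


def run_reducer(sorted_pairs):
    """Group consecutive pairs by key and sum their values.

    groupby only merges adjacent equal keys, so the input must be sorted.
    """
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(sorted_pairs, key=itemgetter(0))
    }


# Hypothetical input: two splits, each a list of text records.
splits = [["a b", "a c"], ["b a"]]
mapper_outputs = [run_mapper(s) for s in splits]
result = run_reducer(shuffle_and_sort(mapper_outputs))
print(result)  # {'a': 3, 'b': 2, 'c': 1}
```

The reducer's grouping step relies on the sorted input, which is why the sort has to happen before the reduce phase.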

    #5715

    dfbdteam3
    Moderator

    We cannot perform aggregation in the Mapper because:

    1) A MapReduce job runs many mappers, so aggregating there would require them to communicate with each other, and that communication would increase network congestion.

    2) Before data can be aggregated, it must be sorted.

    Since the data reaching each Reducer is already partitioned and sorted by key, it is best to perform aggregation on the Reducer side.
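
To make "partitioned and sorted" concrete, here is a minimal sketch in plain Python. It is not Hadoop's implementation: Hadoop's default HashPartitioner uses the key's hashCode modulo the number of reduce tasks, while this sketch substitutes `zlib.crc32` as a stand-in hash so the result is deterministic. `NUM_REDUCERS`, `partition`, `partition_and_sort`, and the sample data are hypothetical.

```python
import zlib

NUM_REDUCERS = 2  # hypothetical number of reduce tasks


def partition(key, num_reducers=NUM_REDUCERS):
    """Pick a reducer for a key; the same key always maps to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition_and_sort(mapper_outputs, num_reducers=NUM_REDUCERS):
    """Route every pair to its reducer's bucket, then sort each bucket by key."""
    buckets = [[] for _ in range(num_reducers)]
    for output in mapper_outputs:
        for key, value in output:
            buckets[partition(key, num_reducers)].append((key, value))
    return [sorted(bucket) for bucket in buckets]


# Hypothetical output of two mappers.
mapper_outputs = [
    [("apple", 1), ("banana", 1)],
    [("apple", 1), ("cherry", 1)],
]

reducer_inputs = partition_and_sort(mapper_outputs)
for reducer_id, pairs in enumerate(reducer_inputs):
    print(reducer_id, pairs)
```

Because both "apple" pairs land in the same bucket regardless of which mapper emitted them, a single reducer can compute the complete total for that key.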

