Why can't aggregation be done in the Mapper?

This topic has 3 replies, 1 voice, and was last updated 7 years, 10 months ago by DataFlair Team.

- September 20, 2018 at 4:00 pm #5712
  
  DataFlair Team
  Spectator
  
  Why do we need the Reducer to perform aggregation in MapReduce?
  Why can't we perform aggregation in the Mapper?
- September 20, 2018 at 4:00 pm #5713
  
  DataFlair Team
  Spectator
  
  A Mapper task processes each input record (from the RecordReader) and generates key-value pairs. The Mapper stores this intermediate output on the local disk.
  We cannot perform the complete aggregation in the Mapper because:
  
  1) The map function receives records one at a time, in input order, and has no provision for sorting them. The framework sorts and groups the intermediate output by key only after the map function has emitted it, and only the Reducer receives all the values for a key together. Without that grouping, aggregation is not possible.
  2) Aggregation needs the output of all the mappers for a given key. That output cannot be collected in the map phase, because the mappers run on different machines, wherever the data blocks are stored.
  3) Aggregating at the mappers would require communication between all the mapper tasks, which may be running on different machines. This would consume a lot of network bandwidth and could cause a network bottleneck.
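
The flow described in the reply above can be sketched in a few lines of plain Python. This is only a toy simulation, not Hadoop code: the function names (`map_split`, `shuffle`, `reduce_counts`) and the sample data are made up for illustration. Each mapper sees only its own split, and the totals become complete only after the shuffle has brought all the values for a key together.

```python
from collections import defaultdict

def map_split(lines):
    """Mapper: emit one (word, 1) pair per word, for this split only."""
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(mapper_outputs):
    """Shuffle and sort: collect the pairs of every mapper and group them by key."""
    grouped = defaultdict(list)
    for output in mapper_outputs:
        for key, value in output:
            grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reduce_counts(key, values):
    """Reducer: receives all the values for a key, so the sum is complete."""
    return key, sum(values)

# Two input splits, as if they were stored on two different machines.
splits = [["apple banana", "apple"], ["banana apple"]]

mapper_outputs = [list(map_split(split)) for split in splits]
totals = dict(reduce_counts(k, v) for k, v in shuffle(mapper_outputs).items())
print(totals)
```

Neither mapper on its own could report the total for `apple`, because part of the data for that key lives in the other split.
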
- September 20, 2018 at 4:01 pm #5714
  
  DataFlair Team
  Spectator
  
  Aggregation cannot be done in the Mapper phase because it requires the data to be sorted and grouped by key, and each mapper executes on a single input split (typically one data block). A mapper never sees the records of the other input splits, so it cannot produce a complete result for a key. A new mapper task is initialized for each input split, and its map function is called once for each record. The mapper's output is written to the local disk and then passes through the shuffle and sort process before the reducer phase.
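
In Hadoop, one map task is created per input split and its map function is invoked once per record of that split. The sketch below imitates that lifecycle in plain Python to show how far a mapper's view reaches. `SplitMapper` and `run_task` are hypothetical names, not part of the Hadoop API, and the records are made-up sample data.

```python
class SplitMapper:
    """Toy stand-in for one map task (hypothetical, not the Hadoop API)."""

    def __init__(self):
        self.calls = 0     # how many times map() was invoked
        self.partial = {}  # running sums for the keys seen in this split

    def map(self, record):
        # Called once per record; the state above lives for the whole split.
        self.calls += 1
        key, amount = record.split(",")
        self.partial[key] = self.partial.get(key, 0) + int(amount)

def run_task(split):
    """Create one mapper for the split and feed it every record."""
    mapper = SplitMapper()  # one instance per split, not per record
    for record in split:
        mapper.map(record)
    return mapper

split_a = ["north,10", "south,5", "north,7"]
split_b = ["north,3", "south,8"]

task_a, task_b = run_task(split_a), run_task(split_b)
print(task_a.calls, task_a.partial)
print(task_b.calls, task_b.partial)
```

Each task ends up with sums for its own split only. The figures for `north` and `south` still have to be combined across tasks, which is the job of the reducer.
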
- September 20, 2018 at 4:01 pm #5715
  
  DataFlair Team
  Spectator
  
  We cannot perform aggregation in the Mapper because:
  
  1) A MapReduce job contains many mappers, so aggregation would require communication between them. If the mappers communicated with each other, network congestion would increase.
  
  2) The data must be sorted before it can be aggregated.
  
  On the Reducer side the data is already partitioned and sorted, so it is best to perform aggregation there.
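
A rough Python sketch of why partitioned and sorted input makes aggregation straightforward on the reducer side. It is a simulation under stated assumptions: `NUM_REDUCERS`, `partition`, `run_reducer`, and the sample pairs are invented for the example, and `zlib.crc32` merely stands in for the hash that Hadoop's default partitioner takes of the key.

```python
from itertools import groupby
from operator import itemgetter
import zlib

NUM_REDUCERS = 2  # assumed value for this sketch

def partition(key):
    """Choose a reducer from the key alone, so equal keys always meet."""
    return zlib.crc32(key.encode("utf-8")) % NUM_REDUCERS

def run_reducer(pairs):
    """Sort by key, then sum each run of equal keys in a single pass."""
    ordered = sorted(pairs, key=itemgetter(0))
    return {
        key: sum(value for _, value in run)
        for key, run in groupby(ordered, key=itemgetter(0))
    }

# Intermediate output of two mappers (made-up data).
mapper_outputs = [
    [("cat", 1), ("dog", 1), ("cat", 1)],
    [("dog", 1), ("emu", 1), ("cat", 1)],
]

# Shuffle: route every pair to the reducer chosen by its key.
reducer_inputs = [[] for _ in range(NUM_REDUCERS)]
for output in mapper_outputs:
    for key, value in output:
        reducer_inputs[partition(key)].append((key, value))

results = [run_reducer(pairs) for pairs in reducer_inputs]
print(results)
```

Because the partition depends only on the key, every key lands on exactly one reducer, and sorting places its values next to each other so that they can be summed in one pass.
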