Why aggregation cannot be done in Mapper?
- This topic has 3 replies, 1 voice, and was last updated 5 years, 5 months ago by DataFlair Team.
September 20, 2018 at 4:00 pm #5712 DataFlair Team (Spectator)
Why do we need the Reducer to perform aggregation in MapReduce?
Why can't we perform aggregation in the Mapper?
September 20, 2018 at 4:00 pm #5713 DataFlair Team (Spectator)
A Mapper task processes each input record (from the RecordReader) and generates key-value pairs. The Mapper stores its intermediate output on the local disk.
We cannot perform the final aggregation in the Mapper because:
1) Aggregation needs all values for a key grouped together. Hadoop does this grouping by sorting and merging the intermediate data by key during the shuffle, and only the Reducer receives the merged result. Any sorting on the map side covers a single mapper's output, so it can group only the records that mapper has seen.
2) To perform aggregation, we need the output of all the mappers for a given key. This cannot be collected in the map phase, because the mappers run on different machines, wherever the data blocks are stored.
3) If we tried to aggregate at the mapper, every mapper would need to communicate with every other mapper, possibly across machines. This would consume a lot of network bandwidth and could create a network bottleneck.
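The points above can be illustrated with a small plain-Python simulation. This is only a sketch, not Hadoop code: the two input splits, the word-count task, and the function names `map_words`, `local_totals`, and `reduce_all` are all assumptions made for the example. It shows that each mapper, working alone, can produce totals only for its own split, while the step that sees every mapper's output produces the complete totals.

```python
from collections import defaultdict

# Hypothetical input: two splits that would live on two different machines.
splits = [
    ["apple banana apple"],
    ["banana apple cherry"],
]

def map_words(split):
    """Mapper: emit (word, 1) for every word in this split only."""
    return [(word, 1) for line in split for word in line.split()]

def local_totals(pairs):
    """What one mapper could compute alone: totals for its own split."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def reduce_all(all_pairs):
    """Reducer side: receives pairs from every mapper, so totals are global."""
    totals = defaultdict(int)
    for key, value in all_pairs:
        totals[key] += value
    return dict(totals)

mapper_outputs = [map_words(split) for split in splits]
per_mapper = [local_totals(pairs) for pairs in mapper_outputs]
global_totals = reduce_all(pair for out in mapper_outputs for pair in out)

print(per_mapper)      # partial counts, one dict per mapper
print(global_totals)   # complete counts across all splits
```

Neither entry of `per_mapper` contains the full count for "apple"; only `global_totals` does, because it is computed after the outputs of both mappers have been brought together.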
September 20, 2018 at 4:01 pm #5714 DataFlair Team (Spectator)
Aggregation cannot be done in the Mapper phase because aggregation requires the data to be sorted and grouped by key, and each mapper task processes only one input split (typically one data block). A mapper never sees the records of the other splits, so it cannot produce a complete result for a key. A new mapper is not initialized for each row: one mapper task is started per input split, and its map() method is called once for each record in that split. The mapper's output is written to the local disk and then goes through the shuffle and sort process before the reduce phase.
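To see why sorting matters for aggregation, here is a minimal Python sketch of the sort-then-group step. It is an illustration under assumptions, not Hadoop's implementation: the sample pairs and the helper names `shuffle_and_sort` and `reduce_sum` are made up for the example. Once the pairs are sorted by key, equal keys sit next to each other, so a single pass can sum each run of values.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical intermediate pairs collected from several mappers, unsorted.
intermediate = [("b", 2), ("a", 1), ("b", 5), ("a", 4), ("c", 3)]

def shuffle_and_sort(pairs):
    """Sort by key so that all pairs with the same key become adjacent."""
    return sorted(pairs, key=itemgetter(0))

def reduce_sum(pairs):
    """Walk the stream once, summing each run of adjacent equal keys."""
    return [
        (key, sum(value for _, value in group))
        for key, group in groupby(pairs, key=itemgetter(0))
    ]

# With sorting, each key is aggregated exactly once.
result = reduce_sum(shuffle_and_sort(intermediate))

# Without sorting, the same key shows up in several fragments.
fragmented = reduce_sum(intermediate)

print(result)
print(fragmented)
```

Running the same single-pass reducer on the unsorted stream leaves "a" and "b" split into separate fragments, which is the situation the post describes when it says aggregation needs sorted data.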
September 20, 2018 at 4:01 pm #5715 DataFlair Team (Spectator)
We cannot perform aggregation in the Mapper because:
1) A MapReduce job contains many mappers, so aggregation would require communication between them. If the mappers communicated with each other, network congestion would increase.
2) Before aggregating, the data must be sorted.
Since each reducer receives its partition of the data already sorted by key, it is best to perform aggregation on the Reducer side.
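The role of partitioning can be sketched in plain Python as well. This is a simplified model under stated assumptions: `partition` and `route` are hypothetical helpers, and `zlib.crc32` is used only as a deterministic stand-in hash, not as the hash Hadoop uses. The idea it demonstrates is that every pair with a given key is routed to the same reducer, so each reducer can aggregate its keys without talking to any other reducer.

```python
import zlib

def partition(key, num_reducers):
    """Stand-in hash partitioner: the same key always maps to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def route(mapper_outputs, num_reducers):
    """Send every (key, value) pair to its reducer's bucket, then sort each bucket."""
    buckets = [[] for _ in range(num_reducers)]
    for output in mapper_outputs:
        for key, value in output:
            buckets[partition(key, num_reducers)].append((key, value))
    return [sorted(bucket) for bucket in buckets]

# Hypothetical outputs of two mappers running on different machines.
mapper_outputs = [
    [("x", 1), ("y", 2)],
    [("y", 3), ("x", 4), ("z", 5)],
]

buckets = route(mapper_outputs, num_reducers=2)

for index, bucket in enumerate(buckets):
    print(f"reducer {index}: {bucket}")
```

Which reducer a key lands on depends on the hash, but a key never appears in more than one bucket, which is what makes reducer-side aggregation complete.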