The Mapper task processes each input record (supplied by the RecordReader) and generates intermediate key-value pairs. The Mapper stores this intermediate output on the local disk.
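As a rough illustration of the per-record behaviour described above, here is a minimal Python sketch of a word-count style map function. It is not Hadoop code: the name `map_record` and its `(offset, line)` arguments are assumptions made for this example, loosely mirroring the key and value a RecordReader would hand to a map call.

```python
def map_record(offset, line):
    """Hypothetical map function: emit (word, 1) for each word in one record.

    `offset` stands in for the input key (for example a byte offset) and is
    unused here; `line` is the record value.
    """
    for word in line.split():
        yield word, 1


# One call handles exactly one record and knows nothing about other records.
pairs = list(map_record(0, "apple banana apple"))
print(pairs)  # [('apple', 1), ('banana', 1), ('apple', 1)]
```

Note that the function emits `('apple', 1)` twice rather than a single total. It only turns one record into pairs and leaves the summing to a later stage.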
We cannot perform a complete aggregation in the mapper because:
Records are grouped by key only after the shuffle and sort phase, which delivers the sorted, merged output of all mappers to the Reducer. The map function itself sees one record at a time and has no provision for grouping all the values of a key. Without that grouping, aggregation is not possible.
To perform aggregation, we need the output of all the mapper tasks. That output cannot be collected during the map phase, because the mappers may be running on different machines, wherever the data blocks are stored.
If we tried to aggregate the data in the mapper, every mapper would have to communicate with all the others, which may be running on different machines. This would consume a lot of network bandwidth and could create a network bottleneck.
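The points above can be illustrated with a small in-memory simulation in Python. This is only a sketch under simplifying assumptions: the splits are plain lists of lines, and `run_mapper`, `shuffle_and_sort`, and `run_reducer` are hypothetical helper names rather than Hadoop APIs. It compares what one mapper can see on its own with the totals obtained once the output of all mappers is grouped by key.

```python
from collections import defaultdict


def run_mapper(split):
    """Emit (word, 1) for every word in the lines of ONE input split."""
    return [(word, 1) for line in split for word in line.split()]


def shuffle_and_sort(map_outputs):
    """Group the values of each key across the given map outputs, sorted by key."""
    grouped = defaultdict(list)
    for output in map_outputs:
        for key, value in output:
            grouped[key].append(value)
    return sorted(grouped.items())


def run_reducer(key, values):
    """Aggregate all values collected for one key."""
    return key, sum(values)


# Two splits, imagined to live on two different machines.
splits = [["apple banana", "apple"], ["banana apple"]]
map_outputs = [run_mapper(split) for split in splits]

# What the first mapper could compute alone: a partial count only.
local_view = dict(run_reducer(k, v) for k, v in shuffle_and_sort([map_outputs[0]]))

# What the reduce side computes after seeing every mapper's output.
totals = dict(run_reducer(k, v) for k, v in shuffle_and_sort(map_outputs))

print(local_view)  # {'apple': 2, 'banana': 1}
print(totals)      # {'apple': 3, 'banana': 2}
```

The first mapper's count for `apple` is 2, while the true total is 3, because the remaining occurrence sits in a split that this mapper never reads.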
Aggregation cannot be done in the map phase because it requires the data to be sorted and grouped by key, and each mapper runs on a single input split (typically one data block). A mapper has no access to the records of other splits, so it never sees every value for a key. The map function is called once for each record in the split; a new mapper task is not started for each row. The mapper's output is written to the local disk and then goes through the shuffle and sort process before the reduce phase.