If we set the number of reducers to 0 (by calling job.setNumReduceTasks(0)), no reducer will execute and no aggregation will take place. In such a case, we prefer a “map-only job” in Hadoop.
Map-only job:
In a map-only job, each map task does all the work on its InputSplit and no reducer runs. The mapper output is the final output. In a regular job, a sort and shuffle phase sits between the map and reduce phases. It sorts the keys in ascending order and then groups the values that share the same key. This phase is very expensive, so if the reduce phase is not required we should avoid it. Avoiding the reduce phase eliminates the sort and shuffle phase as well. It also reduces network congestion: during shuffling the mapper output travels over the network to the reducers, and when the data set is huge, a large amount of data has to travel.
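To make the cost of sort and shuffle concrete, here is a minimal plain-Java sketch (no Hadoop involved) that contrasts what a map-only job would emit with what a reducer would receive after sorting and grouping. The class ShuffleSketch, its methods, and the sample lines are hypothetical and exist only for illustration; the TreeMap is merely an in-memory stand-in for Hadoop's distributed sort.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: an in-memory analogy, not how Hadoop implements the shuffle.
class ShuffleSketch {

    // Map phase: emit one (word, 1) pair per word, in input order.
    static List<Map.Entry<String, Integer>> mapPhase(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
                }
            }
        }
        return pairs;
    }

    // Sort and shuffle: order keys ascending and group the values of each key.
    static TreeMap<String, List<Integer>> sortAndShuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("pear apple", "apple fig");
        List<Map.Entry<String, Integer>> mapOutput = mapPhase(lines);
        // A map-only job stops here: records stay unsorted and ungrouped.
        System.out.println("map-only output:        " + mapOutput);
        // A job with reducers pays for this extra step before reduce runs.
        System.out.println("after sort and shuffle: " + sortAndShuffle(mapOutput));
    }
}
```

In the map-only case the pairs come out in input order with duplicate keys left as they are. Only the second step brings all values for "apple" together, and in a real cluster that step is the one that moves data across the network.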
In a regular MapReduce job, mapper output is written to local disk before being sent to the reducer, but in a map-only job this output is written directly to HDFS. This further saves time and reduces cost.
The number of reducers can be set to 0 in the driver class with job.setNumReduceTasks(0). This means there is no reduce phase, only a map phase. Such a job is called a map-only job.
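As a rough sketch of such a driver, assuming the org.apache.hadoop.mapreduce API is on the classpath: the class names MapOnlyDriver and UpperCaseMapper are hypothetical, the mapper's upper-casing logic is just a placeholder, and the input and output paths are assumed to arrive as the two command-line arguments.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver for illustration; names and mapper logic are assumptions.
public class MapOnlyDriver {

    // Placeholder mapper: upper-cases each input line and emits it as the value.
    public static class UpperCaseMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {
        private final Text out = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            out.set(line.toString().toUpperCase());
            context.write(NullWritable.get(), out);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "map-only example");
        job.setJarByClass(MapOnlyDriver.class);
        job.setMapperClass(UpperCaseMapper.class);

        // Zero reducers: no sort, no shuffle, mapper output is the job output.
        job.setNumReduceTasks(0);

        // With no reducer, these describe the mapper's output types.
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note that no reducer class is set at all. With zero reduce tasks, each map task writes its own output file, named part-m-00000, part-m-00001 and so on, rather than the part-r-* files a reducer would produce.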
A map-only job has only a map phase. The mapper output is stored directly on HDFS, not on the local disk, and it is the final output. Because there is no reduce phase, no aggregation or sorting is done either. In a regular MapReduce job, the output goes to the reducer after shuffling and sorting, and when the data is huge this needs good network bandwidth. Since a map-only job has no shuffling and sorting, it causes less network congestion.