Shuffling and Sorting in Hadoop MapReduce
1. Objective
In Hadoop, the process by which the intermediate output from mappers is transferred to the reducer is called Shuffling. Reducer gets 1 or more keys and associated values on the basis of reducers. Intermediated key-value generated by mapper is sorted automatically by key. In this blog, we will discuss in detail about shuffling and Sorting in Hadoop MapReduce.
Here we will learn what is sorting in Hadoop, what is shuffling in Hadoop, what is the purpose of Shuffling and sorting phase in MapReduce, how MapReduce shuffle works and how MapReduce sort works. We will also learn what is secondary sorting in MapReduce?
To learn Hadoop Cloudera CDH5 installation follow this installation guide.
2. What is Shuffling and Sorting in Hadoop MapReduce?
Before we start with Shuffle and Sort in MapReduce, let us revise the other phases of MapReduce like Mapper, reducer in MapReduce, Combiner, partitioner in MapReduce and inputFormat in MapReduce.
Shuffle phase in Hadoop transfers the map output from Mapper to a Reducer in MapReduce. Sort phase in MapReduce covers the merging and sorting of map outputs. Data from the mapper are grouped by the key, split among reducers and sorted by the key. Every reducer obtains all values associated with the same key. Shuffle and sort phase in Hadoop occur simultaneously and are done by the MapReduce framework.
Let us now understand both these processes in details below:
3. Shuffling in MapReduce
The process of transferring data from the mappers to reducers is known as shuffling i.e. the process by which the system performs the sort and transfers the map output to the reducer as input. So, MapReduce shuffle phase is necessary for the reducers, otherwise, they would not have any input (or input from every mapper). As shuffling can start even before the map phase has finished so this saves some time and completes the tasks in lesser time.
4. Sorting in MapReduce
The keys generated by the mapper are automatically sorted by MapReduce Framework, i.e. Before starting of reducer, all intermediate key-value pairs in MapReduce that are generated by mapper get sorted by key and not by value. Values passed to each reducer are not sorted; they can be in any order.
Sorting in Hadoop helps reducer to easily distinguish when a new reduce task should start. This saves time for the reducer. Reducer starts a new reduce task when the next key in the sorted input data is different than the previous. Each reduce task takes key-value pairs as input and generates key-value pair as output.
Note that shuffling and sorting in Hadoop MapReduce is not performed at all if you specify zero reducers (setNumReduceTasks(0)). Then, the MapReduce job stops at the map phase, and the map phase does not include any kind of sorting (so even the map phase is faster).
5. Secondary Sorting in MapReduce
If we want to sort reducer’s values, then the secondary sorting technique is used as it enables us to sort the values (in ascending or descending order) passed to each reducer.
6. Conclusion
In conclusion, Shuffling-Sorting occurs simultaneously to summarize the Mapper intermediate output. Shuffling and sorting in Hadoop MapReduce are not performed at all if you specify zero reducers (setNumReduceTasks(0)).
If you find this blog helpful, or you have any query in Shuffling and Sorting in Hadoop, so, please leave a comment. Hope we will solve your queries.
See Also-
Did you like our efforts? If Yes, please give DataFlair 5 Stars on Google
Excellent information for easily understand
awesome tutorial….every line is very well explained…lovely
Does sequence file of the mapper’s output add some dummy data to the original data ?
P.S:Sorry to ask question out of the topic.
Can we sort the result in ascending and descending order? As I know for shuffling we don’t need to write any code as it automatically sort the data in ascending order but what will the code if someone wants sorting in descending order?
Explanation is clear and straight
Hi Pavan,
Thanks for the comment on “Hadoop Shuffling and sorting”. Please refer our sidebar for more MapReduce tutorials.
Regards,
DataFlair
This tutorial is really helpful in understanding how hadoop actually works.
Thank you.
Hi,
How does the data start to flow into the Reducer task? Which module or which phase initiates the start of data copy once the mapper function completes the task. How will the mapper know where to send the output of the mapper task data or if its the reducer that starts to copy then how will it know which mapper has finished the tasks?
Thanks.
really a very well defined definitions and explanations.thanks for your effort
Sorting is done at both the mappers and the reducers. Correct me if I am wrong.
How does secondary sorting work? From what I understand secondary sorting is for sorting complex keys, instead of values of keys.
There should be a step by step practical implementation/running of a mapreduce program in hadoop showing the metrics