Shuffling and Sorting in Hadoop MapReduce

Boost your career with Data Engineering Courses!!

1. Objective

In Hadoop, the process by which the intermediate output from mappers is transferred to the reducer is called Shuffling. Reducer gets 1 or more keys and associated values on the basis of reducers. Intermediated key-value generated by mapper is sorted automatically by key. In this blog, we will discuss in detail about shuffling and Sorting in Hadoop MapReduce.

Here we will learn what is sorting in Hadoop, what is shuffling in Hadoop, what is the purpose of Shuffling and sorting phase in MapReduce, how MapReduce shuffle works and how MapReduce sort works. We will also learn what is secondary sorting in MapReduce?

To learn Hadoop Cloudera CDH5 installation follow this installation guide.

Shuffling and Sorting in Hadoop MapReduce

2. What is Shuffling and Sorting in Hadoop MapReduce?

Before we start with Shuffle and Sort in MapReduce, let us revise the other phases of MapReduce like Mapper, reducer in MapReduce, Combiner, partitioner in MapReduce and inputFormat in MapReduce.

Shuffle phase in Hadoop transfers the map output from Mapper to a Reducer in MapReduce. Sort phase in MapReduce covers the merging and sorting of map outputs. Data from the mapper are grouped by the key, split among reducers and sorted by the key. Every reducer obtains all values associated with the same key. Shuffle and sort phase in Hadoop occur simultaneously and are done by the MapReduce framework.

Let us now understand both these processes in details below:

3. Shuffling in MapReduce

The process of transferring data from the mappers to reducers is known as shuffling i.e. the process by which the system performs the sort and transfers the map output to the reducer as input. So, MapReduce shuffle phase is necessary for the reducers, otherwise, they would not have any input (or input from every mapper). As shuffling can start even before the map phase has finished so this saves some time and completes the tasks in lesser time.

4. Sorting in MapReduce

The keys generated by the mapper are automatically sorted by MapReduce Framework, i.e. Before starting of reducer, all intermediate key-value pairs in MapReduce that are generated by mapper get sorted by key and not by value. Values passed to each reducer are not sorted; they can be in any order.

Sorting in Hadoop helps reducer to easily distinguish when a new reduce task should start. This saves time for the reducer. Reducer starts a new reduce task when the next key in the sorted input data is different than the previous. Each reduce task takes key-value pairs as input and generates key-value pair as output.

Note that shuffling and sorting in Hadoop MapReduce is not performed at all if you specify zero reducers (setNumReduceTasks(0)). Then, the MapReduce job stops at the map phase, and the map phase does not include any kind of sorting (so even the map phase is faster).

5. Secondary Sorting in MapReduce

If we want to sort reducer’s values, then the secondary sorting technique is used as it enables us to sort the values (in ascending or descending order) passed to each reducer.

6. Conclusion

In conclusion, Shuffling-Sorting occurs simultaneously to summarize the Mapper intermediate output. Shuffling and sorting in Hadoop MapReduce are not performed at all if you specify zero reducers (setNumReduceTasks(0)).

If you find this blog helpful, or you have any query in Shuffling and Sorting in Hadoop, so, please leave a comment. Hope we will solve your queries.
See Also-

Reference

Did you know we work 24x7 to provide you best tutorials
Please encourage us - write a review on Google

prakash says:
November 8, 2017 at 12:59 pm
Excellent information for easily understand
PRADEEP KUMAR TIWARI says:
November 21, 2017 at 7:13 pm
awesome tutorial….every line is very well explained…lovely
Akash Chaudhary says:
April 26, 2018 at 4:05 am
Does sequence file of the mapper’s output add some dummy data to the original data ?
P.S:Sorry to ask question out of the topic.
Rohit says:
June 30, 2018 at 3:41 pm
Can we sort the result in ascending and descending order? As I know for shuffling we don’t need to write any code as it automatically sort the data in ascending order but what will the code if someone wants sorting in descending order?
Pavan Kumar Reddy Palle says:
February 18, 2019 at 5:59 pm
Explanation is clear and straight
- DataFlair Team says:
  February 19, 2019 at 12:00 pm
  Hi Pavan,
  Thanks for the comment on “Hadoop Shuffling and sorting”. Please refer our sidebar for more MapReduce tutorials.
  Regards,
  DataFlair
Gojko says:
April 20, 2019 at 6:37 pm
This tutorial is really helpful in understanding how hadoop actually works.
Thank you.
Kumar says:
May 29, 2019 at 4:18 pm
Hi,
How does the data start to flow into the Reducer task? Which module or which phase initiates the start of data copy once the mapper function completes the task. How will the mapper know where to send the output of the mapper task data or if its the reducer that starts to copy then how will it know which mapper has finished the tasks?
Thanks.
R.Karthik says:
July 5, 2020 at 7:47 pm
really a very well defined definitions and explanations.thanks for your effort
Ankit yadav says:
July 14, 2021 at 10:13 am
Sorting is done at both the mappers and the reducers. Correct me if I am wrong.
Bhaskar says:
March 7, 2022 at 11:05 am
How does secondary sorting work? From what I understand secondary sorting is for sorting complex keys, instead of values of keys.
Saheed says:
October 1, 2022 at 3:59 pm
There should be a step by step practical implementation/running of a mapreduce program in hadoop showing the metrics

Shuffling and Sorting in Hadoop MapReduce

1. Objective

2. What is Shuffling and Sorting in Hadoop MapReduce?

3. Shuffling in MapReduce

4. Sorting in MapReduce

5. Secondary Sorting in MapReduce

6. Conclusion

12 Responses

Leave a Reply Cancel reply

About DataFlair

Trending Courses

Trending Data Science Courses

Free Big Data Courses

Trending Programming Courses

Trending Data Science Tutorials

Trending Projects

Trending Programming Tutorials

Trending Tutorials