In MapReduce how to sort intermediate output based on values

This topic contains 1 reply, has 1 voice, and was last updated by  dfbdteam3 1 year, 6 months ago.

  • #4751

    dfbdteam3
    Moderator

    By default, intermediate output is sorted by keys. How can we sort the intermediate output (from the mapper) based on values? Can we change the sort order before the input is given to the Reducer?

    #4753

    dfbdteam3
    Moderator

    Apache Hadoop sorts the intermediate output (the output of the Mapper) before it is shuffled to the Reducers. This sorting is based on the keys.

    To sort intermediate data based on values, we need to use secondary sorting. By modifying the format of the key, secondary sorting lets us take the value into account during the sort phase. There are two possible approaches:

    • First approach – The Reducer reads all of the values for a given key, buffers them, and does an in-reducer sort on the values. Since the reducer receives all the values for a given key (potentially a huge list), it can run out of memory. This approach works well when the number of values is small.
    • Second approach – The MapReduce framework itself sorts the reducer's input values, by creating a "combined key" (key-value). This is achieved by adding the value to the key for sorting, so this approach does not generate out-of-memory errors.
      One writes a custom partitioner to make sure that all the data with the same natural key goes to the same reducer, and a custom grouping comparator so that data arriving at the reducer is grouped by the natural key.
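    As a rough sketch, the buffering logic of the first approach can be shown in plain Java (the method name and types here are illustrative, not the Hadoop API):

```java
import java.util.*;

// Sketch of the first approach: the reducer buffers every value for a key
// in memory and sorts the buffer itself. Feasible only for small value lists.
public class InReducerSort {

    // Stand-in for a reduce(key, values) call: buffer, then sort in memory.
    public static List<Integer> reduce(String key, Iterable<Integer> values) {
        List<Integer> buffer = new ArrayList<>();
        for (Integer v : values) {
            buffer.add(v);          // the whole value list is held in memory
        }
        Collections.sort(buffer);   // the in-reducer sort on the values
        return buffer;
    }
}
```

    The memory risk is visible here: the buffer grows with the number of values per key, which is exactly what the second approach avoids.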

    Of the above two approaches, an explicit sort on values in the reducer will most likely be faster, at the risk of running out of memory. Implementing the "value to key" conversion technique offloads the sorting to the MapReduce framework, which lies at the heart of what Hadoop/MapReduce is designed to do.

    The following steps take place for secondary sorting using the second approach:

    1) Use a composite key (value-to-key conversion):
    Create a composite key by adding part or all of the value to the natural key, as shown below:

    Normal case:

    (k1,v1) --> Map --> (k2,v2)
    (k2,List[v2]) --> Reduce --> (k3,v3)

    Here, k2 is the natural key of the intermediate output.

    Secondary sort case:

    (k1,v1) --> Map --> ({k2,v2},v2)
    ({k2,v2},List[v2]) --> Reduce --> (k3,v3)

    Here, {k2,v2} is the composite key. k2 is referred to as the natural key, by which values will be grouped.
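    A minimal plain-Java sketch of such a composite key (in a real Hadoop job this class would implement org.apache.hadoop.io.WritableComparable and also serialize its two fields; the class and field names here are illustrative):

```java
// Composite key {k2, v2}: the value is folded into the key so that the
// framework's sort phase orders records by value as well as by key.
public class CompositeKey implements Comparable<CompositeKey> {
    final String naturalKey;  // k2 - used for grouping and partitioning
    final int value;          // v2 - included so the framework sorts on it

    public CompositeKey(String naturalKey, int value) {
        this.naturalKey = naturalKey;
        this.value = value;
    }

    // Order by natural key first, then by value.
    @Override
    public int compareTo(CompositeKey other) {
        int cmp = naturalKey.compareTo(other.naturalKey);
        return (cmp != 0) ? cmp : Integer.compare(value, other.value);
    }
}
```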

    2) Use a custom key comparator:
    Secondary sorting takes place in the custom key comparator. Composite keys are compared and sorted on both components, {k2,v2}; all the components of the composite key are considered.

    Extend org.apache.hadoop.io.WritableComparator
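    The ordering that the overridden compare() of such a WritableComparator subclass would implement can be expressed as a standalone comparator (the composite key is modelled as a String[]{naturalKey, value} pair purely for illustration):

```java
import java.util.Comparator;

// Key comparator sketch: compares composite keys on both components,
// natural key first and value second, as the custom key comparator would.
public class KeyComparatorSketch {
    public static final Comparator<String[]> KEY_COMPARATOR =
        Comparator.comparing((String[] k) -> k[0])                // k2 first
                  .thenComparingInt(k -> Integer.parseInt(k[1])); // then v2
}
```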

    3) Use a natural key grouping comparator:
    This comparator "groups" values together according to the natural key, so that all composite keys {k2,v2} sharing the same natural key k2, together with their associated values List[v2], go to one reducer call.

    Extend org.apache.hadoop.io.WritableComparator.
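    The grouping comparator's compare() looks at the natural key only; a plain-Java sketch of that logic (again using String[]{naturalKey, value} as the composite key for illustration):

```java
// Grouping comparator sketch: two composite keys are "equal" for grouping
// whenever their natural keys match, so the value part is ignored here.
public class GroupingComparatorSketch {
    public static int compareForGrouping(String[] left, String[] right) {
        return left[0].compareTo(right[0]); // only k2, never v2
    }
}
```

    Because this comparator returns 0 for keys sharing a natural key, the framework hands all of their (already value-sorted) records to a single reduce() call.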

    4) Use a natural key partitioner:
    The partitioner uses the natural key k2 to partition the data to the reducer(s).
    Extend org.apache.hadoop.mapreduce.Partitioner.
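    A sketch of the getPartition() logic such a Partitioner subclass would override: hash only the natural key, so every composite key sharing one k2 lands on the same reducer. The hashing scheme shown mirrors Hadoop's default HashPartitioner; the class name is illustrative:

```java
// Natural key partitioner sketch: partition on k2 only, so all {k2, v2}
// composite keys with the same natural key go to the same reducer.
public class NaturalKeyPartitionerSketch {
    public static int getPartition(String naturalKey, int numReduceTasks) {
        // Mask the sign bit so the partition number is non-negative.
        return (naturalKey.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```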
