In MapReduce how to sort intermediate output based on values

This topic contains 1 reply, has 1 voice, and was last updated by  dfbdteam3 1 year, 6 months ago.

  • #4751

    dfbdteam3
    Moderator

    By default, intermediate output is sorted by keys. How can we sort the intermediate output (from the mapper) based on values? Can we change the sort order before the input is given to the Reducer?

    #4753

    dfbdteam3
    Moderator

    Apache Hadoop sorts the intermediate output (the output of the Mapper) before it is shuffled to the Reducers. This sorting is based on the keys.

    To sort intermediate data based on values, we need to use secondary sorting. By modifying the format of the key, secondary sorting lets us take the value into account during the sort phase. There are two possible approaches:

    • First approach – The Reducer reads all of the values for a given key, buffers them, and does an in-reducer sort on the values. Since the reducer receives all the values for a given key (potentially a huge list), it can run out of memory. This approach works well when the number of values is small.
    • Second approach – The MapReduce framework itself sorts the reducer's input values, by creating a "combined key" (key-value). This is achieved by adding the value to the key for sorting, so this approach does not generate out-of-memory errors.
      One writes a custom partitioner to make sure that all the data with the same natural key goes to the same reducer, and a custom grouping comparator so that data arriving at the reducer is grouped by the natural key.
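    As a rough sketch, the buffering logic of the first approach can be shown in plain Java (the method name and types here are illustrative, not the Hadoop API):

```java
import java.util.*;

// Sketch of the first approach: the reducer buffers every value for a key
// in memory and sorts the buffer itself. Feasible only for small value lists.
public class InReducerSort {

    // Stand-in for a reduce(key, values) call: buffer, then sort in memory.
    public static List<Integer> reduce(String key, Iterable<Integer> values) {
        List<Integer> buffer = new ArrayList<>();
        for (Integer v : values) {
            buffer.add(v);          // the whole value list is held in memory
        }
        Collections.sort(buffer);   // the in-reducer sort on the values
        return buffer;
    }
}
```

    The memory risk is visible here: the buffer grows with the number of values per key, which is exactly what the second approach avoids.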

    Of the above two approaches, an explicit sort on values in the reducer will most likely be faster, at the risk of running out of memory. Implementing the "value to key" conversion technique offloads the sorting to the MapReduce framework, which lies at the heart of what Hadoop/MapReduce is designed to do.

    The following steps take place for secondary sorting using the second approach:

    1) Use a composite key (value-to-key conversion):
    Create a composite key by adding part or all of the value to the natural key, as shown below:

    Normal case:

    (k1,v1) --> Map --> (k2,v2)
    (k2,List[v2]) --> Reduce --> (k3,v3)

    Here, k2 is the natural key of the intermediate output.

    Secondary sort case:

    (k1,v1) --> Map --> ({k2,v2},v2)
    ({k2,v2},List[v2]) --> Reduce --> (k3,v3)

    Here, {k2,v2} is the composite key. k2 is referred to as the natural key, by which values will be grouped.
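    A minimal plain-Java sketch of such a composite key (in a real Hadoop job this class would implement org.apache.hadoop.io.WritableComparable and also serialize its two fields; the class and field names here are illustrative):

```java
// Composite key {k2, v2}: the value is folded into the key so that the
// framework's sort phase orders records by value as well as by key.
public class CompositeKey implements Comparable<CompositeKey> {
    final String naturalKey;  // k2 - used for grouping and partitioning
    final int value;          // v2 - included so the framework sorts on it

    public CompositeKey(String naturalKey, int value) {
        this.naturalKey = naturalKey;
        this.value = value;
    }

    // Order by natural key first, then by value.
    @Override
    public int compareTo(CompositeKey other) {
        int cmp = naturalKey.compareTo(other.naturalKey);
        return (cmp != 0) ? cmp : Integer.compare(value, other.value);
    }
}
```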

    2) Use a custom key comparator:
    Secondary sorting takes place in the custom key comparator. Composite keys are compared and sorted on both components, {k2,v2}; all the components of the composite key are considered.

    Extend org.apache.hadoop.io.WritableComparator
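    The ordering that the overridden compare() of such a WritableComparator subclass would implement can be expressed as a standalone comparator (the composite key is modelled as a String[]{naturalKey, value} pair purely for illustration):

```java
import java.util.Comparator;

// Key comparator sketch: compares composite keys on both components,
// natural key first and value second, as the custom key comparator would.
public class KeyComparatorSketch {
    public static final Comparator<String[]> KEY_COMPARATOR =
        Comparator.comparing((String[] k) -> k[0])                // k2 first
                  .thenComparingInt(k -> Integer.parseInt(k[1])); // then v2
}
```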

    3) Use a natural key grouping comparator:
    This comparator "groups" values together according to the natural key, so that all composite keys {k2,v2} sharing the same natural key k2, together with their associated values List[v2], go to one reducer call.

    Extend org.apache.hadoop.io.WritableComparator.
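    The grouping comparator's compare() looks at the natural key only; a plain-Java sketch of that logic (again using String[]{naturalKey, value} as the composite key for illustration):

```java
// Grouping comparator sketch: two composite keys are "equal" for grouping
// whenever their natural keys match, so the value part is ignored here.
public class GroupingComparatorSketch {
    public static int compareForGrouping(String[] left, String[] right) {
        return left[0].compareTo(right[0]); // only k2, never v2
    }
}
```

    Because this comparator returns 0 for keys sharing a natural key, the framework hands all of their (already value-sorted) records to a single reduce() call.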

    4) Use a natural key partitioner:
    The partitioner uses the natural key k2 to partition the data to the reducer(s).
    Extend org.apache.hadoop.mapreduce.Partitioner.
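    A sketch of the getPartition() logic such a Partitioner subclass would override: hash only the natural key, so every composite key sharing one k2 lands on the same reducer. The hashing scheme shown mirrors Hadoop's default HashPartitioner; the class name is illustrative:

```java
// Natural key partitioner sketch: partition on k2 only, so all {k2, v2}
// composite keys with the same natural key go to the same reducer.
public class NaturalKeyPartitionerSketch {
    public static int getPartition(String naturalKey, int numReduceTasks) {
        // Mask the sign bit so the partition number is non-negative.
        return (naturalKey.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```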
