How to sort intermediate output based on values in MapReduce?

Free Online Certification Courses – Learn Today. Lead Tomorrow. Forums Apache Hadoop How to sort intermediate output based on values in MapReduce?

Viewing 2 reply threads
  • Author
    Posts
    • #5675
      DataFlair TeamDataFlair Team
      Spectator

      In MapReduce,by default Intermediate output is sorted by keys.How to Sort Intermediate output based on values ?

    • #5678
      DataFlair TeamDataFlair Team
      Spectator

      The MapReduce framework sorts the data in between map and reduce phase by Key, by default.
      Secondary sorting is used to sort the intermediate output based on values.

      There two possible approach to do so:

      1) In this Reducer reads all the values for a given key and buffers them all. Then do an in-reducer sort on the values. This approach will not scale because the reducer will be receiving all the values for a given key and so it can cause reducer to run out of memory. this can work well when number of values are small.

      2) This approach consists of creating a Composite Key which is having two values i.e. Natural key and Natural value Natural Key should be considered for partitioning and grouping while Natual Value should be considerd for sorting. This approach won’t run out of memory. We need to write a Partitioner class to make sure that all the data with the same key goes to the same Reducer. And so data arrives at Reducer is grouped by the natural key.

      Summary of second approach:

      a) Use a Composite Key(Value-to-Key Conversion): We need to create a Composite key by adding a part or entire value to natural key.
      <K1,V1> –> Map –>(<K2, V2>,V2)
      (<K2,V2>, List(V2)) –> Reduce –> (K3, V3)

      K2,V2 is the composite key. K2 is natural key through which values will be grouped.

      b) Custom Key Comparator: Secondary sorting takes place at custom key comparator. K2 and V2 is compared and sorted based on both {K2,V2}.
      Extend org.apache.hadoop.io.WritableComparator

      c) Natural Key Grouping Comparator: This comparator according to the natural key “groups” all values together. So that, each key {k2,v2} and its associated values List[v2] will go to one reducer.
      Extend org.apache.hadoop.io.WritableComparator

      d) Natural Key Partitioner: It uses the K2 natural key to partition the data to the reducers.
      Extend org.apache.hadoop.mapreduce.Partitioner

    • #5681
      DataFlair TeamDataFlair Team
      Spectator

      In MapReduce, the order of the values within a reduce function call, is typically unspecified and can vary between runs.

      Secondary sort is a technique that allows the MapReduce programmer to control the order that the values show up within a reduce function call.

      Consider the following key-value pair, (key2, list(value2)), as an input for a reducer:
      list(value2) = (V1, V2, …, Vn), where there is no ordering between reducer values (V1, V2, …, Vn).

      The goal of the Secondary Sort pattern is to give some ordering to the values received by a reducer. So, once we apply the pattern to our MapReduce paradigm, then we will have:
      SORT(V1, V2, …, Vn) = (S1, S2, …, Sn)
      list(value2) = (S1, S2, …, Sn)

      where:
      S1 < S2 < … < Sn (ascending order), or
      S1 > S2 > … > Sn (descending order)

      We can do secondary sort by two ways :

      The first approach involves having the reducer read and buffer all of the values for a given key (in an array data structure, for example), then doing an in-reducer sort on the values. This approach will not scale: since the reducer will be receiving all values for a given key, this approach might cause the reducer to run out of memory (java.lang.OutOfMemoryError).

      In second approach we can achieve this using a composite key that contains both the information needed to sort by key and the information needed by value, and then decoupling the grouping of the intermediate data from the sorting of the intermediate data. By sorting, we mean deciding the order that map output key/value pairs are presented to the reduce functions. We want to sort both by the keys and the values. By grouping, we mean deciding which sets of key/value are lumped together into a single call of the reduce function. We want to group only on the keys so that we don’t get a separate call to the reduce function for each unique value.

      The grouping is done in two places – the partitioner, which routes map outputs to reduce tasks, and the grouping comparator, which groups data within a reduce task. Both of these are pluggable per-job. The sorting is pluggable by setting the output key comparator.

      Lets assume that our secondary sorting is on a composite key made out of Last Name and First Name. To implement the secondary sort feature, we need additional plug-in Java classes as follows:

      USE A COMPOSITE KEY
      A solution for secondary sorting involves doing multiple things. First, instead of simply emitting the LastName as the key from the mapper, we need to emit a composite key, a key that has multiple parts( here we can consider lastName and firstName).

      USE A COMPOSITE KEY COMPARATOR
      The composite key comparator is where the secondary sorting takes place. It compares composite key by lastName and then By firstName.

      USE A NATURAL KEY GROUPING COMPARATOR
      The natural key group comparator “groups” values together according to the natural key. Without this component, each K2={lastName,firstName} and its associated data may go to different reducers. Notice here, we only consider the “natural” key.

      USE A NATURAL KEY PARTITIONER
      The natural key partitioner uses the natural key to partition the data to the reducer(s). Again, note that here, we only consider the “natural” key.

      In Driver code we should set below additions :
      job.setPartitionerClass
      job.setGroupingComparatorClass
      job.setSortComparatorClass

      References:
      https://vangjee.wordpress.com/2012/03/20/secondary-sorting-aka-sorting-values-in-hadoops-mapreduce-programming-paradigm/
      https://www.quora.com/What-is-secondary-sort-in-Hadoop-and-how-does-it-work
      https://www.safaribooksonline.com/library/view/data-algorithms/9781491906170/ch01.html

Viewing 2 reply threads
  • You must be logged in to reply to this topic.