Where sorting is done in MapReduce Job?

    • #5915
      DataFlair Team
      Spectator

      Where sorting is done in MapReduce Job?

    • #5919
      DataFlair Team
      Spectator

      Sorting is done on both the mapper and the reducer nodes.

      What type of algorithm is used in the mapper and the reducer?
      The mapper uses the QuickSort algorithm to sort its output;
      the reducer uses MergeSort to combine the sorted map outputs.
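
      The reduce side does not sort from scratch: each mapper's output arrives already sorted, so the reducer only performs the merge phase of merge sort over those sorted runs. Here is a minimal plain-Java sketch of such a k-way merge (class and method names are mine, not part of the Hadoop API):

      ```java
      import java.util.ArrayList;
      import java.util.Comparator;
      import java.util.List;
      import java.util.PriorityQueue;

      // Sketch of the reducer-side merge step: each input run is already
      // sorted, so a min-heap over the heads of the runs produces the
      // fully sorted output (the "merge" half of merge sort).
      class MergeRuns {
          static List<Integer> merge(List<List<Integer>> runs) {
              // Heap entries: {value, runIndex, positionInRun}
              PriorityQueue<int[]> heap =
                  new PriorityQueue<>(Comparator.comparingInt((int[] e) -> e[0]));
              for (int r = 0; r < runs.size(); r++) {
                  if (!runs.get(r).isEmpty()) {
                      heap.add(new int[]{runs.get(r).get(0), r, 0});
                  }
              }
              List<Integer> out = new ArrayList<>();
              while (!heap.isEmpty()) {
                  int[] e = heap.poll();
                  out.add(e[0]); // smallest remaining head
                  int next = e[2] + 1;
                  List<Integer> run = runs.get(e[1]);
                  if (next < run.size()) {
                      heap.add(new int[]{run.get(next), e[1], next});
                  }
              }
              return out;
          }
      }
      ```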

      Once partitioning is done on the mapper output, a sort is applied to each partition based on the key. (Consider IntWritable as the key type: it has a compareTo() method that decides whether keys sort in ascending or descending order.) This is the default behavior: sorting is done on the key within each partition.
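
      The partition-then-sort step can be illustrated in plain Java (all names here are mine; real Hadoop uses a Partitioner and its shuffle machinery):

      ```java
      import java.util.ArrayList;
      import java.util.List;
      import java.util.Map;
      import java.util.TreeMap;

      // Sketch: records are split into partitions by hash(key) mod the
      // number of reducers, then each partition is kept sorted by key.
      // A TreeMap stands in for the per-partition sort.
      class PartitionSort {
          static List<TreeMap<Integer, String>> partitionAndSort(
                  Map<Integer, String> records, int numPartitions) {
              List<TreeMap<Integer, String>> parts = new ArrayList<>();
              for (int i = 0; i < numPartitions; i++) {
                  parts.add(new TreeMap<>());
              }
              for (Map.Entry<Integer, String> e : records.entrySet()) {
                  int p = Math.floorMod(e.getKey().hashCode(), numPartitions);
                  parts.get(p).put(e.getKey(), e.getValue()); // sorted by key
              }
              return parts;
          }
      }
      ```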

      Sometimes the mapper emits a composite key. In that case, you have to decide whether to go with the default natural sort order or a customized sort order. A composite key means you have developed a custom key class. In that class you override the compareTo() method, and the logic you implement there decides whether keys are sorted in ascending or descending order (or by absolute value, ascending or descending). The framework will then sort on the complete composite key.

      Before moving on to customized sorting, I would like to talk about the sorting facilities in the Hadoop API.

      The Hadoop framework provides two ways of comparing data for sorting. In general, to store data on disk or in a database, or to send it across the network, the sender uses serialization to convert the data into a stream of bytes. When the receiving node gets the stream of bytes, it can either apply sorting directly on the bytes, or deserialize the stream back into objects and sort those. In summary, you can sort either at the byte level or at the object level.
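
      To see why byte-level sorting can work at all: a fixed-width big-endian encoding of non-negative numbers compares the same way as the numbers themselves when the bytes are compared as unsigned values. A plain-Java sketch of this (names are mine, not the Hadoop API, though Hadoop's WritableComparator.compareBytes performs the same kind of unsigned lexicographic comparison):

      ```java
      import java.nio.ByteBuffer;

      // Sketch: for non-negative longs serialized big-endian, comparing
      // the raw bytes lexicographically (as unsigned values) gives the
      // same ordering as deserializing and comparing the numbers --
      // the idea behind Hadoop's raw (byte-level) comparators.
      class RawCompareDemo {
          static byte[] serialize(long v) {
              return ByteBuffer.allocate(8).putLong(v).array();
          }

          // Unsigned lexicographic byte comparison
          static int compareBytes(byte[] a, byte[] b) {
              for (int i = 0; i < a.length && i < b.length; i++) {
                  int x = a[i] & 0xff, y = b[i] & 0xff;
                  if (x != y) return x < y ? -1 : 1;
              }
              return Integer.compare(a.length, b.length);
          }

          // Object-level comparison, for contrast
          static int compareObjects(long a, long b) {
              return Long.compare(a, b);
          }
      }
      ```

      The byte-level path avoids constructing any objects, which is exactly the cost the next paragraph is concerned with.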

      In Java, we do not have the flexibility to sort serialized data. A predefined class implements the Comparable interface, whose compareTo() method accepts an object; for customized sorting we implement the Comparator interface, which has two methods, compare(_, _) and equals(), and compare() likewise accepts objects. So in either case we are comparing data at the object level, not at the byte level. In Hadoop, however, we do have the flexibility to sort serialized data directly, and as we all know, creating an object is a costly operation.

      You need to consider a few cases when you want to perform customized sorting.
      Let's take an example: the custom key class is LongPair, with two instance variables of primitive type. In this case, LongPair needs to implement the WritableComparable interface. By default the framework deserializes the keys and calls compareTo() on them; to compare at the byte level instead, you register a RawComparator for the class. In the LongPair class, the compareTo() method looks like the code below. The class below is a sketch; in practice you would add a few more methods (constructors, setters, and so on).

      import java.io.DataInput;
      import java.io.DataOutput;
      import java.io.IOException;
      import org.apache.hadoop.io.WritableComparable;

      class LongPair implements WritableComparable&lt;LongPair&gt; {
          private long first;
          private long second;

          public long getFirst() { return first; }
          public long getSecond() { return second; }

          // Serialization methods required by the Writable contract
          public void write(DataOutput out) throws IOException {
              out.writeLong(first);
              out.writeLong(second);
          }

          public void readFields(DataInput in) throws IOException {
              first = in.readLong();
              second = in.readLong();
          }

          public int compareTo(LongPair o) {
              int cmp = compare(first, o.first);
              if (cmp == 0)
                  cmp = compare(second, o.second);
              return cmp;
          }

          public static int compare(long a, long b) {
              return (a < b ? -1 : (a == b ? 0 : 1));
          }
      }
      The above shows the default natural sort order.

      In the above case, if the "first" variables of both keys are equal, we then compare the "second" variables. In both comparisons we sort in ascending order. But suppose I want the data in descending order on the second variable when the "first" variables are equal. Negating the comparison (adding a minus) on the second variable solves this. To do it, we develop a new class that extends WritableComparator and overrides its compare() method. WritableComparator is a class that implements the RawComparator interface, which in turn extends the Comparator interface.

      public class LongPairSortingComparator extends WritableComparator {

          protected LongPairSortingComparator() {
              // 'true' tells WritableComparator to create key instances,
              // so compare() receives deserialized LongPair objects
              super(LongPair.class, true);
          }

          @Override
          public int compare(WritableComparable a, WritableComparable b) {
              LongPair lp1 = (LongPair) a;
              LongPair lp2 = (LongPair) b;

              int cmp = LongPair.compare(lp1.getFirst(), lp2.getFirst());
              if (cmp == 0) {
                  // minus flips the second field to descending order
                  cmp = -LongPair.compare(lp1.getSecond(), lp2.getSecond());
              }
              return cmp;
          }
      }
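
      The same negation trick can be tried in plain Java with no Hadoop dependency. This sketch (names are mine) sorts pairs of longs ascending on the first field and descending on the second when the first fields tie:

      ```java
      import java.util.List;

      // Sketch of the negation trick: ascending on pair[0],
      // descending on pair[1] when pair[0] values are equal.
      class SecondaryOrderDemo {
          static void sortPairs(List<long[]> pairs) {
              pairs.sort((p, q) -> {
                  int cmp = Long.compare(p[0], q[0]);
                  if (cmp == 0) {
                      cmp = -Long.compare(p[1], q[1]); // minus flips to descending
                  }
                  return cmp;
              });
          }
      }
      ```
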
      So far we have only seen the case where the custom key contains primitive variables.

      A custom key can also contain object-type fields (Text or IntWritable instance variables, for example). In that case, to perform custom sorting and still get the benefit of byte-level comparison, you need to implement the RawComparator interface.
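
      As a plain-Java sketch of the idea for text fields (not Hadoop's Text.Comparator, though it rests on the same property): UTF-8 encodings compared as unsigned bytes sort in code-point order, so string keys can be ordered without deserializing them back into strings.

      ```java
      import java.nio.charset.StandardCharsets;

      // Sketch: compare the serialized (UTF-8) form of two strings
      // directly, byte by byte, treating bytes as unsigned values.
      // For ASCII/code-point order this matches comparing the strings.
      class RawTextCompare {
          static int compareRaw(String a, String b) {
              byte[] x = a.getBytes(StandardCharsets.UTF_8);
              byte[] y = b.getBytes(StandardCharsets.UTF_8);
              for (int i = 0; i < x.length && i < y.length; i++) {
                  int d = (x[i] & 0xff) - (y[i] & 0xff);
                  if (d != 0) return d;
              }
              return x.length - y.length; // shorter prefix sorts first
          }
      }
      ```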

      You can perform sorting on values as well, but the value class must then implement the WritableComparable interface. If it does not, design a new class that extends WritableComparator or implements the RawComparator interface.

      There are many more ways of performing sorting. You may have heard terms like secondary sort, total order sort, local sort, etc. These are all techniques for sorting mapper and reducer data.
