What is identity mapper and reducer?

Viewing 2 reply threads
  • Author
    • #5207
      DataFlair Team

      What is identity mapper and reducer?
      In which cases can we use them, please explain with an example..

    • #5208
      DataFlair Team

      Identity Mapper and Identity Reducer have no processing logic of their own; they simply pass key-value pairs through unchanged. They live in the org.apache.hadoop.mapred.lib package.

      Identity Mapper and Reducer are just like the identity function in mathematics, i.e. they do not transform the input and return it as-is in the output.

      1) Identity Mapper takes the input key/value pair and writes it out without any processing.

      2) Identity Reducer is a bit different. It does not mean the reduce step will not take place; it will, and the related sorting/shuffling will also be performed, but there will be no aggregation. So you can use the Identity Reducer if you want to sort the data coming from the map phase but don't care about any grouping.
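      As a sketch, a driver for such a sort-only job could wire up the identity classes explicitly (old mapred API; the class name, key/value types, and paths here are illustrative assumptions, not from the thread):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

// Sort-only job: identity mapper and identity reducer, so the output is
// just the input records, sorted by key during the shuffle phase.
public class SortOnlyJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SortOnlyJob.class);
    conf.setJobName("sort-only");
    conf.setMapperClass(IdentityMapper.class);   // pass records through
    conf.setReducerClass(IdentityReducer.class); // no aggregation
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}
```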

      Hope u get it! 🙂

    • #5210
      DataFlair Team

      Hadoop provides a few predefined Mapper and Reducer classes. These classes are useful for default MapReduce jobs. Among them, the pass-through ones are known as the Identity Mapper and Identity Reducer classes. These classes are available in the org.apache.hadoop.mapred.lib package.

      Along with these classes, Hadoop provides some more predefined classes.

      List of predefined Mapper classes:
      1) Identity Mapper
      2) Inverse Mapper
      3) Token Counter
      4) Regex Mapper
      5) Chain Mapper

      List of predefined Reducer classes:
      1) Identity Reducer
      2) IntSum Reducer
      3) LongSum Reducer
      4) Chain Reducer

      Identity Mapper is the default mapper class provided by Hadoop. It is a generic class and can be used with any key-value pair data types. When you submit an MR job, this class is invoked automatically if no mapper class is specified in the MR driver class. Below is the code from the IdentityMapper class.

      public class IdentityMapper<K, V>
          extends MapReduceBase implements Mapper<K, V, K, V> {

        /** The identity function. Input key/value pair is written directly to
         *  output. */
        public void map(K key, V val,
                        OutputCollector<K, V> output, Reporter reporter)
            throws IOException {
          output.collect(key, val);
        }
      }

      Looking at the implementation of the map method, it simply writes the key/value pairs into the OutputCollector, i.e. the intermediate output, which each mapper buffers in a circular buffer. Here the key is the byte offset (cursor position) and the value is the complete line. The default InputFormat is TextInputFormat, and its default RecordReader implementation is the LineRecordReader class. LineRecordReader reads data from the logical input split; whenever it encounters \n (a new line), it stops reading, and whatever it has read before the \n is treated as the value. This key and value are given as input to the map method of the IdentityMapper class, which writes them unchanged to the output buffer. Once the map call completes, the same procedure is repeated, starting from the next byte offset.
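      As an illustration of those (byte offset, line) pairs, here is a small plain-Java simulation (no Hadoop dependency; the class and method names are made up for this sketch, and it assumes ASCII input so character count equals byte count):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simulates the (byte offset, line) records that TextInputFormat /
// LineRecordReader would hand to the identity mapper for a tiny "file".
public class OffsetDemo {
  static Map<Long, String> recordsOf(String file) {
    Map<Long, String> records = new LinkedHashMap<>();
    long offset = 0;
    for (String line : file.split("\n")) {
      records.put(offset, line);          // key = byte offset, value = line
      offset += line.length() + 1;        // +1 for the '\n' terminator
    }
    return records;
  }

  public static void main(String[] args) {
    String file = "alpha\nbeta\ngamma\n";
    recordsOf(file).forEach((k, v) -> System.out.println(k + "\t" + v));
    // prints: 0  alpha / 6  beta / 11  gamma
  }
}
```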

      The IdentityMapper class is defined in the old MR API. The map input and output key/value data types must be the same; looking at the map implementation above, you can see why, since the input pair is emitted unchanged.

      Identity Reducer is the default reducer class provided by Hadoop. When you submit an MR job, this class is invoked automatically if no reducer class is specified in the MR driver class. Below is the code from the IdentityReducer class.

      public class IdentityReducer<K, V>
          extends MapReduceBase implements Reducer<K, V, K, V> {

        /** Writes all keys and values directly to output. */
        public void reduce(K key, Iterator<V> values,
                           OutputCollector<K, V> output, Reporter reporter)
            throws IOException {
          while (values.hasNext()) {
            output.collect(key, values.next());
          }
        }
      }

      Looking at the implementation of the reduce method above, it simply writes all of its input key/value pairs to the OutputCollector (and hence, ultimately, to HDFS). The IdentityReducer class is also defined in the old MR API. It performs no processing on the data at all.
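      A minimal, Hadoop-free sketch of this pass-through reduce logic (the class and method names here are illustrative, not Hadoop's):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Mimics IdentityReducer.reduce: every value in the incoming group is
// emitted unchanged under its key, so no aggregation happens.
public class IdentityReduceDemo {
  static <K, V> List<Map.Entry<K, V>> reduce(K key, Iterator<V> values) {
    List<Map.Entry<K, V>> out = new ArrayList<>();
    while (values.hasNext()) {
      out.add(new SimpleEntry<>(key, values.next())); // pass through
    }
    return out;
  }

  public static void main(String[] args) {
    // One reduce group for key "emp1": all three values survive as-is.
    reduce("emp1", List.of("10", "20", "30").iterator())
        .forEach(e -> System.out.println(e.getKey() + "\t" + e.getValue()));
  }
}
```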

      Here is the sample:

      package com.dataflair.hr.kpi1;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
      import org.apache.hadoop.util.GenericOptionsParser;

      public class IdentityMapReduce {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          Job job = new Job(conf, "Identity Map and Reduce Job-1");
          job.setJarByClass(IdentityMapReduce.class);
          String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
          if (otherArgs.length != 2) {
            System.err.println("Identity Job: Usage <in> <out>");
            System.exit(2);
          }
          // No mapper or reducer class is set, so the identity defaults apply.
          FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
          FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
      }

      To run MR Job:
      yarn jar IdentityJobJar.jar com.dataflair.hr.kpi1.IdentityMapReduce emp-ctc emp-ctc-out
