Live instructor-led & Self-paced Online Certification Training Courses (Big Data, Hadoop, Spark) › Forums › Hadoop › sequence of execution of map, reduce, recordreader, split, combiner, partitioner
September 20, 2018 at 11:57 am #4720
Please explain the sequence of execution of all the components of MapReduce like: map, reduce, recordreader, split, combiner, partitioner, sort, shuffle.
Explain the complete flow of MapReduce.September 20, 2018 at 11:58 am #4721
The very first starting point in the execution of the MapReduce is InputFormat.
InputFormat defines how the data is split up logically. The default input format is TextInputFormat. There are other different InputFormat types as well depending upon the requirements e.g. FileInputFormat, KeyValueInputFormat, NLineInputFormat and TextInputFormat.
Once InputFormat is defined then a logical unit of data called as InputSplit in the context of MR (MapReduce. Equivalent of Split in HDFS is block) is fed to next component which is RecordReader (RR).
One Split/block is processed by one mapper. The RR component reads records from Split and feeds the record to the map function of mapper in the form of Key/Value pair. Each time a record is read, each time it is passed to map function of the mapper e.g. if the split is having 2000 records then map() will be called 2000 times.
After the record has been fed to map() function then your business logic defined in the map() function gets executed and the output of map() function (in the form of Key/Values) is passed to next component called as Partitioner. The output of the Mapper (if reducers are set) is temporary and is NOT written to HDFS. Before the intermediate output is passed to Partitioner, an optional component is executed Combiner (for optimization, if set). The idea behind Combiner is to reduce the data that need to be moved over the network and pre-process the data before it goes to Reducer. Generally, a combiner & reducer is having similar logic.
Partitioner is the most important component in the whole flow. The Partitioner is the component which decides where a given Key/Value pair will go to (which reducer). Before passing on Mapper’s output to Reducer, framework sorts the output based on Key.
Then intermediate output is shuffled to Reducers. Shuffling is the actual movement of the data over the network. How does Partitioner decide which Key/Value to send to which Reducer – the decision is taken based on the hash value of the key. The idea of using this hashing technique is purely for purpose of equal load distribution point of view e.g. hashcode 1-100 will go to redcuer1, 101 to 200 will go to reducer2 and so on.
Point to remember is that shuffling happens as soon as first mapper finishes. Once the reducer task receives the input (output of mapper) from all the mappers then the Reducer executes the custom business logic defined in the reduce function of Reducer. Normally, the aggregation (e.g. summation, max, min etc) related functions are performed in Reducers.
Sometimes, we want to sort the data by value than by keys. This sorting on reducer is achieved by a technique called as Secondary Sorting at Reducer. The final output (output of reducer) is collected by Output Format & RecordWriter.
The RecordWriter combine the output and produces the final output in the form of Key/Value pair. The final output format comes in the form of OutputFormat. This final output is written to HDFS and hence gets replicated based on replication factor. One must remember that it is only Reducer’s output that gets written to HDFS and not of Mapper. The output of mapper is thrown away after shuffling.
Below is the sequence of execution of different components: –
InputFormat –> Split –> RecordReader –> Mapper –> Combiner (Optional) –> Partitioner –> Sort and shuffle –> Reducers –> RecordWriter –> OutputFormat –> Written to HDFS
For more details, please refer: Hadoop MapReduce Data Flow
You must be logged in to reply to this topic.