MapReduce is the heart of Hadoop: it is the processing layer and the programming paradigm that allows massive scalability across hundreds or thousands of servers in a Hadoop cluster.
MapReduce is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent chunks. The developer writes only the business logic; the framework takes care of the rest. The problem is divided into a large number of smaller problems, each of which is processed independently to produce an individual output, and these individual outputs are then combined into the final output.
The processing happens in two phases: the Mapper and the Reducer.
1) The Mapper processes the input data, which is a file or directory residing in HDFS. The client writes the MapReduce program and submits the job along with the input data. The input file is passed to the mapper line by line; the mapper processes each line and produces what is called the intermediate output. This map output is stored on the local disk, from where it is shuffled to the reducer nodes. The number of map tasks is usually driven by the total size of the input, that is, the total number of blocks of the input files.
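To make the mapper's role concrete, here is a minimal sketch in Python (Hadoop's real API is Java; this stand-in only illustrates the idea, and the word-count logic is a hypothetical example, not part of the text above). The framework feeds the mapper one line of the input split at a time, and the mapper emits intermediate key/value pairs:

```python
def word_count_mapper(line):
    """Business logic for one input line: emit (word, 1) for every word."""
    for word in line.split():
        yield (word, 1)

# The framework passes each line of the input split to the mapper
# independently; the collected pairs form the intermediate output.
input_split = ["hadoop stores data", "hadoop processes data"]
intermediate = [pair for line in input_split
                for pair in word_count_mapper(line)]
# intermediate holds pairs such as ("hadoop", 1) and ("data", 1).
```

In a real job this intermediate output would be written to the local disk of the map node rather than kept in a Python list.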
2) The Reducer takes the intermediate key/value pairs produced by the mappers. It has three primary phases: shuffle, sort and reduce.
Shuffle – The input to the reducer is the sorted output of the mappers. In this phase, the framework fetches all mapper output over the network.
Sort – The framework groups the reducer input by key.
Reduce – This is the second point at which the client writes business logic (the first being the map function). The output of the reducer is the final output, which is written to HDFS.