MapReduce – Introduction to Hadoop MapReduce for Beginners


1. Objective

This MapReduce tutorial describes the concepts of Hadoop MapReduce in detail. In this tutorial, we will understand what the Mapper, the Reducer, shuffling, and sorting are. This comprehensive guide to MapReduce also covers the internals of MapReduce, its data flow, its architecture, and data locality.


2. Hadoop MapReduce Tutorial

In this Hadoop MapReduce tutorial, we will discuss what MapReduce is, how it divides work into sub-tasks, and why MapReduce is one of the best paradigms for processing data.

MapReduce is the processing layer of Hadoop. It is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks. You only need to express your business logic in the way MapReduce works; the framework takes care of the rest. The work (the complete job) that the user submits to the master is divided into smaller pieces of work (tasks) and assigned to the slaves.

MapReduce programs are written in a particular style influenced by functional programming constructs, specifically idioms for processing lists of data. In MapReduce, the input is a list, and the program converts it into an output that is again a list. MapReduce is the heart of Hadoop: Hadoop is powerful and efficient largely because MapReduce processes data in parallel.
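
The list-in, list-out style described above can be illustrated with Python's built-in `map` and `functools.reduce`. This is only an analogy in plain Python, not Hadoop code; the sample data and variable names are made up for illustration.

```python
from functools import reduce

# Hypothetical input list: one record per element.
records = ["3", "5", "8"]

# "Map": transform every element independently, producing a new list.
mapped = list(map(int, records))

# "Reduce": fold the mapped list into a single aggregated result.
total = reduce(lambda acc, value: acc + value, mapped, 0)

print(mapped, total)
```

Because each element is transformed independently in the map step, that step is the part that can be spread across many machines.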


2.1. MapReduce – High-level Understanding

Let’s understand the basics of MapReduce at a high level: what it looks like, and what it does, why, and how.

Map-Reduce divides the work into small parts, each of which can be done in parallel on a cluster of servers. A problem is divided into a large number of smaller problems, each of which is processed to give an individual output. These individual outputs are further processed to give the final output.

Hadoop Map-Reduce is scalable and can be used across many computers. Many small machines can together process jobs that could not be processed by a single large machine.

2.2. MapReduce Terminologies

Let’s now understand the different terminologies and concepts of MapReduce: what Map is, what Reduce is, and what a job, a task, and a task attempt are.

Map-Reduce is the data processing component of Hadoop. Map-Reduce programs transform lists of input data elements into lists of output data elements. A Map-Reduce program will do this twice, using two different list processing idioms:

  • Map
  • Reduce

In between Map and Reduce, there is a small phase called Shuffle and Sort.

Let’s understand basic terminologies used in Map Reduce.

  • Job – A “full program”: an execution of a Mapper and a Reducer across a data set, that is, an execution of the 2 processing layers, mapper and reducer. A MapReduce job is the work that the client wants performed. It consists of the input data, the MapReduce program, and configuration information. So the client needs to submit the input data, write the MapReduce program, and set the configuration information (some of it is provided during Hadoop setup in the configuration files, and some is specified in the program itself and is specific to that MapReduce job).
  • Task – An execution of a Mapper or a Reducer on a slice of data. It is also called Task-In-Progress (TIP), meaning that the processing of data is in progress on either a mapper or a reducer.
  • Task Attempt – A particular instance of an attempt to execute a task on a node. Any machine can go down at any time. For example, if a node goes down while processing data, the framework reschedules the task on some other node. This rescheduling cannot go on indefinitely; there is an upper limit. The default maximum number of attempts for a task is 4: if a task (mapper or reducer) fails 4 times, the job is considered failed. For a high-priority or very large job, this limit can be increased.
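
The retry behaviour behind task attempts can be sketched in plain Python. This is a simplified illustration, not Hadoop's scheduler: `run_task`, `flaky_task`, `always_failing_task`, and the node names are hypothetical, and real rescheduling does not simply cycle through nodes in order.

```python
MAX_ATTEMPTS = 4  # default upper limit described above


def run_task(task, nodes, max_attempts=MAX_ATTEMPTS):
    """Try the task on successive nodes; give up after max_attempts failures."""
    for attempt in range(1, max_attempts + 1):
        # Assumption of this sketch: pick the next node in round-robin order.
        node = nodes[(attempt - 1) % len(nodes)]
        try:
            result = task(node)
        except RuntimeError:
            continue  # this attempt failed; reschedule on another node
        return {"status": "SUCCEEDED", "attempts": attempt, "node": node, "result": result}
    # Every attempt failed, so the whole job is treated as failed.
    return {"status": "FAILED", "attempts": max_attempts, "node": None, "result": None}


def flaky_task(node):
    # Hypothetical task that only succeeds on node "n3".
    if node != "n3":
        raise RuntimeError(f"node {node} went down")
    return "done"


def always_failing_task(node):
    raise RuntimeError(f"task failed on {node}")


ok = run_task(flaky_task, ["n1", "n2", "n3"])
failed = run_task(always_failing_task, ["n1", "n2", "n3"])
print(ok["status"], ok["attempts"])
print(failed["status"], failed["attempts"])
```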


2.3. Map Abstraction

Let us understand the abstract form of Map, the first phase of the MapReduce paradigm: what a map/mapper is, what its input is, how it processes the data, and what its output is.

The map takes a key/value pair as input. Whether the data is in a structured or unstructured format, the framework converts the incoming data into keys and values.

  • Key is a reference to the input value.
  • Value is the data set on which to operate.

Map Processing:

  • A function defined by the user – the user can write custom business logic to process the data as needed.
  • It is applied to every value in the input.

Map produces a new list of key/value pairs:

  • The output of Map is called the intermediate output.
  • It can be of a different type from the input pair.
  • The output of Map is stored on the local disk, from where it is shuffled to the reduce nodes.
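
As a rough sketch of the map abstraction, the following Python function takes one key/value pair and emits a list of intermediate pairs. It is plain Python rather than the Hadoop API; `word_count_map` and the sample lines are hypothetical. The key here is the byte offset of the line, which is how line-oriented text input is commonly keyed.

```python
def word_count_map(key, value):
    """key: byte offset of the line (a reference to the value).
    value: the text of the line to operate on."""
    for word in value.split():
        # The output pair (str, int) has a different type from the input pair (int, str).
        yield (word.lower(), 1)


lines = ["Hadoop is fast", "Hadoop is scalable"]

# Build (offset, line) input pairs, as the framework would before calling the mapper.
records = []
offset = 0
for line in lines:
    records.append((offset, line))
    offset += len(line) + 1  # +1 accounts for the newline character

# Apply the user-defined function to every input pair.
intermediate = [pair for key, value in records for pair in word_count_map(key, value)]
print(intermediate)
```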

2.4. Reduce Abstraction

Now let’s discuss the second phase of MapReduce, the Reducer: what its input is, what work it does, and where it writes its output.

Reduce takes intermediate key/value pairs as input and processes the output of the mapper. Usually, the reducer performs aggregation or summation-style computation.

  • The input given to the reducer is generated by Map (the intermediate output).
  • The key/value pairs provided to Reduce are sorted by key.

Reduce processing:

  • A function defined by the user – here too, the user can write custom business logic to produce the final output.
  • An iterator supplies the values for a given key to the Reduce function.

Reduce produces a final list of key/value pairs:

  • The output of Reduce is called the final output.
  • It can be of a different type from the input pair.
  • The output of Reduce is stored in HDFS.
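
The reduce abstraction can be sketched in the same spirit. The example below is plain Python, not the Hadoop API: `sum_reduce` and the sample pairs are hypothetical, and `sorted` plus `itertools.groupby` stand in for the sorting and grouping that the framework performs before calling the reducer.

```python
from itertools import groupby
from operator import itemgetter


def sum_reduce(key, values):
    """values is an iterator over all intermediate values for this key."""
    yield (key, sum(values))


# Hypothetical intermediate output collected from the mappers.
intermediate = [("is", 1), ("hadoop", 1), ("fast", 1), ("hadoop", 1), ("is", 1)]

# The pairs reach the reducer sorted by key.
sorted_pairs = sorted(intermediate, key=itemgetter(0))

final_output = []
for key, group in groupby(sorted_pairs, key=itemgetter(0)):
    values = (value for _, value in group)  # iterator of values for one key
    final_output.extend(sum_reduce(key, values))

print(final_output)
```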

2.5. How Map and Reduce work Together?

Let us understand how Map and Reduce work together in Hadoop.


The input data given to the mapper is processed by the user-defined function written at the mapper. All the complex business logic is implemented at the mapper level, so that the heavy processing is done by the mappers in parallel, since the number of mappers is much larger than the number of reducers. The mapper generates an output, the intermediate data, which becomes the input to the reducer.

This intermediate result is then processed by the user-defined function written at the reducer, and the final output is generated. Usually, only light processing is done in the reducer. The final output is stored in HDFS and replicated as usual.
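
To see the two phases working together, here is a small single-process sketch in Python that finds the maximum temperature per year. It is an illustration under stated assumptions, not Hadoop code: `parse_map`, `max_reduce`, `run_job`, and the `year,temperature` record format are all hypothetical. The mapper does the heavier parsing and validation, while the reducer only picks a maximum.

```python
from collections import defaultdict


def parse_map(_key, line):
    """Heavier work: parse and validate a 'year,temperature' record."""
    parts = line.split(",")
    if len(parts) != 2:
        return  # skip malformed records
    try:
        temperature = int(parts[1].strip())
    except ValueError:
        return  # skip records with a non-numeric temperature
    yield (parts[0].strip(), temperature)


def max_reduce(year, temperatures):
    """Light work: aggregate the values that share a key."""
    yield (year, max(temperatures))


def run_job(lines, mapper, reducer):
    # Map phase: call the mapper once per input record.
    groups = defaultdict(list)
    for offset, line in enumerate(lines):
        for key, value in mapper(offset, line):
            groups[key].append(value)  # stands in for shuffle: group values by key
    # Reduce phase: call the reducer once per key, in sorted key order.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, iter(groups[key])))
    return output


lines = ["1949,111", "1950,22", "1949,78", "bad record", "1950,-11"]
result = run_job(lines, parse_map, max_reduce)
print(result)
```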

2.6. MapReduce DataFlow

Now let’s understand the complete end-to-end data flow of Hadoop MapReduce: how input is given to the mapper, how mappers process data, where mappers write their data, how data is shuffled from mapper nodes to reducer nodes, where reducers run, and what type of processing should be done in the reducers.

Apache Hadoop MapReduce data flow process.

As seen in the diagram, each square block is a slave. There are 3 slaves in the figure. Mappers run on all 3 slaves, and then a reducer runs on any 1 of the slaves. For simplicity, the figure shows the reducer on a different machine, but it actually runs on one of the mapper nodes.

Let us now discuss the map phase:

The input to a mapper is 1 block at a time (split = block by default).
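
As a back-of-the-envelope illustration of "split = block", the sketch below cuts a file into block-sized byte ranges, one per mapper. The 128 MB block size, the 300 MB file, and the `split_ranges` helper are assumptions made for this example; Hadoop's actual split computation has more rules than this.

```python
def split_ranges(file_size, block_size):
    """Return (start, length) byte ranges, one per block-sized split."""
    return [
        (start, min(block_size, file_size - start))
        for start in range(0, file_size, block_size)
    ]


MB = 1024 * 1024
BLOCK_SIZE = 128 * MB  # assumed block size for this sketch

splits = split_ranges(300 * MB, BLOCK_SIZE)
for start, length in splits:
    print(f"split at {start // MB} MB, length {length // MB} MB")
print("mappers needed:", len(splits))
```

Under these assumptions, a 300 MB file yields two full splits and one shorter final split, so three mappers would each process one of them.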

The output of a mapper is written to the local disk of the machine on which the mapper is running. Once the map finishes, this intermediate output travels to the reducer nodes (the nodes where the reducers will run).

Reducer is the second phase of processing, where the user can again write custom business logic. The output of the reducer is the final output, which is written to HDFS.

By default, 2 mappers run at a time on a slave; this number can be increased as required. The right value depends on factors such as datanode hardware, block size, and machine configuration. We should not increase the number of mappers beyond a certain limit, because doing so will decrease performance.

The mapper writes its output to the local disk of the machine on which it is working. This is temporary data, also called intermediate output. All mappers write their output to local disk. As soon as the first mapper finishes, its output starts traveling from the mapper node to the reducer node. This movement of output from the mapper nodes to the reducer nodes is called shuffle.

The reducer is also deployed on one of the datanodes. The output from all the mappers goes to the reducer, where the outputs from the different mappers are merged to form the reducer’s input. This input is also kept on local disk. The reducer is another processor where you can write custom business logic; it is the second stage of processing. Usually, the reducer contains aggregation, summation, and similar functionality. The reducer produces the final output, which it writes to HDFS.

Map and reduce are the stages of processing, and they run one after the other. The reducer starts processing only after all mappers have completed their processing.

Although 1 block is present at 3 different locations by default, the framework allows only 1 mapper to process each block. So only 1 mapper processes a particular block, using 1 of its 3 replicas. The output of every mapper goes to every reducer in the cluster, i.e., every reducer receives input from all the mappers. For this reason, the framework signals the reducer once all the data has been processed by the mappers, and only then can the reducer process it.

The output from a mapper is divided into partitions by the partitioner. Each partition goes to one reducer, chosen based on some condition, typically a function of the key. Hadoop works on the key/value principle, i.e., the mapper and the reducer receive their input as keys and values and write their output in the same form.
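
A minimal sketch of hash-based partitioning follows, assuming the partition is chosen as a stable hash of the key modulo the number of reducers. The functions `partition` and `partition_output` are hypothetical, and `zlib.crc32` is only a stand-in for a deterministic hash; it is not what Hadoop itself uses.

```python
import zlib


def partition(key, num_reducers):
    # Assumption of this sketch: crc32 serves as a stable hash of the key.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition_output(pairs, num_reducers):
    """Distribute one mapper's output pairs into one bucket per reducer."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partition(key, num_reducers)].append((key, value))
    return partitions


mapper_output = [("hadoop", 1), ("is", 1), ("fast", 1), ("hadoop", 1)]
partitions = partition_output(mapper_output, 2)
for index, bucket in enumerate(partitions):
    print("reducer", index, "receives", bucket)
```

The property that matters is that all pairs with the same key land in the same partition, so a single reducer sees every value for that key.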

2.7. Data Locality

Let’s understand what data locality is and how it optimizes MapReduce jobs and improves job performance.

“Move computation close to the data rather than data to computation”. A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the data is very large. It minimizes network congestion and increases the throughput of the system. The assumption is that it is often better to move the computation closer to where the data is located than to move the data to where the application is running. Hence, HDFS provides interfaces for applications to move themselves closer to where the data is located.

Hadoop works on huge volumes of data, and it is not practical to move such volumes over the network. Hence, it follows the principle of moving the algorithm to the data rather than the data to the algorithm. This is called data locality.
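
The idea can be sketched as a simple placement rule: prefer a node that already holds a replica of the block, and fall back to any free node only when none of those is available. This is a simplified illustration, not Hadoop's scheduler; `pick_node`, the node names, and the slot counts are hypothetical, and real schedulers weigh more factors than this.

```python
def pick_node(block_replicas, free_slots):
    """Return (node, is_data_local) for the mapper that will process a block."""
    # First choice: a node that stores a replica and has a free slot.
    for node in block_replicas:
        if free_slots.get(node, 0) > 0:
            return node, True
    # Fallback: any node with a free slot; the block must then cross the network.
    for node, slots in free_slots.items():
        if slots > 0:
            return node, False
    return None, False


replicas = ["node1", "node3", "node4"]  # the 3 replicas of one block
free_slots = {"node1": 0, "node2": 2, "node3": 1, "node4": 0}

node, is_local = pick_node(replicas, free_slots)
print(node, is_local)

# If no replica holder is free, the task runs elsewhere and loses locality.
busy_slots = {"node1": 0, "node2": 2, "node3": 0, "node4": 0}
fallback_node, fallback_local = pick_node(replicas, busy_slots)
print(fallback_node, fallback_local)
```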

3. Conclusion

Hence, MapReduce powers the data processing of Hadoop. Because it works on the concept of data locality, it improves performance. In the next MapReduce tutorial, we will learn about the shuffling and sorting phase in detail.
