This topic contains 1 reply, has 1 voice, and was last updated by  dfbdteam3 1 year, 8 months ago.

Viewing 2 posts - 1 through 2 (of 2 total)


    What is the purpose of Mapper in Hadoop?
    How does Mapper work in Hadoop MapReduce?



    MapReduce is a programming model for processing large volumes of data in parallel by dividing the workload across a large number of machines (nodes). The basic idea is to divide a task into subtasks, handle the subtasks in parallel, and combine their results to form the final output.
    MapReduce consists of two key functions: Mapper and Reducer.

    The Mapper is a function that processes the input data and transforms it into intermediate (key, value) pairs. The input to the mapper function is always in the form of (key, value) pairs, even though the input to a MapReduce program is a file or directory stored in HDFS.
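
    To make the "file in, (key, value) pairs out" idea concrete, here is a minimal sketch in plain Python. It is not Hadoop code: `line_records` is a hypothetical helper that only imitates how a line-oriented reader could turn raw file bytes into (byte offset, line) records for a mapper.

```python
def line_records(data: bytes):
    """Yield (byte_offset, line_text) pairs, one per line of the input.

    Hypothetical helper for illustration; it imitates a line-oriented
    record reader and is not part of Hadoop.
    """
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        # The key is the byte offset at which the line starts,
        # the value is the line without its trailing newline.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)


sample = b"hello world\nhello hadoop\n"
records = list(line_records(sample))
print(records)
```

    Each record produced this way is what a mapper would receive as one (key, value) input.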

    Working of Mapper in MapReduce:

    1. The input data is passed to the Mapper through an InputFormat, which is specified in the driver code. The InputFormat defines the location of the input data (a file or directory on HDFS) and determines how to divide it into input splits.
    2. Each Mapper processes a single input split. A RecordReader, which is supplied by the InputFormat, extracts (key, value) records from the split.
    3. The Mapper processes the input (key, value) pairs and produces output that is also in the form of (key, value) pairs. The output from the Mapper is called the intermediate output.
    4. The Mapper may use or completely ignore the input key. For example, a standard pattern is to read a file one line at a time. The key is the byte offset into the file at which the line starts. The value is the contents of the line itself. Typically the key is considered irrelevant. If the Mapper writes anything out, the output must be in the form of key/value pairs.
    5. The output from the Mapper is shuffled and sorted, and each intermediate key with its list of values is passed to the Reducer in sorted key order.
    6. The Reducer outputs zero or more final key/value pairs, which are written to HDFS. The Reducer usually emits a single key/value pair for each input key.
    7. If a Mapper appears to be running more slowly than the others, a new instance of the Mapper will be started on another machine, operating on the same data (this is known as speculative execution). The results of the first instance to finish will be used, and Hadoop will kill the instance that is still running.
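
    The flow in steps 3 to 6 can be sketched as a small single-process simulation. This is plain Python, not the Hadoop API; the word-count job and the names `word_count_mapper`, `sum_reducer` and `run_job` are assumptions chosen purely for illustration.

```python
from itertools import groupby
from operator import itemgetter


def word_count_mapper(key, value):
    """Emit (word, 1) for each word in the line; the byte-offset key is ignored."""
    for word in value.split():
        yield word.lower(), 1


def sum_reducer(key, values):
    """Emit a single (key, total) pair for each intermediate key."""
    yield key, sum(values)


def run_job(records, mapper, reducer):
    """Simulate map -> sort/group by key -> reduce in one process."""
    # Map phase: every input record may produce zero or more pairs.
    intermediate = [pair for k, v in records for pair in mapper(k, v)]
    # Sort so that equal keys are adjacent and keys arrive in order.
    intermediate.sort(key=itemgetter(0))
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        values = [v for _, v in group]
        output.extend(reducer(key, values))
    return output


# (byte offset, line) records, as a line-oriented reader would supply them.
records = [(0, "hello world"), (12, "hello hadoop")]
result = run_job(records, word_count_mapper, sum_reducer)
print(result)
```

    In a real cluster the map and reduce phases run on different machines and the sorting happens during the shuffle; the sketch only shows the order of the steps and the shape of the data at each stage.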

    The number of map tasks in a MapReduce program depends on the number of input splits, which by default corresponds to the number of data blocks of the input file. For example, if the block size is 128 MB and the input data is 1 GB, then there will be 8 map tasks (1024 MB / 128 MB). The number of map tasks grows with the size of the input data, which increases parallelism and helps keep processing time down as the data grows.
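
    The arithmetic in this example can be checked with a short calculation. This is a rough sketch that assumes one split per block and a single input file; `estimate_map_tasks` is a hypothetical helper, and the actual split computation in Hadoop may differ in edge cases.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def estimate_map_tasks(input_bytes: int, split_bytes: int) -> int:
    """Estimate map tasks as the number of splits needed to cover the input.

    Assumes one map task per split and one split per block (illustrative only).
    """
    if split_bytes <= 0:
        raise ValueError("split_bytes must be positive")
    # Integer ceiling division: a partial final block still needs its own split.
    return (input_bytes + split_bytes - 1) // split_bytes


tasks = estimate_map_tasks(1 * GB, 128 * MB)
print(tasks)
```

    With 1 GB of input and 128 MB blocks the estimate is 8, matching the figure above.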
