Why does the mapper write its output to local disk?

    • #5202
      DataFlair Team
      Spectator

      In the MapReduce data flow, why is the output of the mapper written to local disk? Explain the internals of the output write mechanism in the mapper phase — how is the mapper's output written to local disk?

    • #5204
      DataFlair Team
      Spectator

      The mapper takes input records generated by the RecordReader and produces key-value pairs. Mapper output is not simply written straight to the local disk; the process is more involved and takes advantage of buffering writes in memory and doing some presorting, for efficiency reasons.

      In Hadoop, each map task has a circular memory buffer to which the mapper writes its output. By default the buffer is 100 MB; its size can be changed with the io.sort.mb property. When the buffer contents reach a certain threshold (io.sort.spill.percent, default 0.80, i.e. 80%), a background thread starts to spill the contents to disk.
      Spills are written in round-robin fashion to the directories specified by the mapred.local.dir property.
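The buffer-and-spill behaviour described above can be sketched with a small simulation — a simplified model, not Hadoop's actual implementation (the real buffer keeps accepting writes into the remaining space while the spill thread drains it; here we assume the buffer empties completely on each spill). The function name and record-size inputs are hypothetical:

```python
def count_spills(record_sizes_mb, buffer_mb=100, spill_percent=0.80):
    """Count spill files produced as map output fills the in-memory buffer.

    buffer_mb      -- circular buffer size (io.sort.mb, default 100 MB)
    spill_percent  -- fill level that triggers a spill (io.sort.spill.percent)
    """
    threshold = buffer_mb * spill_percent   # 80 MB with the defaults
    used = 0.0
    spills = 0
    for size in record_sizes_mb:
        used += size
        if used >= threshold:
            spills += 1      # background thread spills buffer contents to disk
            used = 0.0       # simplification: buffer drained instantly
    if used > 0:
        spills += 1          # final flush when the map task finishes
    return spills

# Ten 30 MB batches against an 80 MB threshold -> 4 spill files
print(count_spills([30] * 10))  # 4
```

With the defaults, a map task emitting more than ~80 MB of output necessarily produces multiple spill files, which is why the merge step described below exists.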

      Shuffle/Sort in MapReduce
      Before the MapReduce framework writes the output of each map task to disk, the output is partitioned by key and sorted within each partition. Partitioning ensures that all the values for a given key end up in the same partition, and hence go to the same reducer. If a combiner function is specified, it is run on the output of the sort.
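The partition-then-sort step can be illustrated with a short sketch — not Hadoop's actual code, but the same idea as its default HashPartitioner, which computes roughly key.hashCode() % numReduceTasks. The function name here is made up for illustration:

```python
def partition_and_sort(pairs, num_reducers):
    """Assign each (key, value) pair to a partition by key hash,
    then sort each partition by key, mimicking the map-side shuffle."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        p = hash(key) % num_reducers      # stand-in for HashPartitioner
        partitions[p].append((key, value))
    for part in partitions:
        part.sort(key=lambda kv: kv[0])   # sort by key within the partition
    return partitions

pairs = [("b", 1), ("a", 2), ("b", 3), ("c", 4)]
parts = partition_and_sort(pairs, 2)
# Both ("b", ...) records land in the same partition, and each partition is sorted.
```

Because the partition is a pure function of the key, every value for a given key is guaranteed to reach the same reducer.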
      Each time the memory buffer reaches the spill threshold, a new spill file is created, so by the time the map task has written its last output record there may be several spill files.
      Before the task finishes, the spill files are merged into a single partitioned and sorted output file. The maximum number of streams to merge at once is controlled by the io.sort.factor configuration property; by default, it is 10.
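Since each spill file is already sorted, the final file can be produced with a k-way merge. The sketch below (an illustration, not Hadoop's implementation; the function name is hypothetical) merges at most `factor` sorted runs per pass, so when there are more spill files than the merge factor, intermediate merge rounds are needed:

```python
import heapq

def merge_spills(runs, factor=10):
    """Merge sorted runs into one sorted list, at most `factor` runs at a
    time (modelling io.sort.factor). Extra rounds handle the overflow."""
    runs = list(runs)
    while len(runs) > 1:
        next_round = []
        for i in range(0, len(runs), factor):
            group = runs[i:i + factor]
            next_round.append(list(heapq.merge(*group)))  # k-way merge
        runs = next_round
    return runs[0] if runs else []

# Three sorted spill runs merged with a factor of 2 -> two merge rounds
print(merge_spills([[1, 4], [2, 5], [3, 6]], factor=2))  # [1, 2, 3, 4, 5, 6]
```

A larger merge factor means fewer passes over the data but more open file streams at once, which is the trade-off io.sort.factor controls.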

      If a combiner function has been specified and the number of spills is at least three (the value of the min.num.spills.for.combine property), the combiner is run again before the output file is written. Combiners can run repeatedly over the input without affecting the final result. When the combiner runs, there is less data to write to local disk and to transfer to the reducer, because the combiner compacts the map output.
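A tiny word-count sketch shows why running a combiner repeatedly is safe and why it shrinks the spill data — summing is associative and commutative, so partial aggregation on the map side can be applied any number of times (the function name is made up for this example):

```python
from collections import defaultdict

def combine(pairs):
    """Map-side partial aggregation: sum values per key, like a
    word-count combiner, returning sorted (key, total) pairs."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

spill = [("the", 1), ("cat", 1), ("the", 1), ("the", 1)]
print(combine(spill))           # [('cat', 1), ('the', 3)] -- 2 records instead of 4
print(combine(combine(spill)))  # running it again gives the same result
```

Only functions with this property (e.g. sum, max, count — but not mean) can be used directly as combiners without changing the final answer.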
      It is often a good idea to compress the map output before it is written to disk, since doing so makes the write faster, saves disk space, and reduces the amount of data to transfer to the reducer. Compression is enabled by setting mapred.compress.map.output to true.

      Follow the link to learn more about Mapper in Hadoop
