Why does the mapper write its output to local disk?

    • #5202
      DataFlair Team
      Spectator

      In the MapReduce data flow, why is the output of the mapper written to local disk? Explain the internals of the output write mechanism in the mapper phase — how is the mapper's output written to local disk?

    • #5204
      DataFlair Team
      Spectator

      The mapper takes input records generated by the RecordReader and produces key-value pairs. Mapper output is not simply written straight to the local disk; the process is more involved and takes advantage of buffering writes in memory and doing some presorting, for efficiency reasons.

      In Hadoop, each map task has a circular memory buffer to which the mapper writes its output. By default the buffer is 100 MB; its size can be changed with the io.sort.mb property. When the buffer contents reach a certain threshold (io.sort.spill.percent, default 0.80, i.e. 80%), a background thread starts to spill the contents to disk.
      Spills are written in round-robin fashion to the directories specified by the mapred.local.dir property.
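The buffer-and-spill behaviour described above can be sketched with a small simulation — a simplified model, not Hadoop's actual implementation (the real buffer keeps accepting writes into the remaining space while the spill thread drains it; here we assume the buffer empties completely on each spill). The function name and record-size inputs are hypothetical:

```python
def count_spills(record_sizes_mb, buffer_mb=100, spill_percent=0.80):
    """Count spill files produced as map output fills the in-memory buffer.

    buffer_mb      -- circular buffer size (io.sort.mb, default 100 MB)
    spill_percent  -- fill level that triggers a spill (io.sort.spill.percent)
    """
    threshold = buffer_mb * spill_percent   # 80 MB with the defaults
    used = 0.0
    spills = 0
    for size in record_sizes_mb:
        used += size
        if used >= threshold:
            spills += 1      # background thread spills buffer contents to disk
            used = 0.0       # simplification: buffer drained instantly
    if used > 0:
        spills += 1          # final flush when the map task finishes
    return spills

# Ten 30 MB batches against an 80 MB threshold -> 4 spill files
print(count_spills([30] * 10))  # 4
```

With the defaults, a map task emitting more than ~80 MB of output necessarily produces multiple spill files, which is why the merge step described below exists.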

      Shuffle/Sort in MapReduce
      Before the MapReduce framework writes the output of each map task to disk, the output is partitioned by key and sorted within each partition. Partitioning ensures that all the values for a given key end up in the same partition, and hence go to the same reducer. If a combiner function is specified, it is run on the output of the sort.
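The partition-then-sort step can be illustrated with a short sketch — not Hadoop's actual code, but the same idea as its default HashPartitioner, which computes roughly key.hashCode() % numReduceTasks. The function name here is made up for illustration:

```python
def partition_and_sort(pairs, num_reducers):
    """Assign each (key, value) pair to a partition by key hash,
    then sort each partition by key, mimicking the map-side shuffle."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        p = hash(key) % num_reducers      # stand-in for HashPartitioner
        partitions[p].append((key, value))
    for part in partitions:
        part.sort(key=lambda kv: kv[0])   # sort by key within the partition
    return partitions

pairs = [("b", 1), ("a", 2), ("b", 3), ("c", 4)]
parts = partition_and_sort(pairs, 2)
# Both ("b", ...) records land in the same partition, and each partition is sorted.
```

Because the partition is a pure function of the key, every value for a given key is guaranteed to reach the same reducer.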
      Each time the memory buffer reaches the spill threshold, a new spill file is created, so by the time the map task has written its last output record there may be several spill files.
      Before the task finishes, the spill files are merged into a single partitioned and sorted output file. The maximum number of streams to merge at once is controlled by the io.sort.factor configuration property; by default, it is 10.
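Since each spill file is already sorted, the final file can be produced with a k-way merge. The sketch below (an illustration, not Hadoop's implementation; the function name is hypothetical) merges at most `factor` sorted runs per pass, so when there are more spill files than the merge factor, intermediate merge rounds are needed:

```python
import heapq

def merge_spills(runs, factor=10):
    """Merge sorted runs into one sorted list, at most `factor` runs at a
    time (modelling io.sort.factor). Extra rounds handle the overflow."""
    runs = list(runs)
    while len(runs) > 1:
        next_round = []
        for i in range(0, len(runs), factor):
            group = runs[i:i + factor]
            next_round.append(list(heapq.merge(*group)))  # k-way merge
        runs = next_round
    return runs[0] if runs else []

# Three sorted spill runs merged with a factor of 2 -> two merge rounds
print(merge_spills([[1, 4], [2, 5], [3, 6]], factor=2))  # [1, 2, 3, 4, 5, 6]
```

A larger merge factor means fewer passes over the data but more open file streams at once, which is the trade-off io.sort.factor controls.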

      If a combiner function has been specified and the number of spills is at least three (the value of the min.num.spills.for.combine property), the combiner is run again before the output file is written. Combiners can run repeatedly over the input without affecting the final result. When the combiner runs, there is less data to write to local disk and to transfer to the reducer, because the combiner compacts the map output.
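A tiny word-count sketch shows why running a combiner repeatedly is safe and why it shrinks the spill data — summing is associative and commutative, so partial aggregation on the map side can be applied any number of times (the function name is made up for this example):

```python
from collections import defaultdict

def combine(pairs):
    """Map-side partial aggregation: sum values per key, like a
    word-count combiner, returning sorted (key, total) pairs."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

spill = [("the", 1), ("cat", 1), ("the", 1), ("the", 1)]
print(combine(spill))           # [('cat', 1), ('the', 3)] -- 2 records instead of 4
print(combine(combine(spill)))  # running it again gives the same result
```

Only functions with this property (e.g. sum, max, count — but not mean) can be used directly as combiners without changing the final answer.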
      It is often a good idea to compress the map output before it is written to disk, since doing so makes the write faster, saves disk space, and reduces the amount of data to transfer to the reducer. Compression is enabled by setting mapred.compress.map.output to true.

      Follow the link to learn more about Mapper in Hadoop
