In MapReduce, why does the map write its output to local disk instead of HDFS?


    • #6266
      DataFlair Team
      Spectator

      In the MapReduce data-processing flow, the output of the mapper is written to local disk, whereas the output of the reducer is written to HDFS. Why do mappers write their output to local disk? Can we configure mappers to write their output to HDFS?

    • #6267
      DataFlair Team
      Spectator

      The output of the Mapper is not written to HDFS because HDFS replicates every block across DataNodes (according to the replication factor) and the NameNode must hold metadata for each block. If the job were killed or terminated by some failure, a large amount of intermediate output would be left residing on HDFS, and cleaning it up would be tedious.

      To avoid such a situation, the intermediate output is written to local disk. The output of the Mapper is first written to an in-memory buffer (100 MB by default, configurable via the io.sort.mb property, known as mapreduce.task.io.sort.mb in newer releases). When the buffer fills to a threshold determined by io.sort.spill.percent (mapreduce.map.sort.spill.percent in newer releases), 80% by default, the data is spilled to local disk.
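
      For illustration, a minimal sketch of tuning those two properties on a job's Configuration, assuming the Hadoop 2.x property names; the job name and the specific values here are hypothetical:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.mapreduce.Job;

        public class SortBufferTuning {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                // In-memory sort buffer for map output, in MB (default 100).
                conf.setInt("mapreduce.task.io.sort.mb", 256);
                // Fraction of the buffer at which the background spill to
                // local disk starts (default 0.80).
                conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f);
                Job job = Job.getInstance(conf, "tuned-sort-buffer"); // hypothetical name
                // ... set mapper class, input/output paths, and submit as usual.
            }
        }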

      The output of the mapper can be written to HDFS when the job is a map-only job: the intermediate output is then the final output, so it can be written to HDFS by setting the number of reducer tasks to zero with job.setNumReduceTasks(0);, as in the sketch below.
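
      A minimal map-only job driver in Java; the class names, input/output paths, and the pass-through mapper are hypothetical, but the zero-reducer setting is exactly the call mentioned above:

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class MapOnlyJob {

            // A pass-through mapper; with zero reducers, its output
            // becomes the job's final output on HDFS.
            public static class PassThroughMapper
                    extends Mapper<LongWritable, Text, LongWritable, Text> {
                @Override
                protected void map(LongWritable key, Text value, Context context)
                        throws IOException, InterruptedException {
                    context.write(key, value);
                }
            }

            public static void main(String[] args) throws Exception {
                Job job = Job.getInstance(new Configuration(), "map-only job");
                job.setJarByClass(MapOnlyJob.class);
                job.setMapperClass(PassThroughMapper.class);
                job.setOutputKeyClass(LongWritable.class);
                job.setOutputValueClass(Text.class);
                // Zero reducers: map output goes directly to HDFS.
                job.setNumReduceTasks(0);
                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }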

    • #6268
      DataFlair Team
      Spectator

      Mapper output is just temporary data that is meaningful only to the reducer, not to the end user.
      We are interested in the final data generated after the shuffle and sort phases, so storing the intermediate output on HDFS, with replication, is not a good idea.

      Also, writing to HDFS is not like writing to local disk.
      It is a more involved process: the client must coordinate with the NameNode, which ensures that the required number of replicas is written, and the NameNode also runs a background process that makes additional copies of under-replicated blocks.
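
      To make the contrast concrete, a minimal sketch of a plain client write to HDFS through the FileSystem API; the file path and replication factor are hypothetical. Every such write is brokered by the NameNode and pipelined to multiple DataNodes, a cost a local-disk write avoids:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class HdfsWriteSketch {
            public static void main(String[] args) throws Exception {
                FileSystem fs = FileSystem.get(new Configuration());
                // The NameNode allocates blocks and tracks replicas for this
                // file; the bytes are pipelined to several DataNodes.
                Path out = new Path("/tmp/example.txt"); // hypothetical path
                try (FSDataOutputStream stream = fs.create(out, (short) 3)) {
                    stream.writeUTF("intermediate data would pay this cost");
                }
            }
        }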

      During the map phase, if the mapper wrote to HDFS and the job failed or was killed by the user partway through, lots of intermediate files would be left sitting on HDFS for no reason, occupying extra storage space.
      These are the reasons the map writes its intermediate output to local disk instead of HDFS.
      Note: In a map-only job, the mapper output is written to HDFS, since there are zero reducer tasks.
