In MapReduce, why does the map write its output to local disk instead of HDFS?


    • #6266
      DataFlair Team
      Spectator

      In the MapReduce data-processing flow, the output of the mapper is written to local disk, whereas the output of the reducer is written to HDFS. Why do mappers write their output to local disk? Can we configure mappers to write their output to HDFS?

    • #6267
      DataFlair Team
      Spectator

      The output of the Mapper is not written to HDFS because HDFS replicates every block across DataNodes (according to the replication factor) and the NameNode must hold metadata for each block. If the job were killed or terminated by some failure, a large amount of intermediate output would be left residing on HDFS, and cleaning it up would be tedious.

      To avoid such a situation, the intermediate output is written to local disk. The output of the Mapper is first written to an in-memory buffer (100 MB by default, configurable via the io.sort.mb property, known as mapreduce.task.io.sort.mb in newer releases). When the buffer fills to a threshold determined by io.sort.spill.percent (mapreduce.map.sort.spill.percent in newer releases), 80% by default, the data is spilled to local disk.
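
      For illustration, a minimal sketch of tuning those two properties on a job's Configuration, assuming the Hadoop 2.x property names; the job name and the specific values here are hypothetical:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.mapreduce.Job;

        public class SortBufferTuning {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                // In-memory sort buffer for map output, in MB (default 100).
                conf.setInt("mapreduce.task.io.sort.mb", 256);
                // Fraction of the buffer at which the background spill to
                // local disk starts (default 0.80).
                conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f);
                Job job = Job.getInstance(conf, "tuned-sort-buffer"); // hypothetical name
                // ... set mapper class, input/output paths, and submit as usual.
            }
        }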

      The output of the mapper can be written to HDFS when the job is a map-only job: the intermediate output is then the final output, so it can be written to HDFS by setting the number of reducer tasks to zero with job.setNumReduceTasks(0);, as in the sketch below.
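
      A minimal map-only job driver in Java; the class names, input/output paths, and the pass-through mapper are hypothetical, but the zero-reducer setting is exactly the call mentioned above:

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class MapOnlyJob {

            // A pass-through mapper; with zero reducers, its output
            // becomes the job's final output on HDFS.
            public static class PassThroughMapper
                    extends Mapper<LongWritable, Text, LongWritable, Text> {
                @Override
                protected void map(LongWritable key, Text value, Context context)
                        throws IOException, InterruptedException {
                    context.write(key, value);
                }
            }

            public static void main(String[] args) throws Exception {
                Job job = Job.getInstance(new Configuration(), "map-only job");
                job.setJarByClass(MapOnlyJob.class);
                job.setMapperClass(PassThroughMapper.class);
                job.setOutputKeyClass(LongWritable.class);
                job.setOutputValueClass(Text.class);
                // Zero reducers: map output goes directly to HDFS.
                job.setNumReduceTasks(0);
                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }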

    • #6268
      DataFlair Team
      Spectator

      Mapper output is just temporary data that is meaningful only to the reducer, not to the end user.
      We are interested in the final data generated after the shuffle and sort phases, so storing the intermediate output on HDFS, with replication, is not a good idea.

      Also, writing to HDFS is not like writing to local disk.
      It is a more involved process: the client must coordinate with the NameNode, which ensures that the required number of replicas is written, and the NameNode also runs a background process that makes additional copies of under-replicated blocks.
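
      To make the contrast concrete, a minimal sketch of a plain client write to HDFS through the FileSystem API; the file path and replication factor are hypothetical. Every such write is brokered by the NameNode and pipelined to multiple DataNodes, a cost a local-disk write avoids:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class HdfsWriteSketch {
            public static void main(String[] args) throws Exception {
                FileSystem fs = FileSystem.get(new Configuration());
                // The NameNode allocates blocks and tracks replicas for this
                // file; the bytes are pipelined to several DataNodes.
                Path out = new Path("/tmp/example.txt"); // hypothetical path
                try (FSDataOutputStream stream = fs.create(out, (short) 3)) {
                    stream.writeUTF("intermediate data would pay this cost");
                }
            }
        }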

      During the map phase, if the mapper wrote to HDFS and the job failed or was killed by the user partway through, lots of intermediate files would be left sitting on HDFS for no reason, occupying extra storage space.
      These are the reasons the map writes its intermediate output to local disk instead of HDFS.
      Note: In a map-only job, the mapper output is written to HDFS, since there are zero reducer tasks.
