How to get single file as the output from MapReduce Job
September 20, 2018 at 3:22 pm #5491
DataFlair Team
When I submit a MapReduce job for a small dataset, it produces one output file. But when I run the same job on a large volume of data, there are many files in the output directory. Why does MapReduce create lots of output files? How do I configure a MapReduce job to generate a single output file?
September 20, 2018 at 3:23 pm #5492
DataFlair Team
Each Reducer produces one output file with the name part-r-nnnnn, where nnnnn is a running sequence number based on the number of reducers running for the job. This is why you are getting lots of output files in the output directory.
To merge these output files into a single output file, we can run one more MapReduce job that reads them and writes a single merged file.
We can also use the following options:
hadoop fs -cat /some/where/on/hdfs/job-output/part-r-* > TheCombinedResultOfTheJob.txt
hdfs dfs -getmerge /some/where/on/hdfs/job-output/ /some/where/on/local-fs/merged-output.txt
Note that getmerge concatenates everything under the source directory into a single file on the local filesystem, so its destination is a local file path, not a directory on HDFS.
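If you prefer to do the merge programmatically and keep the result on HDFS, a minimal Java sketch using the FileSystem API could look like the one below. The class name MergePartFiles and the argument paths are hypothetical placeholders.

import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Hypothetical helper: concatenates the part-r-* files of a job's
// output directory into a single file on HDFS.
public class MergePartFiles {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path outputDir = new Path(args[0]);  // e.g. /some/where/on/hdfs/job-output
        Path merged = new Path(args[1]);     // e.g. /some/where/on/hdfs/merged.txt

        // Only pick up reducer output files, not _SUCCESS or logs.
        FileStatus[] parts = fs.listStatus(outputDir,
                p -> p.getName().startsWith("part-r-"));
        Arrays.sort(parts);  // keep part-r-00000, part-r-00001, ... in order

        try (OutputStream out = fs.create(merged)) {
            for (FileStatus part : parts) {
                try (InputStream in = fs.open(part.getPath())) {
                    IOUtils.copyBytes(in, out, 4096, false);
                }
            }
        }
    }
}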
September 20, 2018 at 3:23 pm #5493
DataFlair Team
The number of output files depends on the number of Reducers: if a job runs with N reducers, you get N output files. If you want a single file as output, use a single reducer.
1) One simple solution is to configure the job to run with only one reducer (a driver sketch follows this list).
2) Another way to deal with this is to have a concatenating script run at the end of your MapReduce job that pieces together all the part-r files, i.e. something like:
cat *part-r* >> alloutput
This may be a bit more complex if your files have headers, and you also need to copy the part files to the local filesystem first.
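For option 1, a minimal driver sketch is shown below; MyMapper, MyReducer, and the key/value types are placeholders for your own job classes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SingleOutputDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "single output job");
        job.setJarByClass(SingleOutputDriver.class);
        job.setMapperClass(MyMapper.class);    // placeholder mapper class
        job.setReducerClass(MyReducer.class);  // placeholder reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // One reduce task => exactly one output file (part-r-00000).
        // The trade-off: all map output funnels through a single reducer,
        // so this only scales for modest output sizes.
        job.setNumReduceTasks(1);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}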
September 20, 2018 at 3:23 pm #5494
DataFlair Team
In a Hadoop MapReduce job, each Reducer produces one output file with the name part-r-nnnnn, where nnnnn is the sequence number of the file, determined by the number of reducers set for the job.
To merge these output files into a single file in HDFS, one way is to use a single Reducer; alternatively, we can run a second MR job, with a set of mappers and a single reducer, over the first job's output.
We can also merge by using the options below:
hdfs dfs -getmerge <source-dir-on-hdfs> <dest-file-on-localfs>
or
hdfs dfs -cat <source-dir-on-hdfs>/part-r-* >> CombinedResult.txt
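For example (the paths here are hypothetical), the following merges all files under a job's output directory into one local file; the -nl option inserts a newline between the merged files:
hdfs dfs -getmerge -nl /user/hadoop/job-output /tmp/CombinedResult.txt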