How to get single file as the output from MapReduce Job
September 20, 2018 at 3:22 pm #5491
DataFlair Team
When I submit a MapReduce job for a small dataset, it produces one output file. But when I run the same job on a large volume of data, there are many files in the output directory. Why does MapReduce create lots of output files? How do I configure a MapReduce job to generate a single output file?
September 20, 2018 at 3:23 pm #5492
DataFlair Team
Each Reducer produces one output file with the name part-r-nnnnn, where nnnnn is a running sequence number based on the number of reducers running for the job. This is why you are getting lots of output files in the output directory.
To merge these output files into a single output file, we can run one more MapReduce job that reads them and writes a single merged file.
We can also use the following options:
hadoop fs -cat /some/where/on/hdfs/job-output/part-r-* > TheCombinedResultOfTheJob.txt
hdfs dfs -getmerge /some/where/on/hdfs/job-output/ /some/where/on/local-fs/merged-output.txt
Note that getmerge concatenates everything under the source directory into a single file on the local filesystem, so its destination is a local file path, not a directory on HDFS.
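If you prefer to do the merge programmatically and keep the result on HDFS, a minimal Java sketch using the FileSystem API could look like the one below. The class name MergePartFiles and the argument paths are hypothetical placeholders.

import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Hypothetical helper: concatenates the part-r-* files of a job's
// output directory into a single file on HDFS.
public class MergePartFiles {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path outputDir = new Path(args[0]);  // e.g. /some/where/on/hdfs/job-output
        Path merged = new Path(args[1]);     // e.g. /some/where/on/hdfs/merged.txt

        // Only pick up reducer output files, not _SUCCESS or logs.
        FileStatus[] parts = fs.listStatus(outputDir,
                p -> p.getName().startsWith("part-r-"));
        Arrays.sort(parts);  // keep part-r-00000, part-r-00001, ... in order

        try (OutputStream out = fs.create(merged)) {
            for (FileStatus part : parts) {
                try (InputStream in = fs.open(part.getPath())) {
                    IOUtils.copyBytes(in, out, 4096, false);
                }
            }
        }
    }
}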
September 20, 2018 at 3:23 pm #5493
DataFlair Team
The number of output files depends on the number of Reducers: if a job runs with N reducers, you get N output files. If you want a single file as output, use a single reducer.
1) One simple solution is to configure the job to run with only one reducer (a driver sketch follows this list).
2) Another way to deal with this is to have a concatenating script run at the end of your MapReduce job that pieces together all the part-r files, i.e. something like:
cat *part-r* >> alloutput
This may be a bit more complex if your files have headers, and you also need to copy the part files to the local filesystem first.
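For option 1, a minimal driver sketch is shown below; MyMapper, MyReducer, and the key/value types are placeholders for your own job classes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SingleOutputDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "single output job");
        job.setJarByClass(SingleOutputDriver.class);
        job.setMapperClass(MyMapper.class);    // placeholder mapper class
        job.setReducerClass(MyReducer.class);  // placeholder reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // One reduce task => exactly one output file (part-r-00000).
        // The trade-off: all map output funnels through a single reducer,
        // so this only scales for modest output sizes.
        job.setNumReduceTasks(1);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}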
September 20, 2018 at 3:23 pm #5494
DataFlair Team
In a Hadoop MapReduce job, each Reducer produces one output file with the name part-r-nnnnn, where nnnnn is the sequence number of the file, determined by the number of reducers set for the job.
To merge these output files into a single file in HDFS, one way is to use a single Reducer; alternatively, we can run a second MR job, with a set of mappers and a single reducer, over the first job's output.
We can also merge by using the options below:
hdfs dfs -getmerge <source-dir-on-hdfs> <dest-file-on-localfs>
or
hdfs dfs -cat <source-dir-on-hdfs>/part-r-* >> CombinedResult.txt
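For example (the paths here are hypothetical), the following merges all files under a job's output directory into one local file; the -nl option inserts a newline between the merged files:
hdfs dfs -getmerge -nl /user/hadoop/job-output /tmp/CombinedResult.txt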