When I submit a MapReduce job on a small dataset, it produces one output file. But when I run the same job on a large dataset, the output directory contains many files. Why does MapReduce create many output files, and how can I configure the job to produce a single output file?
Each reducer produces one output file named `part-r-nnnnn`, where `nnnnn` is a running sequence number based on the number of reducers running for the job. That is why you are seeing many output files in the output directory.
To merge these output files into a single file, you can run one more MapReduce job that reads them all and writes through a single reducer.
You can also use the following options.
The number of output files depends on the number of reducers: with N reducers, you get N output files in HDFS. If you want a single output file, use a single reducer.
1) One simple solution is to configure the job to run with only one reducer.
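A minimal driver sketch for this, assuming the standard `org.apache.hadoop.mapreduce` API (the class names and the mapper/reducer are placeholders, not from the original question):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SingleOutputDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "single-output-job");
        job.setJarByClass(SingleOutputDriver.class);
        job.setMapperClass(MyMapper.class);    // placeholder mapper class
        job.setReducerClass(MyReducer.class);  // placeholder reducer class

        // Route all map output through one reducer -> one part-r-00000 file.
        // Caveat: this serializes the reduce phase, so it can become a
        // bottleneck on very large datasets.
        job.setNumReduceTasks(1);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```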
2) Another way to deal with this is to run a concatenating script at the end of your MapReduce job that pieces together all the `part-r-*` files, i.e. something like
cat part-r-* >> alloutput
This may be a bit more involved if your files have headers, and you also need to copy the files to the local filesystem first.
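The copy-and-concatenate step can be sketched locally like this (the directory name and file contents are made up for illustration; real reducer output uses the same `part-r-nnnnn` naming):

```shell
# Simulate a job output directory containing two reducer part files.
mkdir -p job_output
printf 'apple\t3\n'  > job_output/part-r-00000
printf 'banana\t5\n' > job_output/part-r-00001

# Shell globs expand in sorted order, so the parts are
# concatenated in reducer order (00000, 00001, ...).
cat job_output/part-r-* > alloutput
cat alloutput
```

On a real cluster, `hadoop fs -getmerge <hdfs-output-dir> <local-file>` does the copy-to-local and concatenation in a single step.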