This topic contains 3 replies, has 1 voice, and was last updated by  dfbdteam3 1 year, 6 months ago.

Viewing 4 posts - 1 through 4 (of 4 total)
  • Author
    Posts
  • #5491

    dfbdteam3
    Moderator

When I submit a MapReduce job for small data, it produces one output file. But when I run the same job on a large volume of data, there are many files in the output directory. Why does MapReduce create so many output files, and how can I configure the job to generate a single output file?

    #5492

    dfbdteam3
    Moderator

Each reducer produces one output file named part-r-nnnnn, where nnnnn is a running sequence number determined by the number of reducers running for the job. This is why you are getting many output files in the output directory.
    To merge these output files into a single output file, we can run one more MapReduce job with a single reducer that merges them into one file.
    We can also use the following options:

    hadoop fs -cat /some/where/on/hdfs/job-output/part-r-* > TheCombinedResultOfTheJob.txt

hdfs dfs -getmerge /some/where/on/hdfs/job-output/ /some/where/on/local-fs/merged-output.txt
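As a quick local illustration of the cat-style merge (the directory and file contents below are made up for the example; a real job writes its own part files):

```shell
# Simulate a job output directory containing two reducer part files
mkdir -p job-output
printf 'alpha\t1\n' > job-output/part-r-00000
printf 'beta\t2\n'  > job-output/part-r-00001

# Concatenate every part file into one combined result, just like
# the hadoop fs -cat pipeline above does against HDFS
cat job-output/part-r-* > TheCombinedResultOfTheJob.txt
```

Note that the shell expands part-r-* in sorted order, so the combined file preserves the reducer numbering.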

    #5493

    dfbdteam3
    Moderator

The number of output files depends on the number of reducers: if a job runs with N reducers, it writes N output files to HDFS. If you want a single file as output, use a single reducer.

1) One simple solution is to configure the job to run with only one reducer.
    2) Another way to deal with this is to run a concatenating script at the end of your MapReduce job that pieces together all the part-r files, i.e. something like:

    cat part-r-* >> alloutput

    This may be a bit more complex if your files have headers, and you first need to copy the files to the local filesystem.
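For option 1, the reducer count can also be set on the command line through Hadoop's generic options, without changing the driver code. The jar and class names below are placeholders, and this assumes the driver supports generic options via ToolRunner:

    hadoop jar my-job.jar com.example.MyDriver \
        -D mapreduce.job.reduces=1 \
        /input/path /output/path

Keep in mind that with a single reducer all intermediate data flows through one task, which can become a bottleneck for very large datasets.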

    #5494

    dfbdteam3
    Moderator

In a Hadoop MapReduce job, each reducer produces one output file named part-r-nnnnn, where nnnnn is the sequence number of the file, determined by the number of reducers set for the job.

To merge these output files into a single file in HDFS, one option is to run the job with a single reducer; alternatively, we can add a follow-up MR job with a set of mappers and a single reducer.

We can also merge by using the following options:

hdfs dfs -getmerge <source-dir-on-hdfs> <dest-file-on-localfs>

    or
hdfs dfs -cat <source-dir-on-hdfs>/part-r-* >> CombinedResult.txt

