How to get a single file as the output from a MapReduce job

    • #5491
      DataFlair Team
      Spectator

      When I submit a MapReduce job for small data, it produces one output file. But when I run the same job over huge volumes of data, there are lots of files in the output directory. Why does MapReduce create lots of output files? How can I configure a MapReduce job to generate a single output file?

    • #5492
      DataFlair Team
      Spectator

      Each Reducer produces one output file named part-r-nnnnn, where nnnnn is a running sequence number based on the number of reducers running for the job. This is why you are getting lots of output files in the output directory.
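      For illustration, a job that ran with three reducers will typically leave an output directory containing files like the following (the _SUCCESS marker is an empty file written when the job completes successfully):

      _SUCCESS
      part-r-00000
      part-r-00001
      part-r-00002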
      To merge these output files into a single file, we can run one more MapReduce job that merges them into one file.
      We can also use the following options:

      hadoop fs -cat /some/where/on/hdfs/job-output/part-r-* > TheCombinedResultOfTheJob.txt

      hdfs dfs -getmerge /some/where/on/hdfs/job-output/ /some/where/on/local-fs/TheCombinedResultOfTheJob.txt
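      Note that both options above write the merged result to the local filesystem (the -cat version through a shell redirect, -getmerge directly; its second argument is a local file path). If the single file is needed back in HDFS, it can be uploaded again; a minimal sketch with hypothetical paths:

      hdfs dfs -put /some/where/on/local-fs/TheCombinedResultOfTheJob.txt /some/where/on/hdfs/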

    • #5493
      DataFlair Team
      Spectator

      The number of output files depends on the Reducers: if a job runs with N reducers, we get N output files in HDFS. If you want a single file as output, use a single reducer.

      1) One simple solution is to configure the job to run with only one reducer (see the sketch after this list).
      2) Another way to deal with this is to run a concatenating script at the end of your MapReduce job that pieces together all the part-r files, i.e. something like

      cat *part-r* >> alloutput

      This may be a bit more complex if you have headers, and you also need to copy the files to the local filesystem first.
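      For option 1, if the job driver uses ToolRunner/GenericOptionsParser, the reducer count can be set from the command line with no code change; a minimal sketch, assuming a hypothetical job jar myjob.jar with driver class MyJob:

      hadoop jar myjob.jar MyJob -D mapreduce.job.reduces=1 /input /output

      The equivalent inside the driver code is job.setNumReduceTasks(1). Keep in mind that a single reducer funnels all map output through one task, which can be slow for very large datasets.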

    • #5494
      DataFlair Team
      Spectator

      In a Hadoop MapReduce job, each Reducer produces one output file with the name part-r-nnnnn, where nnnnn is the sequence number of the file, based on the number of reducers set for the job.

      To merge these output files into a single file in HDFS, one way is to use a single Reducer; alternatively, we can run an additional MapReduce job with a set of mappers and a single reducer to merge the files.

      We can also merge using the commands below:

      hdfs dfs -getmerge <source-dir-on-hdfs> <dest-file-on-localfs>

      or
      hdfs dfs -cat <source-dir-on-hdfs>/part-r-* >> CombinedResult.txt
