Is it possible to have Hadoop job output in multiple directories? If yes, how?


Viewing 1 reply thread
  • Author
    Posts
    • #6075
DataFlair Team
      Spectator

Is it possible to have Apache Hadoop job output written to multiple directories? If yes, how?

    • #6077
DataFlair Team
      Spectator

Yes, it is possible to have the output of a Hadoop MapReduce job written to multiple directories.

In Hadoop MapReduce, the output of the Reducer is the final output of a job, and it is written to the Hadoop Distributed File System (HDFS).

The task of writing the output to HDFS is done by the RecordWriter, with the help of the OutputFormat.

OutputFormat is an interface that defines how the (key, value) pairs produced by the Reducer are written to the output files. TextOutputFormat is the default OutputFormat.

There is an abstract class called FileOutputFormat, which is the base class for all file-based OutputFormats.
By default, FileOutputFormat and its subclasses generate a set of output files in a single output directory.
There is one file per reducer, and the files are named by the partition number: part-00000, part-00001, etc.
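The part-file names mentioned above are just the partition number zero-padded to five digits. As a minimal sketch (plain Java, no Hadoop dependency; the real naming lives inside FileOutputFormat):

```java
public class PartFileName {
    // Default part-file naming: "part-" plus the partition number,
    // zero-padded to five digits (part-00000, part-00001, ...).
    static String name(int partition) {
        return String.format("part-%05d", partition);
    }

    public static void main(String[] args) {
        for (int p = 0; p < 3; p++) {
            System.out.println(name(p)); // part-00000, part-00001, part-00002
        }
    }
}
```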

But to write output to multiple files, we use the class called MultipleOutputFormat.
This abstract class extends FileOutputFormat and allows the output data to be written to different output files.
There are three basic use cases for this class.
Case one: the class is used for a MapReduce job with at least one reducer, where the reducer wants to write data to different files depending on the actual keys. It is assumed that a key (or value) encodes both the actual key (value) and the desired location for the actual key (value).
Case two: the class is used for a map-only job, where the job wants to use an output file name that is either a part of the input file name, or some derivation of it.
Case three: the class is used for a map-only job, where the job wants to use an output file name that depends on both the keys and the input file name.
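For case one, a subclass of MultipleOutputFormat overrides its generateFileNameForKeyValue() hook to derive a file name from the key. The derivation itself can be sketched in plain Java (Hadoop types replaced with Strings so the sketch stands alone; the method name mirrors the real hook, and the key-per-directory layout is an illustrative assumption):

```java
public class KeyBasedNaming {
    // Mirrors MultipleOutputFormat.generateFileNameForKeyValue(key, value, name):
    // place each record under a directory named after its key, keeping the
    // default leaf file name (e.g. part-00000) unchanged.
    static String generateFileNameForKeyValue(String key, String value, String leafName) {
        return key + "/" + leafName;
    }

    public static void main(String[] args) {
        // A record keyed "2023" lands in the 2023/ subdirectory.
        System.out.println(generateFileNameForKeyValue("2023", "someRecord", "part-00000"));
    }
}
```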

Writing output to multiple directories falls under case one.

1. Create the named outputs using MultipleOutputs.
Add this to your driver code:
MultipleOutputs.addNamedOutput(job, "OutputFileName", OutputFormatClass, keyClass, valueClass);

MultipleOutputs provides overloaded write() methods. The basic form writes to the named output inside the job's output directory:

multipleOutputs.write("OutputFileName", new Text(key), new Text(value));

2. To write the output files to separate output directories, use the overloaded write() method that takes an extra parameter for the base output path:

multipleOutputs.write("OutputFileName", new Text(key), new Text(value), baseOutputPath);

Remember to use a different baseOutputPath for each output so that the files land in separate directories.
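In practice the base output path is usually derived from the reduce key, so that records with different keys land in different directories. The derivation can be sketched in plain Java (the MultipleOutputs call is shown only as a comment, and the key-per-directory layout is an illustrative assumption, not the only option):

```java
public class BasePathPerKey {
    // Derive a baseOutputPath from the reduce key; the framework appends the
    // usual part-file leaf name under this prefix, so each key gets its own
    // subdirectory of the job's output directory.
    static String basePathFor(String key) {
        return key + "/part";
    }

    public static void main(String[] args) {
        // In a real reducer this would be:
        // multipleOutputs.write("OutputFileName", new Text(key), new Text(value),
        //                       basePathFor(key));
        System.out.println(basePathFor("errors"));   // errors/part
        System.out.println(basePathFor("warnings")); // warnings/part
    }
}
```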
