Is it possible to have Hadoop job output in multiple directories? If yes, how?


Viewing 1 reply thread
  • Author
    Posts
    • #6075
DataFlair Team
      Spectator

Is it possible to have Apache Hadoop job output written to multiple directories? If yes, how?

    • #6077
DataFlair Team
      Spectator

Yes, it is possible to have the output of a Hadoop MapReduce job written to multiple directories.

In Hadoop MapReduce, the output of the Reducer is the final output of a job, and it is written to the Hadoop Distributed File System (HDFS).

The task of writing the output to HDFS is done by the RecordWriter, with the help of the OutputFormat.

OutputFormat is an interface that defines how the (key, value) pairs produced by the Reducer are written to the output files. TextOutputFormat is the default OutputFormat.

There is an abstract class called FileOutputFormat, which is the base class for all file-based OutputFormats.
By default, FileOutputFormat and its subclasses generate a set of output files in a single output directory.
There is one file per reducer, and the files are named by the partition number: part-00000, part-00001, etc.
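The part-file names mentioned above are just the partition number zero-padded to five digits. As a minimal sketch (plain Java, no Hadoop dependency; the real naming lives inside FileOutputFormat):

```java
public class PartFileName {
    // Default part-file naming: "part-" plus the partition number,
    // zero-padded to five digits (part-00000, part-00001, ...).
    static String name(int partition) {
        return String.format("part-%05d", partition);
    }

    public static void main(String[] args) {
        for (int p = 0; p < 3; p++) {
            System.out.println(name(p)); // part-00000, part-00001, part-00002
        }
    }
}
```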

But to write output to multiple files, we use the class called MultipleOutputFormat.
This abstract class extends FileOutputFormat and allows the output data to be written to different output files.
There are three basic use cases for this class.
Case one: the class is used for a MapReduce job with at least one reducer, where the reducer wants to write data to different files depending on the actual keys. It is assumed that a key (or value) encodes both the actual key (value) and the desired location for the actual key (value).
Case two: the class is used for a map-only job, where the job wants to use an output file name that is either a part of the input file name, or some derivation of it.
Case three: the class is used for a map-only job, where the job wants to use an output file name that depends on both the keys and the input file name.
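For case one, a subclass of MultipleOutputFormat overrides its generateFileNameForKeyValue() hook to derive a file name from the key. The derivation itself can be sketched in plain Java (Hadoop types replaced with Strings so the sketch stands alone; the method name mirrors the real hook, and the key-per-directory layout is an illustrative assumption):

```java
public class KeyBasedNaming {
    // Mirrors MultipleOutputFormat.generateFileNameForKeyValue(key, value, name):
    // place each record under a directory named after its key, keeping the
    // default leaf file name (e.g. part-00000) unchanged.
    static String generateFileNameForKeyValue(String key, String value, String leafName) {
        return key + "/" + leafName;
    }

    public static void main(String[] args) {
        // A record keyed "2023" lands in the 2023/ subdirectory.
        System.out.println(generateFileNameForKeyValue("2023", "someRecord", "part-00000"));
    }
}
```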

Writing output to multiple directories falls under case one.

1. Create the named outputs using MultipleOutputs.
Add this to your driver code:
MultipleOutputs.addNamedOutput(job, "OutputFileName", OutputFormatClass, keyClass, valueClass);

MultipleOutputs provides overloaded write() methods. The basic form writes to the named output inside the job's output directory:

multipleOutputs.write("OutputFileName", new Text(key), new Text(value));

2. To write the output files to separate output directories, use the overloaded write() method that takes an extra parameter for the base output path:

multipleOutputs.write("OutputFileName", new Text(key), new Text(value), baseOutputPath);

Remember to use a different baseOutputPath for each output so that the files land in separate directories.
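In practice the base output path is usually derived from the reduce key, so that records with different keys land in different directories. The derivation can be sketched in plain Java (the MultipleOutputs call is shown only as a comment, and the key-per-directory layout is an illustrative assumption, not the only option):

```java
public class BasePathPerKey {
    // Derive a baseOutputPath from the reduce key; the framework appends the
    // usual part-file leaf name under this prefix, so each key gets its own
    // subdirectory of the job's output directory.
    static String basePathFor(String key) {
        return key + "/part";
    }

    public static void main(String[] args) {
        // In a real reducer this would be:
        // multipleOutputs.write("OutputFileName", new Text(key), new Text(value),
        //                       basePathFor(key));
        System.out.println(basePathFor("errors"));   // errors/part
        System.out.println(basePathFor("warnings")); // warnings/part
    }
}
```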
