What are the most common OutputFormats in Hadoop?

    • #6216
      DataFlair Team
      Spectator

      What is OutputFormat in Hadoop MapReduce?
      How many types of OutputFormat are there in Hadoop?
      What are the different types of OutputFormat in MapReduce?

    • #6217
      DataFlair Team
      Spectator

      The default OutputFormat in Hadoop is TextOutputFormat. If no output format is specified explicitly, the job writes its output as text files.
      TextOutputFormat: It writes records one per line, converting keys and values to strings and separating them with a tab character.

    • #6219
      DataFlair Team
      Spectator

      Hadoop provides output formats that correspond to each input format. All Hadoop output formats must extend the abstract class org.apache.hadoop.mapreduce.OutputFormat.

      OutputFormat describes the output specification for a MapReduce job. Based on this specification, the MapReduce job checks that the output directory doesn’t already exist.

      OutputFormat provides the RecordWriter implementation to be used to write out the output files of the job.

      These two responsibilities are met by the following two methods of OutputFormat:

      public abstract void checkOutputSpecs(JobContext context)
          throws IOException, InterruptedException;

      This method checks that the output directory does not already exist and throws an exception if it does, so that existing output is not overwritten.
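
The idea behind this check can be illustrated without Hadoop. The sketch below is a stand-in that uses only java.nio on the local file system; the class OutputSpecCheck and its method check are hypothetical names chosen for illustration, not part of the Hadoop API. It only shows the principle of failing early when the output directory is already present.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for the idea behind checkOutputSpecs; not Hadoop code.
class OutputSpecCheck {
    // Throws if the output directory already exists, so earlier results
    // are never overwritten. Returns normally when the path is free.
    static void check(Path outputDir) throws IOException {
        if (Files.exists(outputDir)) {
            throw new FileAlreadyExistsException(
                "Output directory " + outputDir + " already exists");
        }
    }
}
```

Calling OutputSpecCheck.check on an existing directory throws before any work is done, which is the behaviour a job relies on to protect previous results.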

      public abstract RecordWriter<K,V> getRecordWriter(TaskAttemptContext context)
          throws IOException, InterruptedException;

      This method returns the RecordWriter for the given task.

      Implementations of the org.apache.hadoop.mapreduce.RecordWriter<K,V> class write the output <key, value> pairs to an output file.
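
To make the role of a record writer concrete, here is a simplified model in plain Java. SimpleRecordWriter and TabSeparatedWriter are hypothetical classes written for this post; they are not Hadoop's RecordWriter, and they take a java.io.Writer instead of a task context. The sketch assumes the tab-separated, one-record-per-line layout described for text output in this thread.

```java
import java.io.IOException;
import java.io.Writer;

// Hypothetical, simplified model of a record writer; not Hadoop's RecordWriter.
abstract class SimpleRecordWriter<K, V> {
    abstract void write(K key, V value) throws IOException;
    abstract void close() throws IOException;
}

// Writes one record per line as key<TAB>value, calling toString() on both.
class TabSeparatedWriter<K, V> extends SimpleRecordWriter<K, V> {
    private final Writer out;

    TabSeparatedWriter(Writer out) {
        this.out = out;
    }

    @Override
    void write(K key, V value) throws IOException {
        out.write(key.toString());
        out.write('\t');
        out.write(value.toString());
        out.write('\n');
    }

    @Override
    void close() throws IOException {
        out.close();
    }
}
```

For example, wrapping a java.io.StringWriter and calling write("hadoop", 3) leaves the line "hadoop", a tab, "3" and a newline in the underlying writer.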

      Built-In Hadoop Output Formats

      Hadoop provides several built-in OutputFormat implementations in the org.apache.hadoop.mapreduce.lib.output package:

      FileOutputFormat

      Base class for all file-based OutputFormat implementations.

      Some of the important subclasses of the FileOutputFormat class are:

      TextOutputFormat

      The default output format provided by Hadoop is TextOutputFormat, which writes records as lines of text. If no output format is specified explicitly, the job writes its output as text files.

      Output keys and values can be of any type, because TextOutputFormat converts them to strings with the toString() method. Keys and values are separated by a tab by default.
      To read these text files back as input, KeyValueTextInputFormat is the most suitable choice, since it splits each input line into a key and a value at a separator character.
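
The split performed when reading such lines back can be sketched in a few lines of plain Java. KeyValueLine and parse are hypothetical names used only here, not Hadoop classes. The sketch assumes the split happens at the first occurrence of the separator, and that a line without any separator becomes a key with an empty value.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical helper illustrating a split on the first separator; not Hadoop code.
class KeyValueLine {
    static Map.Entry<String, String> parse(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            // No separator: treat the whole line as the key, with an empty value.
            return new SimpleEntry<>(line, "");
        }
        // Everything before the first separator is the key, the rest is the value.
        return new SimpleEntry<>(line.substring(0, pos), line.substring(pos + 1));
    }
}
```

Because only the first separator is used, a value that itself contains tabs stays intact on the value side.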

      SequenceFileOutputFormat

      This output format writes sequence files, which are a good choice when the output will be fed into another MapReduce job as input, since they are compact and can be compressed.

      SequenceFileAsBinaryOutputFormat

      SequenceFileAsBinaryOutputFormat is a direct subclass of SequenceFileOutputFormat and is the counterpart of SequenceFileAsBinaryInputFormat. It writes keys and values to sequence files in binary format.

      MapFileOutputFormat

      MapFileOutputFormat is also a direct subclass of FileOutputFormat. It writes the output as MapFiles.

      MultipleOutputs

      The MultipleOutputs class is used to write output data to multiple outputs. It has two main use cases:

      Job output can be written to additional outputs besides the default output. Each additional output, or named output, may be configured with its own OutputFormat, key class, and value class.
      Data can be written to different files whose names are provided by the user.

      MultipleOutputs supports counters that count the number of records written to each named output. These counters are disabled by default.
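
The routing idea can be modelled in plain Java to show how named outputs and optional counters fit together. NamedOutputs is a hypothetical class invented for this illustration; it keeps everything in memory and is not Hadoop's MultipleOutputs, whose real API takes a task context and writes files. Its constructor flag is an assumption that stands in for the counters being off unless switched on.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of routing records to named outputs; not Hadoop's MultipleOutputs.
class NamedOutputs {
    private final Map<String, StringBuilder> outputs = new LinkedHashMap<>();
    private final Map<String, Long> counters = new LinkedHashMap<>();
    private final boolean countersEnabled;

    NamedOutputs(boolean countersEnabled) {
        this.countersEnabled = countersEnabled;
    }

    // Appends key<TAB>value to the in-memory buffer for the given output name.
    void write(String name, Object key, Object value) {
        outputs.computeIfAbsent(name, n -> new StringBuilder())
               .append(key).append('\t').append(value).append('\n');
        if (countersEnabled) {
            counters.merge(name, 1L, Long::sum);
        }
    }

    // Returns everything written to the named output so far.
    String contents(String name) {
        StringBuilder sb = outputs.get(name);
        return sb == null ? "" : sb.toString();
    }

    // Returns the record count for the named output, or 0 if counting is off.
    long count(String name) {
        return counters.getOrDefault(name, 0L);
    }
}
```

With counting switched off, records are still routed to their named output, but count always reports 0.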
