What is InputFormat in Hadoop?

  • Author
    Posts
    • #6203
      DataFlair Team
      Spectator

      Explain what InputFormat is.

    • #6205
      DataFlair TeamDataFlair Team
      Spectator

      Every Mapper processes one data block. There are two components between the block and the Mapper:
      1) InputSplit: the logical representation of the data in a block (the block is the physical representation). By default, the size of an InputSplit is the same as the block size.

      2) RecordReader: it reads the data from the block record by record and then submits each record to the Mapper as a <Key, Value> pair.

      InputFormat is the component responsible for creating the InputSplit and RecordReader components, i.e. it defines how the input files are split up and read.
      To avoid splitting and process the whole file in a single Mapper, override the isSplitable() method of the InputFormat so that it returns false, as in the sketch below.
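      As a minimal sketch (the class name is illustrative, not a standard Hadoop class), a non-splittable TextInputFormat would look like this:

      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapreduce.JobContext;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

      // A TextInputFormat that never splits its input, so each input file is
      // handled end-to-end by a single Mapper.
      public class NonSplittableTextInputFormat extends TextInputFormat {
          @Override
          protected boolean isSplitable(JobContext context, Path file) {
              return false; // the whole file becomes one InputSplit
          }
      }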

      There are various types of InputFormat:

      a) TextInputFormat: the default InputFormat of MapReduce. It treats each line of each input file as a separate record and performs no parsing.

      • Key: the byte offset of the line within the file.
      • Value: the contents of the line, excluding line terminators.

      b) KeyValueTextInputFormat: similar to TextInputFormat in that it treats each line of input as a separate record. The difference is that TextInputFormat treats the entire line as the value, while KeyValueTextInputFormat breaks the line itself into a key and a value at the tab character ('\t').

      • Key: everything up to the first tab character.
      • Value: the remainder of the line after the tab character.

      c) SequenceFileInputFormat: an InputFormat that reads sequence files.

      • Key & Value- Both are user-defined.

      d) SequenceFileAsTextInputFormat: a variant of SequenceFileInputFormat that converts the sequence file’s keys and values to Text objects by calling toString() on them. This makes sequence files suitable input for Streaming.
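      As a rough sketch of how one of these formats is selected (class and path names are illustrative, and the separator property name may vary slightly across Hadoop releases), the InputFormat is simply configured on the Job:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

      public class KeyValueJobSetup {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              // Split each line at the first tab: everything before it is the key,
              // the rest is the value (tab is already the default separator).
              conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", "\t");

              Job job = Job.getInstance(conf, "keyvalue-example");
              job.setJarByClass(KeyValueJobSetup.class);
              job.setInputFormatClass(KeyValueTextInputFormat.class); // instead of the default TextInputFormat
              FileInputFormat.addInputPath(job, new Path(args[0]));   // input path from the command line
              // ... set Mapper/Reducer classes, output key/value types and output path as usual ...
          }
      }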

      Follow the link to learn more about InputFormat in Hadoop

    • #6207
      DataFlair Team
      Spectator

      InputFormat is a class in the org.apache.hadoop.mapreduce package with the below two responsibilities.

      1. To describe how an input file is split into InputSplits.
      2. To create a RecordReader that generates the series of key/value pairs from a split.

      After this, the RecordReader creates key/value pairs from the input split and writes them to the Context, which is shared with the Mapper class. The Mapper’s run() method retrieves these key/value pairs from the Context by calling the getCurrentKey() and getCurrentValue() methods and passes them to the map() method for further processing of each record, roughly as sketched below.
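      A simplified sketch of that flow (it mirrors what the default Mapper.run() already does; the class name and type parameters are illustrative):

      import java.io.IOException;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;

      // Overriding run() to make the record-by-record loop explicit.
      public class ExplicitLoopMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
          @Override
          public void run(Context context) throws IOException, InterruptedException {
              setup(context);                     // one-time initialisation
              while (context.nextKeyValue()) {    // RecordReader advances to the next record in the split
                  map(context.getCurrentKey(),    // key from the RecordReader (byte offset for TextInputFormat)
                      context.getCurrentValue(),  // value from the RecordReader (the line contents)
                      context);                   // map() emits its output via context.write()
              }
              cleanup(context);                   // one-time teardown
          }
      }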

      There are mainly 5 types of InputFormat:

      1) TextInputFormat: each line is treated as the value (its byte offset is the key)
      2) KeyValueTextInputFormat: the part of the line before the delimiter is the key and the rest is the value
      3) FixedLengthInputFormat: reads records of a fixed byte length; each fixed-length record is the value
      4) NLineInputFormat: each split (and therefore each Mapper) receives a fixed number N of input lines (see the sketch after this list)
      5) SequenceFileInputFormat: for binary sequence files

      There is also DBInputFormat, for reading rows from a relational database over JDBC.
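      For instance (a hedged sketch; the input path is illustrative), NLineInputFormat can be configured so that every Mapper receives exactly 100 input lines:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

      public class NLineJobSetup {
          public static void main(String[] args) throws Exception {
              Job job = Job.getInstance(new Configuration(), "nline-example");
              job.setJarByClass(NLineJobSetup.class);
              job.setInputFormatClass(NLineInputFormat.class);
              NLineInputFormat.addInputPath(job, new Path(args[0])); // input path from the command line
              NLineInputFormat.setNumLinesPerSplit(job, 100);        // each split (hence each Mapper) gets 100 lines
              // ... Mapper/Reducer and output configuration as usual ...
          }
      }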

      Follow the link to learn more about InputFormat in Hadoop
