What is InputFormat in Hadoop?

  • Author
    Posts
    • #6203
      DataFlair Team
      Spectator

      Explain what InputFormat is.

    • #6205
      DataFlair TeamDataFlair Team
      Spectator

      Every Mapper processes one data block. There are two components between the block and the Mapper:
      1) InputSplit: the logical representation of the data in a block (the block is the physical representation). By default, the size of an InputSplit is the same as the block size.

      2) RecordReader: it reads the data from the block record by record and then submits each record to the Mapper as a <Key, Value> pair.

      InputFormat is the component responsible for creating the InputSplit and RecordReader components, i.e. it defines how the input files are split up and read.
      To avoid splitting and process the whole file in a single Mapper, override the isSplitable() method of the InputFormat so that it returns false, as in the sketch below.
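      As a minimal sketch (the class name is illustrative, not a standard Hadoop class), a non-splittable TextInputFormat would look like this:

      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapreduce.JobContext;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

      // A TextInputFormat that never splits its input, so each input file is
      // handled end-to-end by a single Mapper.
      public class NonSplittableTextInputFormat extends TextInputFormat {
          @Override
          protected boolean isSplitable(JobContext context, Path file) {
              return false; // the whole file becomes one InputSplit
          }
      }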

      There are various types of InputFormat:

      a) TextInputFormat: the default InputFormat of MapReduce. It treats each line of each input file as a separate record and performs no parsing.

      • Key: the byte offset of the line within the file.
      • Value: the contents of the line, excluding line terminators.

      b) KeyValueTextInputFormat: similar to TextInputFormat in that it treats each line of input as a separate record. The difference is that TextInputFormat treats the entire line as the value, while KeyValueTextInputFormat breaks the line itself into a key and a value at the tab character ('\t').

      • Key: everything up to the first tab character.
      • Value: the remainder of the line after the tab character.

      c) SequenceFileInputFormat: an InputFormat that reads sequence files.

      • Key & Value- Both are user-defined.

      d) SequenceFileAsTextInputFormat: a variant of SequenceFileInputFormat that converts the sequence file’s keys and values to Text objects by calling toString() on them. This makes sequence files suitable input for Streaming.
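      As a rough sketch of how one of these formats is selected (class and path names are illustrative, and the separator property name may vary slightly across Hadoop releases), the InputFormat is simply configured on the Job:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

      public class KeyValueJobSetup {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              // Split each line at the first tab: everything before it is the key,
              // the rest is the value (tab is already the default separator).
              conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", "\t");

              Job job = Job.getInstance(conf, "keyvalue-example");
              job.setJarByClass(KeyValueJobSetup.class);
              job.setInputFormatClass(KeyValueTextInputFormat.class); // instead of the default TextInputFormat
              FileInputFormat.addInputPath(job, new Path(args[0]));   // input path from the command line
              // ... set Mapper/Reducer classes, output key/value types and output path as usual ...
          }
      }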

      Follow the link to learn more about InputFormat in Hadoop

    • #6207
      DataFlair Team
      Spectator

      InputFormat is a class in the org.apache.hadoop.mapreduce package with the below two responsibilities.

      1. To describe how an input file is split into InputSplits.
      2. To create a RecordReader that generates the series of key/value pairs from a split.

      After this, the RecordReader creates key/value pairs from the input split and writes them to the Context, which is shared with the Mapper class. The Mapper’s run() method retrieves these key/value pairs from the Context by calling the getCurrentKey() and getCurrentValue() methods and passes them to the map() method for further processing of each record, roughly as sketched below.
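      A simplified sketch of that flow (it mirrors what the default Mapper.run() already does; the class name and type parameters are illustrative):

      import java.io.IOException;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;

      // Overriding run() to make the record-by-record loop explicit.
      public class ExplicitLoopMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
          @Override
          public void run(Context context) throws IOException, InterruptedException {
              setup(context);                     // one-time initialisation
              while (context.nextKeyValue()) {    // RecordReader advances to the next record in the split
                  map(context.getCurrentKey(),    // key from the RecordReader (byte offset for TextInputFormat)
                      context.getCurrentValue(),  // value from the RecordReader (the line contents)
                      context);                   // map() emits its output via context.write()
              }
              cleanup(context);                   // one-time teardown
          }
      }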

      There are mainly 5 types of InputFormat:

      1) TextInputFormat: each line is treated as the value (its byte offset is the key)
      2) KeyValueTextInputFormat: the part of the line before the delimiter is the key and the rest is the value
      3) FixedLengthInputFormat: reads records of a fixed byte length; each fixed-length record is the value
      4) NLineInputFormat: each split (and therefore each Mapper) receives a fixed number N of input lines (see the sketch after this list)
      5) SequenceFileInputFormat: for binary sequence files

      There is also DBInputFormat, for reading rows from a relational database over JDBC.
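      For instance (a hedged sketch; the input path is illustrative), NLineInputFormat can be configured so that every Mapper receives exactly 100 input lines:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

      public class NLineJobSetup {
          public static void main(String[] args) throws Exception {
              Job job = Job.getInstance(new Configuration(), "nline-example");
              job.setJarByClass(NLineJobSetup.class);
              job.setInputFormatClass(NLineInputFormat.class);
              NLineInputFormat.addInputPath(job, new Path(args[0])); // input path from the command line
              NLineInputFormat.setNumLinesPerSplit(job, 100);        // each split (hence each Mapper) gets 100 lines
              // ... Mapper/Reducer and output configuration as usual ...
          }
      }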

      Follow the link to learn more about InputFormat in Hadoop
