What is InputFormat in Hadoop?
September 20, 2018 at 5:21 pm #6203
Explain what InputFormat is.
September 20, 2018 at 5:21 pm #6205
Each Mapper processes one InputSplit, which by default corresponds to one data block.
There are two components between the block and the Mapper:
1) InputSplit: the logical representation of the data in a block (the block itself is the physical representation). By default, the size of an InputSplit is the same as the block size.
2) RecordReader: it reads an InputSplit and converts it into the key/value pairs that are passed to the Mapper.
InputFormat is the component responsible for creating the InputSplit and RecordReader components, i.e. it defines how the input files are split up and read.
To avoid splitting and process a whole file in a single Mapper, override the isSplitable() method of the InputFormat (it is defined in FileInputFormat) so that it returns false.
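To make the relationship between file size, split size and the number of Mappers concrete, here is a minimal plain-Java sketch. It is not Hadoop code: SplitSketch and computeSplits are hypothetical names, the 128 MB block size is an assumed value, and the 1.1 "slop" factor reflects my understanding of how FileInputFormat lets the last split run slightly over the split size.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of how a file-based InputFormat might carve a file into splits. */
final class SplitSketch {
    // Assumption: the final split may be up to 10% larger than the split size.
    static final double SPLIT_SLOP = 1.1;

    /** Returns one {offset, length} pair per split. */
    static List<long[]> computeSplits(long fileLength, long splitSize, boolean splittable) {
        if (fileLength <= 0 || splitSize <= 0) {
            throw new IllegalArgumentException("fileLength and splitSize must be positive");
        }
        List<long[]> splits = new ArrayList<>();
        if (!splittable) {
            // Same effect as isSplitable() returning false: one split, one Mapper.
            splits.add(new long[] {0L, fileLength});
            return splits;
        }
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {fileLength - remaining, splitSize});
            remaining -= splitSize;
        }
        if (remaining != 0) {
            // Whatever is left becomes the last split.
            splits.add(new long[] {fileLength - remaining, remaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 128 * mb;   // assumed block size
        long fileLength = 300 * mb;  // example file
        for (long[] s : computeSplits(fileLength, blockSize, true)) {
            System.out.println("split offset=" + s[0] + " length=" + s[1]);
        }
        System.out.println("splits when not splittable: "
                + computeSplits(fileLength, blockSize, false).size());
    }
}
```

With these assumed numbers, a 300 MB file yields three splits (128 MB, 128 MB, 44 MB), so three Mappers; with splitting disabled it yields a single split covering the whole file.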
There are various types of InputFormat:
a) TextInputFormat: the default InputFormat of MapReduce. It treats each line of each input file as a separate record and performs no parsing.
- Key: the byte offset of the line within the file.
- Value: the contents of the line, excluding line terminators.
b) KeyValueTextInputFormat: similar to TextInputFormat in that it treats each line of input as a separate record. The difference is that TextInputFormat treats the entire line as the value, while KeyValueTextInputFormat splits the line into a key and a value at the first tab character ('\t') by default.
- Key: everything up to the first tab character.
- Value: the remaining part of the line after the tab character.
c) SequenceFileInputFormat: the InputFormat that reads sequence files.
- Key & Value: both are user-defined.
d) SequenceFileAsTextInputFormat: a variant of SequenceFileInputFormat that converts the sequence file's keys and values to Text objects by calling toString() on them. This makes sequence files suitable input for streaming.
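The difference between the first two formats is easier to see on actual data. The following is a rough plain-Java sketch, not the Hadoop implementation: RecordSketch, textRecords and keyValue are hypothetical names, the input is assumed to be UTF-8 text with '\n' line endings, and the behaviour for a line without a tab (whole line becomes the key, value is empty) follows my understanding of KeyValueTextInputFormat.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of the records produced by the two text-based formats. */
final class RecordSketch {

    /** TextInputFormat style: key = byte offset of the line start, value = the line. */
    static List<String[]> textRecords(String data) {
        List<String[]> records = new ArrayList<>();
        if (data.isEmpty()) {
            return records;
        }
        long offset = 0;
        for (String line : data.split("\n")) {
            records.add(new String[] {Long.toString(offset), line});
            // Advance past the line's bytes plus the one-byte '\n' terminator.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    /** KeyValueTextInputFormat style: split the line at the first tab. */
    static String[] keyValue(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            // No separator: the whole line is the key and the value is empty.
            return new String[] {line, ""};
        }
        return new String[] {line.substring(0, tab), line.substring(tab + 1)};
    }

    public static void main(String[] args) {
        String data = "user1\tlogin\nuser2\tlogout\n";
        for (String[] rec : textRecords(data)) {
            String[] kv = keyValue(rec[1]);
            System.out.println("Text: key=" + rec[0] + " | KeyValueText: key=" + kv[0]
                    + " value=" + kv[1]);
        }
    }
}
```

For the same line, the first format hands the Mapper a numeric offset and the full line, while the second hands it "user1" as the key and "login" as the value.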
September 20, 2018 at 5:21 pm #6207
InputFormat is an abstract class in the org.apache.hadoop.mapreduce package with the two responsibilities below.
1. To define how the input files are divided into splits.
2. To create a RecordReader that generates the series of key/value pairs from a split.
After this, the RecordReader creates key/value pairs from its input split and exposes them through the Context that is passed to the Mapper class. The Mapper class's run() method advances through the records with nextKeyValue(), retrieves each pair from the Context by calling getCurrentKey() and getCurrentValue(), and passes it to the map() method for further processing.
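The read loop described above can be sketched in plain Java. This is a simplified model, not the Hadoop classes: SimpleRecordReader and LineReader are hypothetical stand-ins that keep only the three method names (nextKeyValue, getCurrentKey, getCurrentValue), and the in-memory list stands in for one split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Plain-Java model of the loop that feeds records to map(). */
final class RunLoopSketch {

    /** Hypothetical stand-in for a record reader; only what the loop needs. */
    interface SimpleRecordReader<K, V> {
        boolean nextKeyValue();
        K getCurrentKey();
        V getCurrentValue();
    }

    /** Reads an in-memory "split": key = line index, value = the line. */
    static final class LineReader implements SimpleRecordReader<Long, String> {
        private final Iterator<String> lines;
        private long key = -1;
        private String value;

        LineReader(List<String> split) {
            this.lines = split.iterator();
        }

        public boolean nextKeyValue() {
            if (!lines.hasNext()) {
                return false;
            }
            key++;
            value = lines.next();
            return true;
        }

        public Long getCurrentKey() {
            return key;
        }

        public String getCurrentValue() {
            return value;
        }
    }

    /** Pulls records until the reader is exhausted and hands each one to map(). */
    static List<String> run(SimpleRecordReader<Long, String> reader) {
        List<String> output = new ArrayList<>();
        while (reader.nextKeyValue()) {
            output.add(map(reader.getCurrentKey(), reader.getCurrentValue()));
        }
        return output;
    }

    /** Example map step: tag each upper-cased line with its key. */
    static String map(Long key, String value) {
        return key + ":" + value.toUpperCase();
    }

    public static void main(String[] args) {
        System.out.println(run(new LineReader(Arrays.asList("alpha", "beta"))));
    }
}
```

The point of the design is that map() never touches the file: it only sees whatever key/value pairs the reader chose to produce, which is why swapping the InputFormat changes what the Mapper receives without changing the Mapper itself.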
There are mainly 5 types of InputFormat:
1) TextInputFormat: each line is treated as a value.
2) KeyValueTextInputFormat: the part of the line before the delimiter is the key and the rest is the value.
3) FixedLengthInputFormat: each fixed-length record is treated as a value.
4) NLineInputFormat: each split (and therefore each Mapper) receives N lines of input; every line is still a separate record.
5) SequenceFileInputFormat: for binary sequence files.
There is also DBInputFormat, which reads from databases.
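As far as I know, NLineInputFormat groups the input so that each split holds N lines, while each line is still delivered as its own record. A small plain-Java sketch of that grouping, with the hypothetical names NLineSketch and splitByLines and an in-memory list standing in for the input file:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java model of grouping input lines into N-line splits. */
final class NLineSketch {

    /** Partitions the lines into consecutive groups of at most n lines. */
    static List<List<String>> splitByLines(List<String> lines, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<List<String>> splits = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += n) {
            int end = Math.min(start + n, lines.size());
            // Copy the sub-range so each split is an independent list.
            splits.add(new ArrayList<>(lines.subList(start, end)));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("l1", "l2", "l3", "l4", "l5");
        List<List<String>> splits = splitByLines(lines, 2);
        for (int i = 0; i < splits.size(); i++) {
            System.out.println("split " + i + " -> " + splits.get(i));
        }
    }
}
```

With five lines and N = 2 this gives three splits, the last one holding a single line, so three Mappers would run.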