What is InputFormat in Hadoop?
September 20, 2018 at 5:21 pm #6203
DataFlair Team
Explain what InputFormat is in Hadoop.
September 20, 2018 at 5:21 pm #6205
DataFlair Team
Every Mapper is mapped to one data block, which it processes.
There are two components between the block and the Mapper:
1) InputSplit: the logical representation of the data in the block (the block is the physical representation). By default, the size of an InputSplit is the same as the block size.
2) RecordReader: responsible for reading data record by record from the block; it then submits each record to the Mapper as a <Key, Value> pair.
InputFormat is the component responsible for creating the InputSplit and RecordReader components, i.e., for how the input files are split up and read.
To avoid splitting and process a whole file in a single Mapper, we override the isSplitable() method of the InputFormat to return false (see the sketch after the list below).
There are various types of InputFormat:
a) TextInputFormat: the default InputFormat of MapReduce. It treats each line of each input file as a separate record and performs no parsing.
- Key: the byte offset of the line within the file.
- Value: the contents of the line, excluding line terminators.
b) KeyValueTextInputFormat: similar to TextInputFormat in that it treats each line of input as a separate record. The difference is that TextInputFormat treats the entire line as the value, while KeyValueTextInputFormat breaks the line into key and value at the tab character ('\t').
- Key: everything up to the tab character.
- Value: the remaining part of the line after the tab character.
c) SequenceFileInputFormat: an InputFormat that reads sequence files.
- Key & Value: both are user-defined.
d) SequenceFileAsTextInputFormat: a variant of SequenceFileInputFormat that converts the sequence file's keys and values to Text objects. The conversion is performed by calling toString() on the keys and values. This InputFormat makes sequence files suitable input for Streaming.
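As mentioned above, splitting can be disabled by overriding isSplitable(). Here is a minimal sketch, assuming a hypothetical subclass named NonSplittableTextInputFormat (the class name is just for illustration):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical InputFormat that forces one Mapper per file by disabling splitting.
public class NonSplittableTextInputFormat extends TextInputFormat {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Returning false means the whole file becomes a single InputSplit,
        // so one Mapper processes the entire file.
        return false;
    }
}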
Follow the link to learn more about InputFormat in Hadoop
September 20, 2018 at 5:21 pm #6207
DataFlair Team
InputFormat is a class in the org.apache.hadoop.mapreduce package with the following two responsibilities:
1. To provide details on how to split an input file into splits.
2. To create a RecordReader that will generate the series of key/value pairs from a split.
The RecordReader creates key/value pairs from the input split and writes them to the Context, which is shared with the Mapper class. The Mapper class's run() method retrieves these key/value pairs from the context by calling the getCurrentKey() and getCurrentValue() methods and passes them on to the map() method for further processing of the record.
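To illustrate that flow, here is a rough sketch of a Mapper that spells out the run() loop described above (EchoMapper is a hypothetical name; the loop mirrors what the framework does by default):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Each key/value pair produced by the RecordReader is pulled from the
// Context and passed to map().
public class EchoMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        try {
            while (context.nextKeyValue()) {
                map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
        } finally {
            cleanup(context);
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // With TextInputFormat, key is the byte offset and value is one line.
        context.write(key, value);
    }
}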
There are mainly five types of InputFormat:
1) TextInputFormat: each line is treated as the value (the key is the byte offset of the line).
2) KeyValueTextInputFormat: the part of the line before the delimiter is the key and the rest is the value.
3) FixedLengthInputFormat: each fixed-length record is treated as the value.
4) NLineInputFormat: each split (and hence each Mapper) receives exactly N lines of input.
5) SequenceFileInputFormat: for binary sequence files.
There is also DBInputFormat for reading from relational databases.
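The InputFormat is selected in the job driver. A minimal sketch, assuming a hypothetical driver class InputFormatDriver and a comma separator chosen only for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class InputFormatDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // KeyValueTextInputFormat splits each line at the first separator byte;
        // the default is a tab, overridden here to a comma for illustration.
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");

        Job job = Job.getInstance(conf, "input-format-demo");
        job.setJarByClass(InputFormatDriver.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // ... mapper/reducer/output settings would follow here.
    }
}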
Follow the link to learn more about InputFormat in Hadoop