What are Input Format, Input Split & Record Reader and what they do?


  • Author
    Posts
    • #5184
      DataFlair Team
      Spectator

      What is the role of InputFormat, InputSplit and RecordReader? Why are they needed? Explain these internal components of the MapReduce engine.

    • #5188
      DataFlair Team
      Spectator

      InputFormat:
      In Hadoop, input files store the data for a MapReduce job; they typically reside in HDFS. InputFormat defines how these input files are split up and read, and it is responsible for creating the InputSplits.
      The most common InputFormats are:
      1) FileInputFormat – The base class for all file-based InputFormats. It specifies the input directory where the data files live, reads all the files in it, and divides them into one or more InputSplits.
      2) TextInputFormat – The default InputFormat of MapReduce. It treats each line of each input file as a separate record and performs no parsing. Key: the byte offset of the line. Value: the contents of the line, excluding line terminators.
      3) KeyValueTextInputFormat – Similar to TextInputFormat in that it treats each line of input as a separate record. The difference is that TextInputFormat treats the entire line as the value, while KeyValueTextInputFormat breaks the line into a key and a value at the tab character (‘\t’). Key: everything up to the first tab. Value: the remainder of the line after the tab.
      4) SequenceFileInputFormat – The InputFormat that reads sequence files. Key and value are both user-defined.
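      As a rough illustration of the difference between the TextInputFormat and KeyValueTextInputFormat record shapes, here is a small Python sketch (not the Hadoop API; the function names are made up for illustration):

```python
def text_input_records(data: bytes):
    """TextInputFormat-style records: key = byte offset of the line,
    value = the line's contents without the line terminator."""
    offset = 0
    for line in data.splitlines(keepends=True):
        yield offset, line.rstrip(b"\r\n")
        offset += len(line)

def key_value_text_records(data: bytes, separator: bytes = b"\t"):
    """KeyValueTextInputFormat-style records: each line is broken into a
    key and a value at the first tab character."""
    for _, line in text_input_records(data):
        key, _sep, value = line.partition(separator)
        yield key, value  # if a line has no tab, the whole line is the key
```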

      InputSplit:
      InputFormat creates the InputSplits, which are logical representations of the data. Each split is further divided into records, and each record (a key-value pair) is processed by the Mapper.

      By default, the split size is approximately equal to the HDFS block size (128 MB). The split size is user-configurable in a MapReduce program, so the user can tune it to the size of the data.
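      The default split arithmetic can be sketched as follows (illustrative only; Hadoop’s real FileInputFormat.getSplits() also honors configurable minimum and maximum split sizes):

```python
def compute_splits(file_size: int, block_size: int = 128 * 1024 * 1024):
    """Return (offset, length) pairs: one split per block-sized chunk,
    with a shorter final split if the file size is not a multiple."""
    splits = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits
```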

      An InputSplit in MapReduce has a length in bytes and a set of storage locations (hostname strings). The MapReduce system uses the storage locations to place map tasks as close to the split’s data as possible. Map tasks are processed in order of split size, largest first, which minimizes the job runtime. Importantly, an InputSplit does not contain the input data itself; it is just a reference to the data.

      By calling getSplits(), the client calculates the splits for the job and sends them to the ApplicationMaster, which uses their storage locations to schedule map tasks on the cluster. Each map task passes its split to the createRecordReader() method on the InputFormat to obtain a RecordReader for that split. The RecordReader then generates records (key-value pairs), which it passes to the map function.
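      The scheduling behavior described above — splits carry storage locations, and the largest splits are scheduled first — could be sketched like this (hypothetical names and data layout, not the Hadoop API):

```python
def schedule_map_tasks(splits):
    """Order splits largest-first (so the biggest split starts earliest,
    shortening the job's tail) and pick each split's first stored
    location as the preferred host for its map task."""
    ordered = sorted(splits, key=lambda s: s["length"], reverse=True)
    return [(s["offset"], s["hosts"][0] if s["hosts"] else "any")
            for s in ordered]
```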

      RecordReader:
      RecordReader uses the data within the boundaries defined by the InputSplit and creates the key-value pairs for the mapper. The “start” is the byte position in the file where the RecordReader begins generating key-value pairs, and the “end” is where it stops reading records. The RecordReader loads data from its source and converts it into key-value pairs suitable for reading by the mapper; it communicates with the InputSplit until the entire split has been read. The MapReduce framework obtains the RecordReader instance from the InputFormat.
      By default, the framework uses TextInputFormat to convert data into key-value pairs. (For sequence files, the key and value types are specified by the header of the sequence file.)
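      The start/end boundary behavior can be sketched as follows (an assumption modeled on how Hadoop’s line-based readers behave: a reader whose split does not begin at byte 0 skips the first partial line, because the previous split’s reader reads past its own end to finish that line):

```python
def read_split_lines(data: bytes, start: int, end: int):
    """Yield (byte_offset, line) pairs for the lines belonging to the
    byte range [start, end)."""
    pos = start
    if start != 0:
        # Skip the partial line; it belongs to the previous split's reader.
        newline = data.find(b"\n", start)
        pos = newline + 1 if newline != -1 else len(data)
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        if newline == -1:
            yield pos, data[pos:]   # last line has no terminator
            break
        yield pos, data[pos:newline]
        pos = newline + 1
```

      Note that the reader for the first split reads past its own end byte to finish the line it started, while the reader for the next split skips that same line — so every line is read exactly once.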

      For more detail, follow the links below:
      InputFormat in Hadoop
      InputSplit in Hadoop
      RecordReader in Hadoop
