What are Input Format, Input Split & Record Reader and what they do?


  • Author
    Posts
    • #5184
      DataFlair Team
      Spectator

      What is the role of InputFormat, InputSplit and RecordReader? Why are they needed? Explain these internal components of the MapReduce engine.

    • #5188
      DataFlair Team
      Spectator

      InputFormat:
      In Hadoop, input files store the data for a MapReduce job; they typically reside in HDFS. InputFormat defines how these input files are split up and read, and it is responsible for creating the InputSplits.
      The most common InputFormats are:
      1) FileInputFormat – The base class for all file-based InputFormats. It specifies the input directory where the data files live, reads all the files in it, and divides them into one or more InputSplits.
      2) TextInputFormat – The default InputFormat of MapReduce. It treats each line of each input file as a separate record and performs no parsing. Key: the byte offset of the line. Value: the contents of the line, excluding line terminators.
      3) KeyValueTextInputFormat – Similar to TextInputFormat in that it treats each line of input as a separate record. The difference is that TextInputFormat treats the entire line as the value, while KeyValueTextInputFormat breaks the line into a key and a value at the tab character (‘\t’). Key: everything up to the first tab. Value: the remainder of the line after the tab.
      4) SequenceFileInputFormat – The InputFormat that reads sequence files. Key and value are both user-defined.
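      As a rough illustration of the difference between the TextInputFormat and KeyValueTextInputFormat record shapes, here is a small Python sketch (not the Hadoop API; the function names are made up for illustration):

```python
def text_input_records(data: bytes):
    """TextInputFormat-style records: key = byte offset of the line,
    value = the line's contents without the line terminator."""
    offset = 0
    for line in data.splitlines(keepends=True):
        yield offset, line.rstrip(b"\r\n")
        offset += len(line)

def key_value_text_records(data: bytes, separator: bytes = b"\t"):
    """KeyValueTextInputFormat-style records: each line is broken into a
    key and a value at the first tab character."""
    for _, line in text_input_records(data):
        key, _sep, value = line.partition(separator)
        yield key, value  # if a line has no tab, the whole line is the key
```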

      InputSplit:
      InputFormat creates the InputSplits, which are logical representations of the data. Each split is further divided into records, and each record (a key-value pair) is processed by the Mapper.

      By default, the split size is approximately equal to the HDFS block size (128 MB). The split size is user-configurable in a MapReduce program, so the user can tune it to the size of the data.
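      The default split arithmetic can be sketched as follows (illustrative only; Hadoop’s real FileInputFormat.getSplits() also honors configurable minimum and maximum split sizes):

```python
def compute_splits(file_size: int, block_size: int = 128 * 1024 * 1024):
    """Return (offset, length) pairs: one split per block-sized chunk,
    with a shorter final split if the file size is not a multiple."""
    splits = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits
```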

      An InputSplit in MapReduce has a length in bytes and a set of storage locations (hostname strings). The MapReduce system uses the storage locations to place map tasks as close to the split’s data as possible. Map tasks are processed in order of split size, largest first, which minimizes the job runtime. Importantly, an InputSplit does not contain the input data itself; it is just a reference to the data.

      By calling getSplits(), the client calculates the splits for the job and sends them to the ApplicationMaster, which uses their storage locations to schedule map tasks on the cluster. Each map task passes its split to the createRecordReader() method on the InputFormat to obtain a RecordReader for that split. The RecordReader then generates records (key-value pairs), which it passes to the map function.
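      The scheduling behavior described above — splits carry storage locations, and the largest splits are scheduled first — could be sketched like this (hypothetical names and data layout, not the Hadoop API):

```python
def schedule_map_tasks(splits):
    """Order splits largest-first (so the biggest split starts earliest,
    shortening the job's tail) and pick each split's first stored
    location as the preferred host for its map task."""
    ordered = sorted(splits, key=lambda s: s["length"], reverse=True)
    return [(s["offset"], s["hosts"][0] if s["hosts"] else "any")
            for s in ordered]
```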

      RecordReader:
      RecordReader uses the data within the boundaries defined by the InputSplit and creates the key-value pairs for the mapper. The “start” is the byte position in the file where the RecordReader begins generating key-value pairs, and the “end” is where it stops reading records. The RecordReader loads data from its source and converts it into key-value pairs suitable for reading by the mapper; it communicates with the InputSplit until the entire split has been read. The MapReduce framework obtains the RecordReader instance from the InputFormat.
      By default, the framework uses TextInputFormat to convert data into key-value pairs. (For sequence files, the key and value types are specified by the header of the sequence file.)
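      The start/end boundary behavior can be sketched as follows (an assumption modeled on how Hadoop’s line-based readers behave: a reader whose split does not begin at byte 0 skips the first partial line, because the previous split’s reader reads past its own end to finish that line):

```python
def read_split_lines(data: bytes, start: int, end: int):
    """Yield (byte_offset, line) pairs for the lines belonging to the
    byte range [start, end)."""
    pos = start
    if start != 0:
        # Skip the partial line; it belongs to the previous split's reader.
        newline = data.find(b"\n", start)
        pos = newline + 1 if newline != -1 else len(data)
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        if newline == -1:
            yield pos, data[pos:]   # last line has no terminator
            break
        yield pos, data[pos:newline]
        pos = newline + 1
```

      Note that the reader for the first split reads past its own end byte to finish the line it started, while the reader for the next split skips that same line — so every line is read exactly once.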

      For more detail, follow the links below:
      InputFormat in Hadoop
      InputSplit in Hadoop
      RecordReader in Hadoop
