Explain Inputsplit in Hadoop?

Viewing 3 reply threads
  • Author
    Posts
    • #5704
      DataFlair Team
      Spectator

      What is Input Split in Hadoop MapReduce?
      What is default InputSplit size in Hadoop?
      What is InputSplit used for? Where it is created in Hadoop?

    • #5706
      DataFlair Team
      Spectator

      InputFormat creates the InputSplit, which is a logical representation of the data. Each split is further divided into records, and each record (a key-value pair) is then processed by the Mapper.

      InputSplit size:
      By default, the split size is approximately equal to the HDFS block size (128 MB). In a MapReduce program, the split size is user-configurable, so the user can control it based on the size of the data.
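
Concretely, FileInputFormat derives the split size from the block size and the configured minimum and maximum as max(minSize, min(maxSize, blockSize)). A standalone sketch of that formula (the real method is computeSplitSize() in org.apache.hadoop.mapreduce.lib.input.FileInputFormat):

```java
public class SplitSize {
    // Re-implementation of FileInputFormat's split-size formula.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // 128 MB HDFS block
        // With the default minimum (1 byte) and maximum (Long.MAX_VALUE),
        // the split size equals the block size:
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));
    }
}
```

Raising the minimum above the block size yields larger splits; lowering the maximum below it yields smaller ones.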

      An InputSplit has a length in bytes and a set of storage locations (hostname strings). The MapReduce system uses the storage locations to place map tasks as close to the split’s data as possible. Map tasks are processed in order of split size, largest first, which helps minimize the job runtime. Importantly, an InputSplit does not contain the input data; it is just a reference to the data.
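
A simplified, self-contained stand-in for that shape (the real abstract class org.apache.hadoop.mapreduce.InputSplit exposes getLength() and getLocations(); the names below are illustrative):

```java
import java.util.*;

public class SplitInfo {
    final long length;        // split length in bytes
    final String[] locations; // hostnames holding the underlying blocks

    SplitInfo(long length, String... locations) {
        this.length = length;
        this.locations = locations;
    }

    // Sorting by descending length mimics the "largest split first"
    // processing order described above.
    static List<SplitInfo> largestFirst(List<SplitInfo> splits) {
        List<SplitInfo> sorted = new ArrayList<>(splits);
        sorted.sort((a, b) -> Long.compare(b.length, a.length));
        return sorted;
    }

    public static void main(String[] args) {
        List<SplitInfo> splits = Arrays.asList(
            new SplitInfo(64, "host2"),
            new SplitInfo(128, "host1", "host3"));
        System.out.println(largestFirst(splits).get(0).length);
    }
}
```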

      The client running the job computes the splits by calling getSplits() and sends them to the Application Master, which uses their storage locations to schedule map tasks on the cluster. Each map task then passes its split to the createRecordReader() method of the InputFormat, which creates a RecordReader for that split. The RecordReader generates records (key-value pairs), which it passes to the map function.
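
The flow above can be mocked end to end in a few lines. This is a toy simulation, not the real Hadoop API: the method names loosely mirror getSplits() and the RecordReader, and a comma-delimited string stands in for an HDFS file.

```java
import java.util.*;

public class MiniPipeline {
    // "getSplits": chop the input into fixed-size logical ranges.
    // Each split is just (start, length) -- a reference, not the data.
    static List<int[]> getSplits(String data, int splitSize) {
        List<int[]> splits = new ArrayList<>();
        for (int start = 0; start < data.length(); start += splitSize)
            splits.add(new int[]{start, Math.min(splitSize, data.length() - start)});
        return splits;
    }

    // "RecordReader": turn one split's byte range into records.
    static List<String> readRecords(String data, int[] split) {
        String chunk = data.substring(split[0], split[0] + split[1]);
        return Arrays.asList(chunk.split(","));
    }

    public static void main(String[] args) {
        String data = "a,b,c,d";                 // pretend input file
        for (int[] split : getSplits(data, 4))   // client computes splits
            for (String record : readRecords(data, split))
                System.out.println(record);      // record handed to map()
    }
}
```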

      Follow the link to learn more about InputSplit in Hadoop

    • #5707
      DataFlair Team
      Spectator

      InputSplit in Hadoop is a logical partition of the data, and each split is processed by one Mapper. For large data, small physical partitions would require a very large number of mapper executions, so the data is partitioned logically into larger chunks. The block size was originally 64 MB; nowadays 128 MB blocks are used in Hadoop. This larger size makes replication and processing of the data more efficient.

      Follow the link to learn more about InputSplit in Hadoop

    • #5709
      DataFlair Team
      Spectator

      An InputSplit is a unit of work that constitutes a single map task in MapReduce. By default it is the same size as a block in HDFS.
      We can control this value by setting the mapred.min.split.size parameter (named mapreduce.input.fileinputformat.split.minsize in Hadoop 2 and later) in the configuration, e.g. hadoop-site.xml.
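
A sketch of such a configuration fragment, assuming the current (Hadoop 2+) property name; the value raises the minimum split size to 256 MB:

```xml
<!-- Config fragment: force splits of at least 256 MB (268435456 bytes). -->
<property>
  <name>mapreduce.input.fileinputformat.split.minsize</name>
  <value>268435456</value>
</property>
```

The same setting can also be made per job from the driver via FileInputFormat.setMinInputSplitSize(job, size).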

      Hadoop uses a logical representation of the data stored in file blocks, known as input splits.

      When a MapReduce client computes the input splits, it figures out where the first whole record in a block begins and ends. When the last record in a block is incomplete, the input split includes location information for the next block, along with the data needed to complete the record.
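
This boundary handling can be sketched for newline-delimited records (conceptually what Hadoop's LineRecordReader does; this standalone code is an illustration, not the real implementation): a reader that does not start at offset 0 skips the partial record it lands in, because the previous split owns it, and reads past its own end to finish its last record.

```java
import java.util.*;

public class BoundaryReader {
    // Read the records belonging to the split [start, end) of `data`.
    static List<String> readSplit(String data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        // Not at the file start: skip forward past the partial record,
        // which the previous split is responsible for completing.
        if (start != 0) {
            int nl = data.indexOf('\n', start);
            if (nl < 0) return records;
            pos = nl + 1;
        }
        // Emit every record whose first byte lies inside [start, end),
        // reading past `end` if needed to complete the last one.
        while (pos < end && pos < data.length()) {
            int nl = data.indexOf('\n', pos);
            String rec = (nl < 0) ? data.substring(pos) : data.substring(pos, nl);
            records.add(rec);
            if (nl < 0) break;
            pos = nl + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "one\ntwo\nthree\n";
        // A boundary at byte 5 falls inside "two": the first split reads
        // past it to finish the record, the second skips the partial tail.
        System.out.println(readSplit(data, 0, 5));             // [one, two]
        System.out.println(readSplit(data, 5, data.length())); // [three]
    }
}
```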
