What is Input Split in Hadoop MapReduce?

    • #6287
      DataFlair Team
      Spectator

      Explain InputSplit in Hadoop.
      What is the default InputSplit size in Hadoop?
      What is InputSplit used for, and where is it created in Hadoop?

    • #6288
      DataFlair Team
      Spectator

      The InputSplit is a logical representation of the data on which a single mapper task will be executed.

      The mapper task reads the records of its InputSplit from the first record to the last. If the last record in the split is incomplete, the split also stores the location of the next block and the extra data needed to complete that record.

      The default split size is the size of an HDFS block.
      We can control this value by setting the mapred.min.split.size parameter in hadoop-site.xml.
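
      As a minimal sketch (not part of the original answer), the same thing can be done per job from the driver with the newer MapReduce API, where the property mapreduce.input.fileinputformat.split.minsize supersedes mapred.min.split.size; the 256 MB / 512 MB values below are only examples:

        // Hedged example: setting the minimum/maximum split size for one job.
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

        public class SplitSizeDemo {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                // Option 1: set the property directly (value is in bytes).
                conf.setLong("mapreduce.input.fileinputformat.split.minsize", 256L * 1024 * 1024);

                Job job = Job.getInstance(conf, "split-size-demo");
                // Option 2: use the FileInputFormat helper methods.
                FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);
                FileInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024);
            }
        }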

      Number of InputSplits = number of mapper tasks.
      Splits are created by the InputFormat when the job is submitted: the InputFormat defines the way the input file will be divided into splits.
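
      To make the "splits = mappers" point concrete, here is a small sketch (not from the original answer) of the rule FileInputFormat uses to compute the split size; with the default minimum and maximum values the split size equals the HDFS block size, so the number of splits, blocks and map tasks all match for a plain text file:

        // Mirrors the rule in FileInputFormat.computeSplitSize(): the split
        // size is the block size, clamped between the configured minimum and
        // maximum split sizes.
        class SplitSizeRule {
            static long computeSplitSize(long blockSize, long minSize, long maxSize) {
                return Math.max(minSize, Math.min(maxSize, blockSize));
            }
        }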

      Follow the link for more detail: InputSplit in hadoop

    • #6289
      DataFlair Team
      Spectator

      Let’s take an example: we have a JSON file of 300 MB and the block size is 128 MB, so we would have three blocks (128 MB, 128 MB, 44 MB), as the quick calculation below shows.
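
      A quick sanity check of that block arithmetic (illustrative snippet, not from the original post):

        class BlockMath {
            public static void main(String[] args) {
                long fileSize   = 300L * 1024 * 1024;   // 300 MB JSON file
                long blockSize  = 128L * 1024 * 1024;   // default HDFS block size
                long fullBlocks = fileSize / blockSize; // 2 full blocks of 128 MB
                long remainder  = fileSize % blockSize; // 44 MB left for the third block
                System.out.println(fullBlocks + " full blocks + "
                        + remainder / (1024 * 1024) + " MB remainder = 3 blocks");
            }
        }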
      We know that a file is stored as blocks in HDFS. A record in JSON or XML can span multiple lines, and while splitting the file into blocks HDFS does not know where a record starts or ends, so there is a high chance that a record will be divided into two parts. If we process an incomplete record we won’t get the right result.

      The mapper’s record reader has a feature that solves this issue. It checks the InputSplit and takes the data from the next split to complete the incomplete last record; similarly, it ignores the incomplete record at the start of its input block, because that record is completed by the previous mapper.
      (Input block) + (data required to complete the last record) - (incomplete first record) = (Input split)
      The input split is calculated at run time for a file whose records can span multiple lines. For single-line records the size of the input split will be the same as that of the input block.
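
      A simplified sketch of that behaviour (illustrative only, not the actual Hadoop source): every split except the first skips the partial record at its start, and a reader may read past its split’s end to finish a record it has already started.

        import java.io.IOException;

        abstract class BoundaryAwareReader {
            long splitStart;   // byte offset where this split begins
            long splitEnd;     // byte offset where this split ends
            long pos;          // current read position in the underlying stream

            void initialize() throws IOException {
                pos = splitStart;
                if (splitStart != 0) {
                    // Not the first split: the record that begins before splitStart
                    // belongs to the previous split, so skip ahead to the next
                    // record boundary (e.g. the next newline).
                    pos += skipToNextRecordStart();
                }
            }

            boolean nextRecord() throws IOException {
                // Stop only when the *start* of a record falls beyond splitEnd.
                // A record that starts before splitEnd may be completed by reading
                // into the next block, which HDFS serves transparently.
                if (pos >= splitEnd) {
                    return false;
                }
                pos += readOneRecord();
                return true;
            }

            abstract long skipToNextRecordStart() throws IOException;
            abstract long readOneRecord() throws IOException;
        }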

      Follow the link for more detail: InputSplit in hadoop
