What is Input Split in hadoop?

    • #5185
DataFlair Team
      Spectator

      Explain InputSplit. What is the need for an InputSplit? Where is an InputSplit created?

    • #5187
      DataFlair Team
      Spectator

      An InputSplit is created by the logical division of the input data and serves as the input to a single Mapper (one map task). Blocks, on the other hand, are created by the physical division of the data. One input split can span multiple physical blocks.

      The basic need for input splits is to feed the correct logical locations of the data to each Mapper, so that a Mapper can process a complete set of records even when those records are spread over more than one block. When a job is submitted, Hadoop logically divides the input data into input splits, and each split is processed by one Mapper. The number of Mappers is therefore equal to the number of input splits created.

      InputFormat.getSplits() is responsible for generating the input splits; each split is then used as the input to one map task.
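      As a minimal sketch (not from the original answer; the input path /data/input and the class name SplitInspector are hypothetical), the standard Hadoop MapReduce API can be used to print the splits that TextInputFormat.getSplits() generates for a job:

          import java.util.List;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.mapreduce.InputSplit;
          import org.apache.hadoop.mapreduce.Job;
          import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
          import org.apache.hadoop.mapreduce.lib.input.FileSplit;
          import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

          public class SplitInspector {
              public static void main(String[] args) throws Exception {
                  Job job = Job.getInstance(new Configuration(), "split-inspector");
                  // Hypothetical input path; replace with your own HDFS directory.
                  FileInputFormat.addInputPath(job, new Path("/data/input"));

                  // Ask the InputFormat for the logical splits it would hand to mappers.
                  List<InputSplit> splits = new TextInputFormat().getSplits(job);
                  System.out.println("Splits (= map tasks): " + splits.size());
                  for (InputSplit split : splits) {
                      FileSplit fs = (FileSplit) split;
                      System.out.printf("%s start=%d length=%d hosts=%s%n",
                          fs.getPath(), fs.getStart(), fs.getLength(),
                          String.join(",", fs.getLocations()));
                  }
              }
          }

      Each FileSplit reports a file, a byte range, and the hosts holding the underlying blocks, which is exactly the location information the framework uses to schedule map tasks.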


    • #5189
      DataFlair Team
      Spectator

      The Hadoop framework divides a large file into blocks (64 MB or 128 MB by default) and stores them on the slave nodes. HDFS is unaware of the content of a block, so while writing data it can happen that a record crosses the block boundary: part of the record is written to one block and the rest to the next.
      The way Hadoop tracks this split of data is through the logical representation of the data known as an input split. When the MapReduce client calculates the input splits, it checks whether the entire record resides in the same block. If a record overruns the boundary and part of it is written into another block, the input split captures the location information of the next block and the byte offset of the data needed to complete the record. This usually comes up with multi-line records; Hadoop is intelligent enough to handle the single-line record scenario on its own.
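      To make the boundary handling concrete, here is a self-contained sketch (a simplified illustration only, not Hadoop's actual LineRecordReader): a reader whose split does not start at byte 0 skips its first partial line, and every reader keeps consuming past its split's end until it finishes the record it has started, so each record lands in exactly one mapper:

          import java.nio.charset.StandardCharsets;
          import java.util.ArrayList;
          import java.util.List;

          public class BoundaryDemo {
              // Read the lines assigned to the split [start, end): skip the first
              // partial line unless the split starts at 0, and read past 'end' to
              // finish the last line that begins inside the split.
              static List<String> readSplit(byte[] data, int start, int end) {
                  List<String> lines = new ArrayList<>();
                  int pos = start;
                  if (start != 0) {                       // skip the partial first line
                      while (pos < data.length && data[pos] != '\n') pos++;
                      pos++;                              // move past the newline
                  }
                  while (pos < end && pos < data.length) {
                      int lineStart = pos;
                      while (pos < data.length && data[pos] != '\n') pos++;
                      lines.add(new String(data, lineStart, pos - lineStart,
                          StandardCharsets.UTF_8));
                      pos++;                              // consume the newline
                  }
                  return lines;
              }

              public static void main(String[] args) {
                  byte[] data = "alpha\nbravo\ncharlie\ndelta\n"
                      .getBytes(StandardCharsets.UTF_8);
                  int boundary = 8;   // the "block" boundary falls inside "bravo"
                  System.out.println(readSplit(data, 0, boundary));           // [alpha, bravo]
                  System.out.println(readSplit(data, boundary, data.length)); // [charlie, delta]
              }
          }

      Note that the first reader finishes "bravo" even though the boundary cuts it in half, and the second reader skips that same partial line, so nothing is duplicated or lost.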
      Usually the input split size is configured to be the same as the block size, but consider what happens if the input split is larger than the block size. The input split represents the amount of data that goes to one mapper. Consider the example below:
      • Input split = 256 MB
      • Block size = 128 MB
      Then one mapper will process two blocks, which can reside on different machines. This means that to process the split, the mapper has to transfer data between machines. Hence, to avoid unnecessary data movement (i.e., to preserve data locality), we usually keep the input split size the same as the block size.
      Even with the data locality optimization, some data still travels over the network when a record spans a block boundary, which is an overhead; this data transfer is small and temporary.
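      For reference, FileInputFormat computes the effective split size as max(minSize, min(maxSize, blockSize)). A minimal sketch (the class name SplitSizeConfig is hypothetical) of forcing the 256 MB case above, assuming 128 MB blocks:

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.mapreduce.Job;
          import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

          public class SplitSizeConfig {
              public static void main(String[] args) throws Exception {
                  Job job = Job.getInstance(new Configuration(), "split-size-demo");

                  // splitSize = max(minSize, min(maxSize, blockSize)).
                  // Raising minSize above the 128 MB block size forces 256 MB splits,
                  // so one map task reads two blocks, possibly from different nodes.
                  FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);

                  // Leaving min/max at their defaults keeps split size equal to the
                  // block size, preserving data locality for each map task.
              }
          }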
