Apache Hadoop Forum › What is InputSplit in Hadoop?
This topic has 2 replies, 1 voice, and was last updated 5 years, 6 months ago by DataFlair Team.
September 20, 2018 at 2:31 pm #5185 — DataFlair Team (Spectator)
Please explain InputSplit. What is the need for an InputSplit, and where is it created?
September 20, 2018 at 2:31 pm #5187 — DataFlair Team (Spectator)
An InputSplit is created by a logical division of the data and serves as the input to a single Mapper. Blocks, on the other hand, are created by a physical division of the data. One InputSplit can span multiple physical blocks.
The basic need for InputSplits is to supply each Mapper with the correct logical locations of its data, so that every Mapper can process a complete set of records even when they are spread over more than one block. When a job is submitted, Hadoop splits the input data logically into InputSplits, and each split is processed by one Mapper. The number of Mappers is therefore equal to the number of InputSplits created.
InputFormat.getSplits() is responsible for generating the InputSplits; each split then becomes the input of one map task.
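To make the split/mapper relationship concrete, here is a simplified Python sketch of how a FileInputFormat-style getSplits() carves a file into (offset, length) pairs. This is not the actual Hadoop code (which also weighs minimum/maximum split sizes and block host locations); it is only an illustration of the arithmetic.

```python
def compute_splits(file_size, block_size, min_size=1, max_size=float("inf")):
    """Simplified sketch of FileInputFormat-style split sizing:
    splitSize = max(minSize, min(maxSize, blockSize))."""
    split_size = max(min_size, min(max_size, block_size))
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        splits.append((offset, length))  # (start byte, length) of one logical split
        offset += length
    return splits

MB = 1024 * 1024
# A 300 MB file with 128 MB blocks yields 3 splits, hence 3 mappers.
splits = compute_splits(300 * MB, 128 * MB)
print(len(splits))           # 3
print(splits[-1])            # last split covers the remaining 44 MB
```

Note that the number of splits (and so the number of mappers) follows directly from the file size and the effective split size.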
September 20, 2018 at 2:31 pm #5189 — DataFlair Team (Spectator)
The Hadoop framework divides a large file into blocks (64 MB or 128 MB) and stores them on the slave nodes. HDFS is unaware of the contents of a block, so while writing data it can happen that a record crosses the block boundary: part of the record is written to one block and the rest to another.
The way Hadoop tracks this split of data is through a logical representation of the data known as an InputSplit. When the MapReduce client calculates the InputSplits, it checks whether each record resides entirely in one block. If a record overflows and part of it is written into another block, the InputSplit captures the location of the next block and the byte offset of the data needed to complete the record. This usually matters for multi-line records, as Hadoop is intelligent enough to handle the single-line-record scenario.
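The boundary handling described above can be sketched in Python. This mimics the behaviour of Hadoop's line-based record reader on a byte buffer: a reader that does not start at byte 0 skips forward to the next record boundary (the previous split's reader owns that partial record), and every reader reads past its split's end to finish its last record. It is a simplified illustration, not the real LineRecordReader.

```python
def read_split_records(data: bytes, start: int, length: int):
    """Return the newline-delimited records owned by the split
    [start, start+length), LineRecordReader-style."""
    pos = start
    if start != 0:
        # Skip the (possibly partial) first line; the previous split's
        # reader is responsible for completing that record.
        nl = data.find(b"\n", pos)
        pos = len(data) if nl == -1 else nl + 1
    records = []
    end = start + length
    while pos < end and pos < len(data):
        nl = data.find(b"\n", pos)
        if nl == -1:
            records.append(data[pos:])
            pos = len(data)
        else:
            records.append(data[pos:nl])  # may read past `end` to finish a record
            pos = nl + 1
    return records

data = b"aaaa\nbbbb\ncccc\ndddd\n"
# A split boundary at byte 12 falls inside "cccc": the first reader keeps
# the whole record, and the second skips the fragment it starts in.
print(read_split_records(data, 0, 12))   # [b'aaaa', b'bbbb', b'cccc']
print(read_split_records(data, 12, 8))   # [b'dddd']
```

Every record is processed exactly once even though the split boundary cuts one record in half.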
Usually, the InputSplit size is configured to be the same as the block size, but consider what happens if the InputSplit is larger than the block. The InputSplit represents the amount of data that goes to one Mapper. Consider the example below:
• Input split = 256 MB
• Block size = 128 MB
Then the Mapper will process two blocks, which may reside on different machines. That means the Mapper has to transfer data between machines before it can process the blocks. Hence, to avoid this unnecessary data movement (i.e. to preserve data locality), we usually keep the InputSplit size equal to the block size.
Even with the data-locality optimization, some data travels over the network, which is an overhead; this data transfer is only temporary.
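The overhead in the example above is easy to quantify. The sketch below assumes the best case where exactly one of the split's blocks is local to the mapper, so every other block must be fetched over the network; the figures are illustrative, not measured.

```python
MB = 1024 * 1024
block_size = 128 * MB
split_size = 256 * MB

# Blocks spanned by one split (ceiling division).
blocks_per_split = -(-split_size // block_size)

# Best case: one block is local to the mapper; the rest are remote reads.
remote_bytes = (blocks_per_split - 1) * block_size

print(blocks_per_split)        # 2 blocks per split
print(remote_bytes // MB)      # 128 MB pulled over the network per mapper
```

With split size equal to block size, blocks_per_split is 1 and the remote traffic drops to zero, which is exactly why the two sizes are usually kept equal.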