What is Input Split in Hadoop MapReduce?
September 20, 2018 at 5:36 pm #6287 – DataFlair Team
Explain InputSplit in Hadoop.
What is the default InputSplit size in Hadoop?
What is an InputSplit used for? Where is it created in Hadoop?
September 20, 2018 at 5:37 pm #6288 – DataFlair Team
The InputSplit is a logical representation of the data on which a mapper task will be executed.
The mapper task reads each InputSplit record by record. If the last record in the split is incomplete (it continues into the next block), the split stores the location of the next block together with the data needed to complete that record.
The default split size is the size of an HDFS block.
We can control this value by setting the mapred.min.split.size parameter in hadoop-site.xml. The number of InputSplits equals the number of mapper tasks.
InputSplits are created by the InputFormat, which defines the way the input file will be divided into splits. Follow the link for more detail: InputSplit in hadoop
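As a rough illustration of how the split size and the minimum-size parameter interact, here is a small Python sketch of the formula Hadoop's FileInputFormat uses, splitSize = max(minSize, min(maxSize, blockSize)). The function name and the standalone script are our own; only the formula comes from Hadoop:

```python
def compute_split_size(block_size, min_size=1, max_size=float("inf")):
    # Mirrors the formula used by Hadoop's FileInputFormat:
    # splitSize = max(minSize, min(maxSize, blockSize))
    return max(min_size, min(max_size, block_size))

MB = 1024 * 1024

# With default settings the split size equals the HDFS block size.
print(compute_split_size(128 * MB) // MB)                      # 128

# Raising the minimum split size above the block size forces
# larger splits (fewer mappers).
print(compute_split_size(128 * MB, min_size=256 * MB) // MB)   # 256
```

This is why, by default, one InputSplit corresponds to one block, and why raising mapred.min.split.size is the usual way to reduce the number of mappers for a job with many small blocks.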
September 20, 2018 at 5:37 pm #6289 – DataFlair Team
Let’s take an example: we have a JSON file of 300 MB and the block size is 128 MB, so we would have three blocks (128 MB, 128 MB, 44 MB).
We know that a file is stored as blocks in HDFS. A record in JSON or XML can span multiple lines, and HDFS does not know where a record starts or ends when it splits the file into blocks, so there is a high chance that a record will be divided between two blocks. If we process an incomplete record, we won’t get the right result. The mapper side has a feature that solves this: the record reader checks the InputSplit and takes data from the next split to complete an incomplete record. Similarly, it ignores the incomplete record at the start of its own input block.
(input block) + (data required to complete the last record) − (incomplete first record) = (InputSplit)
The InputSplit is calculated at run time for a file whose records can span multiple lines. For single-line records, the size of the InputSplit will be the same as that of the input block. Follow the link for more detail: InputSplit in hadoop
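The boundary handling described above can be sketched in Python. This is our own simplified simulation of the behaviour of Hadoop's line-oriented record reader, assuming newline-delimited records: a reader that does not start at offset 0 skips the partial first record (the previous split's reader will finish it), and every reader keeps reading past its split's end to complete the record it has started:

```python
def read_split_records(data, start, end):
    """Simplified sketch of split-boundary handling for
    newline-delimited records (not Hadoop's actual code)."""
    pos = start
    if start != 0:
        # Skip the tail of a record begun in the previous split;
        # that reader is responsible for completing it.
        nl = data.find("\n", start)
        pos = nl + 1 if nl != -1 else len(data)
    records = []
    while pos < end and pos < len(data):
        nl = data.find("\n", pos)
        # Read past `end` if needed to finish the current record.
        records.append(data[pos:nl] if nl != -1 else data[pos:])
        pos = len(data) if nl == -1 else nl + 1
    return records

data = "rec1\nrec2\nrec3\n"          # 15 bytes; boundary at byte 7, mid "rec2"
a = read_split_records(data, 0, 7)   # reads past byte 7 to finish rec2
b = read_split_records(data, 7, 15)  # skips rec2's tail, reads rec3
print(a, b)                          # ['rec1', 'rec2'] ['rec3']
```

Every record is processed exactly once, even though the block boundary falls in the middle of rec2, which is precisely the guarantee the answer above describes.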