What is Input Split in Hadoop MapReduce?

    • #6287
      DataFlair Team
      Spectator

      Explain InputSplit in Hadoop.
      What is the default InputSplit size in Hadoop?
      What is InputSplit used for, and where is it created in Hadoop?

    • #6288
      DataFlair Team
      Spectator

      The InputSplit is a logical representation of the data on which a single mapper task will be executed.

      The mapper task reads the records of its InputSplit from the first record to the last. If the last record in the split is incomplete, the split also stores the location of the next block and the extra data needed to complete that record.

      The default split size is the size of an HDFS block.
      We can control this value by setting the mapred.min.split.size parameter in hadoop-site.xml.
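
      As a minimal sketch (not part of the original answer), the same thing can be done per job from the driver with the newer MapReduce API, where the property mapreduce.input.fileinputformat.split.minsize supersedes mapred.min.split.size; the 256 MB / 512 MB values below are only examples:

        // Hedged example: setting the minimum/maximum split size for one job.
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

        public class SplitSizeDemo {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                // Option 1: set the property directly (value is in bytes).
                conf.setLong("mapreduce.input.fileinputformat.split.minsize", 256L * 1024 * 1024);

                Job job = Job.getInstance(conf, "split-size-demo");
                // Option 2: use the FileInputFormat helper methods.
                FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);
                FileInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024);
            }
        }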

      Number of InputSplits = number of mapper tasks.
      Splits are created by the InputFormat when the job is submitted: the InputFormat defines the way the input file will be divided into splits.
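
      To make the "splits = mappers" point concrete, here is a small sketch (not from the original answer) of the rule FileInputFormat uses to compute the split size; with the default minimum and maximum values the split size equals the HDFS block size, so the number of splits, blocks and map tasks all match for a plain text file:

        // Mirrors the rule in FileInputFormat.computeSplitSize(): the split
        // size is the block size, clamped between the configured minimum and
        // maximum split sizes.
        class SplitSizeRule {
            static long computeSplitSize(long blockSize, long minSize, long maxSize) {
                return Math.max(minSize, Math.min(maxSize, blockSize));
            }
        }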

      Follow the link for more detail: InputSplit in hadoop

    • #6289
      DataFlair Team
      Spectator

      Let’s take an example: we have a JSON file of 300 MB and the block size is 128 MB, so we would have three blocks (128 MB, 128 MB, 44 MB), as the quick calculation below shows.
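
      A quick sanity check of that block arithmetic (illustrative snippet, not from the original post):

        class BlockMath {
            public static void main(String[] args) {
                long fileSize   = 300L * 1024 * 1024;   // 300 MB JSON file
                long blockSize  = 128L * 1024 * 1024;   // default HDFS block size
                long fullBlocks = fileSize / blockSize; // 2 full blocks of 128 MB
                long remainder  = fileSize % blockSize; // 44 MB left for the third block
                System.out.println(fullBlocks + " full blocks + "
                        + remainder / (1024 * 1024) + " MB remainder = 3 blocks");
            }
        }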
      We know that a file is stored as blocks in HDFS. A record in JSON or XML can span multiple lines, and while splitting the file into blocks HDFS does not know where a record starts or ends, so there is a high chance that a record will be divided into two parts. If we process an incomplete record we won’t get the right result.

      The mapper’s record reader has a feature that solves this issue. It checks the InputSplit and takes the data from the next split to complete the incomplete last record; similarly, it ignores the incomplete record at the start of its input block, because that record is completed by the previous mapper.
      (Input block) + (data required to complete the last record) - (incomplete first record) = (Input split)
      The input split is calculated at run time for a file whose records can span multiple lines. For single-line records the size of the input split will be the same as that of the input block.
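
      A simplified sketch of that behaviour (illustrative only, not the actual Hadoop source): every split except the first skips the partial record at its start, and a reader may read past its split’s end to finish a record it has already started.

        import java.io.IOException;

        abstract class BoundaryAwareReader {
            long splitStart;   // byte offset where this split begins
            long splitEnd;     // byte offset where this split ends
            long pos;          // current read position in the underlying stream

            void initialize() throws IOException {
                pos = splitStart;
                if (splitStart != 0) {
                    // Not the first split: the record that begins before splitStart
                    // belongs to the previous split, so skip ahead to the next
                    // record boundary (e.g. the next newline).
                    pos += skipToNextRecordStart();
                }
            }

            boolean nextRecord() throws IOException {
                // Stop only when the *start* of a record falls beyond splitEnd.
                // A record that starts before splitEnd may be completed by reading
                // into the next block, which HDFS serves transparently.
                if (pos >= splitEnd) {
                    return false;
                }
                pos += readOneRecord();
                return true;
            }

            abstract long skipToNextRecordStart() throws IOException;
            abstract long readOneRecord() throws IOException;
        }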

      Follow the link for more detail: InputSplit in hadoop
