Difference between input split and block in Hadoop MapReduce


    • #5797
      DataFlair Team
      Spectator

      What is the difference between an InputSplit and a block in Hadoop MapReduce? How do the two compare?

    • #5799
      DataFlair Team
      Spectator

      In HDFS, files are broken into chunks called blocks, typically 128 MB in size. These blocks are spread across the cluster, which enables high data availability and parallel processing. Data is split purely by file offset, without regard to record boundaries.
      Block:
      A block is the physical representation of data. It is the minimum amount of data that HDFS can read or write, and it contains the actual data.

      InputSplit:
      An InputSplit is the logical representation of data and is used when processing data in a MapReduce program.
      It does not contain actual data, only a reference to it. A Mapper reads its data from the InputSplit, and each InputSplit is fed to exactly one Mapper. Hence, the InputSplit size decides the number of Mappers used. By default, the InputSplit size is the same as the block size, but it can be user-defined.
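      The relationship between split size and Mapper count can be sketched with the formula Hadoop's FileInputFormat uses, splitSize = max(minSize, min(maxSize, blockSize)). The sketch below is a plain-Java illustration of that arithmetic; the class and method names are our own, not Hadoop API, and it only mirrors the default FileInputFormat behaviour under the stated assumptions.

      ```java
      // Illustrates how the split size (and hence the number of Mappers)
      // is derived. The formula max(minSize, min(maxSize, blockSize))
      // mirrors Hadoop's FileInputFormat.computeSplitSize(); this class
      // itself is a standalone sketch, not Hadoop code.
      public class SplitSizeDemo {

          static long computeSplitSize(long blockSize, long minSize, long maxSize) {
              return Math.max(minSize, Math.min(maxSize, blockSize));
          }

          static long numberOfSplits(long fileLength, long splitSize) {
              // Ceiling division: a trailing partial split still needs a Mapper.
              return (fileLength + splitSize - 1) / splitSize;
          }

          public static void main(String[] args) {
              long blockSize = 128L * 1024 * 1024;   // 128 MB, the HDFS default
              long fileLength = 1024L * 1024 * 1024; // a 1 GB file

              // Default case: split size equals block size, so 8 Mappers for 1 GB.
              long splitSize = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
              System.out.println("split size = " + splitSize
                      + ", mappers = " + numberOfSplits(fileLength, splitSize));

              // Raising the minimum split size to 256 MB halves the Mapper count.
              long bigger = computeSplitSize(blockSize, 256L * 1024 * 1024, Long.MAX_VALUE);
              System.out.println("split size = " + bigger
                      + ", mappers = " + numberOfSplits(fileLength, bigger));
          }
      }
      ```

      Raising the minimum split size is the usual way to reduce the number of Mappers when a job has many small splits.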

      For example, suppose the block size in a cluster is 128 MB and each record in a file is 100 MB.
      The first record fits entirely into block 1, but the second record does not fit in the remaining 28 MB of block 1; it starts in block 1 and ends in block 2. If a Mapper were assigned block 1 alone, it could read record 1 but not record 2, because only part of record 2 is in block 1. Here, InputSplit comes into the picture: it logically spans block 1 and block 2, so that one Mapper can read both record 1 and record 2 from a single InputSplit.
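      The example above reduces to simple offset arithmetic. The following standalone sketch (our own illustration, not Hadoop code) reports which block each record starts and ends in, showing that record 2 straddles the block-1/block-2 boundary, so a split that respects record boundaries must cover both blocks.

      ```java
      // Arithmetic sketch of the worked example: 128 MB blocks, 100 MB records.
      // Prints the block in which each record starts and ends.
      public class RecordBoundaryDemo {

          // Returns the 1-based index of the block containing a byte offset.
          static long blockOf(long offset, long blockSize) {
              return offset / blockSize + 1;
          }

          public static void main(String[] args) {
              long blockSize = 128L * 1024 * 1024;  // 128 MB
              long recordSize = 100L * 1024 * 1024; // 100 MB

              for (int record = 1; record <= 2; record++) {
                  long start = (long) (record - 1) * recordSize;
                  long end = start + recordSize - 1; // offset of the record's last byte
                  System.out.println("record " + record
                          + " starts in block " + blockOf(start, blockSize)
                          + " and ends in block " + blockOf(end, blockSize));
              }
          }
      }
      ```

      Record 1 starts and ends in block 1, while record 2 starts in block 1 and ends in block 2, which is exactly the case a single InputSplit has to cover.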

      An InputSplit is just a reference to the start and end locations of the data within the blocks. It can start in one block and end in another. InputSplits respect logical record boundaries, and that is why they are so important.


    • #5801
      DataFlair Team
      Spectator

      InputSplit: It is the logical representation of data; it does not actually contain data.

      Block: It is the physical representation of data and contains the actual data. By default, the HDFS block size is 128 MB, which can be altered as per requirement.

      Hadoop understands line (record) boundaries. Suppose the data is divided into 128 MB blocks, and a record that begins in block 1 does not fit entirely there, with part of it spilling into block 2. This is where the InputSplit comes into the picture: because it is a logical representation, it does not allow a record to be torn apart by the physical block boundary; the split simply extends across both blocks.


    • #5803
      DataFlair Team
      Spectator

      Block:
      A block is the physical data boundary; it is where the actual data is stored.
      The default block size is 128 MB. A block can even end before a logical record ends.

      InputSplit:
      It is the logical representation of data and does not consume storage space. An InputSplit can start in one block and end in another.

      During execution of a job, Hadoop scans the blocks and creates InputSplits so that each Mapper can read one InputSplit.
