Difference between input split and block in Hadoop MapReduce


    • #5797
      DataFlair Team
      Spectator

      What is the difference between an InputSplit and a block in Hadoop MapReduce? How do the two compare?

    • #5799
      DataFlair Team
      Spectator

      In HDFS, files are broken into chunks called blocks, typically 128 MB in size. These blocks are spread across the cluster, which enables high data availability and parallel processing. Data is split purely by file offset, without regard to record boundaries.
      Block:
      A block is the physical representation of data. It is the minimum amount of data that HDFS can read or write, and it contains the actual data.

      InputSplit:
      An InputSplit is the logical representation of data and is used when processing data in a MapReduce program.
      It does not contain actual data, only a reference to it. A Mapper reads its data from the InputSplit, and each InputSplit is fed to exactly one Mapper. Hence, the InputSplit size decides the number of Mappers used. By default, the InputSplit size is the same as the block size, but it can be user-defined.
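      The relationship between split size and Mapper count can be sketched with the formula Hadoop's FileInputFormat uses, splitSize = max(minSize, min(maxSize, blockSize)). The sketch below is a plain-Java illustration of that arithmetic; the class and method names are our own, not Hadoop API, and it only mirrors the default FileInputFormat behaviour under the stated assumptions.

      ```java
      // Illustrates how the split size (and hence the number of Mappers)
      // is derived. The formula max(minSize, min(maxSize, blockSize))
      // mirrors Hadoop's FileInputFormat.computeSplitSize(); this class
      // itself is a standalone sketch, not Hadoop code.
      public class SplitSizeDemo {

          static long computeSplitSize(long blockSize, long minSize, long maxSize) {
              return Math.max(minSize, Math.min(maxSize, blockSize));
          }

          static long numberOfSplits(long fileLength, long splitSize) {
              // Ceiling division: a trailing partial split still needs a Mapper.
              return (fileLength + splitSize - 1) / splitSize;
          }

          public static void main(String[] args) {
              long blockSize = 128L * 1024 * 1024;   // 128 MB, the HDFS default
              long fileLength = 1024L * 1024 * 1024; // a 1 GB file

              // Default case: split size equals block size, so 8 Mappers for 1 GB.
              long splitSize = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
              System.out.println("split size = " + splitSize
                      + ", mappers = " + numberOfSplits(fileLength, splitSize));

              // Raising the minimum split size to 256 MB halves the Mapper count.
              long bigger = computeSplitSize(blockSize, 256L * 1024 * 1024, Long.MAX_VALUE);
              System.out.println("split size = " + bigger
                      + ", mappers = " + numberOfSplits(fileLength, bigger));
          }
      }
      ```

      Raising the minimum split size is the usual way to reduce the number of Mappers when a job has many small splits.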

      For example, suppose the block size in a cluster is 128 MB and each record in a file is 100 MB.
      The first record fits entirely into block 1, but the second record does not fit in the remaining 28 MB of block 1; it starts in block 1 and ends in block 2. If a Mapper were assigned block 1 alone, it could read record 1 but not record 2, because only part of record 2 is in block 1. Here, InputSplit comes into the picture: it logically spans block 1 and block 2, so that one Mapper can read both record 1 and record 2 from a single InputSplit.
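      The example above reduces to simple offset arithmetic. The following standalone sketch (our own illustration, not Hadoop code) reports which block each record starts and ends in, showing that record 2 straddles the block-1/block-2 boundary, so a split that respects record boundaries must cover both blocks.

      ```java
      // Arithmetic sketch of the worked example: 128 MB blocks, 100 MB records.
      // Prints the block in which each record starts and ends.
      public class RecordBoundaryDemo {

          // Returns the 1-based index of the block containing a byte offset.
          static long blockOf(long offset, long blockSize) {
              return offset / blockSize + 1;
          }

          public static void main(String[] args) {
              long blockSize = 128L * 1024 * 1024;  // 128 MB
              long recordSize = 100L * 1024 * 1024; // 100 MB

              for (int record = 1; record <= 2; record++) {
                  long start = (long) (record - 1) * recordSize;
                  long end = start + recordSize - 1; // offset of the record's last byte
                  System.out.println("record " + record
                          + " starts in block " + blockOf(start, blockSize)
                          + " and ends in block " + blockOf(end, blockSize));
              }
          }
      }
      ```

      Record 1 starts and ends in block 1, while record 2 starts in block 1 and ends in block 2, which is exactly the case a single InputSplit has to cover.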

      An InputSplit is just a reference to the start and end locations of the data within the blocks. It can start in one block and end in another. InputSplits respect logical record boundaries, and that is why they are so important.


    • #5801
      DataFlair Team
      Spectator

      InputSplit: It is the logical representation of data; it does not actually contain data.

      Block: It is the physical representation of data and contains the actual data. By default, the HDFS block size is 128 MB, which can be altered as per requirement.

      Hadoop understands line (record) boundaries. Suppose the data is divided into 128 MB blocks, and a record that begins in block 1 does not fit entirely there, with part of it spilling into block 2. This is where the InputSplit comes into the picture: because it is a logical representation, it does not allow a record to be torn apart by the physical block boundary; the split simply extends across both blocks.


    • #5803
      DataFlair Team
      Spectator

      Block:
      A block is the physical data boundary; it is where the actual data is stored.
      The default block size is 128 MB. A block can even end before a logical record ends.

      InputSplit:
      It is the logical representation of data and does not consume storage space. An InputSplit can start in one block and end in another.

      During execution of a job, Hadoop scans the blocks and creates InputSplits so that each Mapper can read one InputSplit.
