What is different between a Blocks and Splits?


    • #4700
      DataFlair Team
      Spectator

      When we copy a file into HDFS, it is divided into blocks, while the input to a MapReduce job is divided into splits. What is the difference between a block and a split?

    • #4701
      DataFlair Team
      Spectator

      When we talk about a data block in Hadoop, we talk about the physical representation of the data.

      An InputSplit is the logical representation of the data.
      Usually the block size and the split size are the same, but both sizes can be configured independently.
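
      As a sketch of how the two sizes relate: Hadoop's FileInputFormat derives the split size by clamping the block size between the configured minimum and maximum split sizes. The class and method names below are illustrative, not the actual Hadoop source:

```java
// Sketch of how FileInputFormat derives the split size from the
// configured min/max split sizes and the HDFS block size.
public class SplitSize {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        // Hadoop clamps the block size between the min and max split sizes.
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Defaults (minSize = 1, maxSize = Long.MAX_VALUE): split size == block size.
        System.out.println(computeSplitSize(128 * mb, 1, Long.MAX_VALUE) / mb); // 128
        // Raising the minimum split size above the block size forces bigger splits.
        System.out.println(computeSplitSize(128 * mb, 256 * mb, Long.MAX_VALUE) / mb); // 256
        // Lowering the maximum split size forces smaller splits.
        System.out.println(computeSplitSize(128 * mb, 1, 64 * mb) / mb); // 64
    }
}
```

      This is why, with default settings, one split usually covers exactly one block.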

      To explain it with an example let’s consider a paragraph:
      ABC DEF …….. KKK
      TMN AKEJHD….. SJKASHDJKH

      Say we’re given the above two lines and we’re using TextInputFormat, where one line is processed as one record. Also assume for this case that the split size is equal to the block size.
      Now suppose there are two blocks of data: one stores the data from the start of the file up to and including TMN, and the other stores the rest of the file.
      The catch with TextInputFormat is that it processes records line by line. So the first record ends at KKK and the second record begins at TMN. Hence the first split ends at KKK and the second split begins at TMN, even though the block boundary falls in the middle of the second line.
      This shows clearly how a block and a split differ from each other.
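
      The behaviour in this example can be modelled with a toy line reader (the names below are illustrative, not Hadoop's actual API): the split containing the start of a line reads the whole line, even past the block boundary, and the next split skips the partial line it begins in.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of how a line-oriented reader (like Hadoop's LineRecordReader)
// handles a record that straddles a split boundary. Not real Hadoop code.
public class LineSplitDemo {
    static List<String> readSplit(String data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        // A split that does not begin at offset 0 skips ahead to the next
        // newline: the partial line belongs to the previous split.
        if (start != 0) {
            int nl = data.indexOf('\n', start);
            if (nl == -1) return records;
            pos = nl + 1;
        }
        // Read whole lines while the line START is inside this split,
        // even if the line itself runs past `end`.
        while (pos < end && pos < data.length()) {
            int nl = data.indexOf('\n', pos);
            int stop = (nl == -1) ? data.length() : nl;
            records.add(data.substring(pos, stop));
            pos = stop + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "ABC DEF KKK\nTMN AKEJHD SJKASHDJKH\n";
        int boundary = 15; // the block boundary falls just after "TMN", mid-line
        // The first split reads BOTH full lines, crossing the boundary.
        System.out.println(readSplit(data, 0, boundary));
        // The second split yields nothing: its only content was the tail
        // of a line already consumed by the first split.
        System.out.println(readSplit(data, boundary, data.length()));
    }
}
```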

      Another point to note is that a block depends only on the block size configured for HDFS, whereas a split depends on the RecordReader and the InputFormat specified for the MapReduce job, along with the configured split size.
      The number of map tasks for a job depends on the split size: the bigger the configured split size, the fewer the map tasks. This is because each split would then consist of more than one block, so fewer map tasks are needed to process the data.
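      The arithmetic behind this is roughly ceiling division of the file size by the split size (the class and method names here are made up for illustration):

```java
// Illustrative arithmetic: the number of map tasks is roughly the file size
// divided by the split size, rounded up.
public class MapTaskCount {
    static long numMapTasks(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println(numMapTasks(1024 * mb, 128 * mb)); // 1 GB / 128 MB -> 8
        System.out.println(numMapTasks(1024 * mb, 256 * mb)); // doubling the split size -> 4
    }
}
```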
      Learn more about InputSplit vs Block in Hadoop.

    • #4702
      DataFlair Team
      Spectator

      Block 
      Normally, data is represented as a file on a file system (Linux, DOS, etc.).
      When the same data is pushed to an HDFS cluster, Hadoop manages the file by dividing it into blocks, which have a default size of 128 MB.

      At any given point in time, a single block belongs to one file only.
      These blocks are replicated to other DataNodes based on the replication factor, which is set to 3 by default (meaning the same block is present in three different places in the HDFS cluster).

      Basically, a block is how data is stored in HDFS: HDFS reads and writes data one block at a time.
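
      Both defaults mentioned above are controlled by standard HDFS properties; an illustrative hdfs-site.xml fragment (values shown are the defaults, adjust per cluster):

```xml
<!-- hdfs-site.xml: illustrative values for the defaults discussed above -->
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value> <!-- 128 MB -->
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```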

      InputSplit

      Since the data is stored as blocks on HDFS, there is always a chance that part of a record spans two blocks. To standardize what is given as input to a mapper, a logical representation of the data is used: the InputSplit.

      In other words, an InputSplit is the unit of work handed to a mapper: the framework assigns one split per map task, and the RecordReader then iterates over that split and presents it to the mapper one record at a time (with TextInputFormat, one record is one line).

      How the data is split into records depends on the InputFormat. The default is TextInputFormat (a subclass of FileInputFormat), which uses the line feed to delimit records. So every line is treated as one record, even when that line spans two different blocks.

