What is different between a Blocks and Splits?


    • #4700
      DataFlair Team
      Spectator

      When we copy a file into HDFS, it is divided into blocks, while the input to a MapReduce job is divided into splits. What is the difference between a block and a split?

    • #4701
      DataFlair Team
      Spectator

      When we talk about a data block in Hadoop, we talk about the physical representation of the data.

      An InputSplit is the logical representation of the data.
      Usually the block size and the split size are the same, but both sizes can be configured independently.
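
      As a sketch of how the two sizes relate: Hadoop's FileInputFormat derives the split size by clamping the block size between the configured minimum and maximum split sizes. The class and method names below are illustrative, not the actual Hadoop source:

```java
// Sketch of how FileInputFormat derives the split size from the
// configured min/max split sizes and the HDFS block size.
public class SplitSize {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        // Hadoop clamps the block size between the min and max split sizes.
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Defaults (minSize = 1, maxSize = Long.MAX_VALUE): split size == block size.
        System.out.println(computeSplitSize(128 * mb, 1, Long.MAX_VALUE) / mb); // 128
        // Raising the minimum split size above the block size forces bigger splits.
        System.out.println(computeSplitSize(128 * mb, 256 * mb, Long.MAX_VALUE) / mb); // 256
        // Lowering the maximum split size forces smaller splits.
        System.out.println(computeSplitSize(128 * mb, 1, 64 * mb) / mb); // 64
    }
}
```

      This is why, with default settings, one split usually covers exactly one block.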

      To explain it with an example let’s consider a paragraph:
      ABC DEF …….. KKK
      TMN AKEJHD….. SJKASHDJKH

      Say we’re given the above two lines and we’re using TextInputFormat, where one line is processed as one record. Also assume for this case that the split size is equal to the block size.
      Now suppose there are two blocks of data: one stores the data from the start of the file up to and including TMN, and the other stores the rest of the file.
      The catch with TextInputFormat is that it processes records line by line. So the first record ends at KKK and the second record begins at TMN. Hence the first split ends at KKK and the second split begins at TMN, even though the block boundary falls in the middle of the second line.
      This shows clearly how a block and a split differ from each other.
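
      The behaviour in this example can be modelled with a toy line reader (the names below are illustrative, not Hadoop's actual API): the split containing the start of a line reads the whole line, even past the block boundary, and the next split skips the partial line it begins in.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of how a line-oriented reader (like Hadoop's LineRecordReader)
// handles a record that straddles a split boundary. Not real Hadoop code.
public class LineSplitDemo {
    static List<String> readSplit(String data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        // A split that does not begin at offset 0 skips ahead to the next
        // newline: the partial line belongs to the previous split.
        if (start != 0) {
            int nl = data.indexOf('\n', start);
            if (nl == -1) return records;
            pos = nl + 1;
        }
        // Read whole lines while the line START is inside this split,
        // even if the line itself runs past `end`.
        while (pos < end && pos < data.length()) {
            int nl = data.indexOf('\n', pos);
            int stop = (nl == -1) ? data.length() : nl;
            records.add(data.substring(pos, stop));
            pos = stop + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "ABC DEF KKK\nTMN AKEJHD SJKASHDJKH\n";
        int boundary = 15; // the block boundary falls just after "TMN", mid-line
        // The first split reads BOTH full lines, crossing the boundary.
        System.out.println(readSplit(data, 0, boundary));
        // The second split yields nothing: its only content was the tail
        // of a line already consumed by the first split.
        System.out.println(readSplit(data, boundary, data.length()));
    }
}
```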

      Another point to note is that a block depends only on the block size configured for HDFS, whereas a split depends on the RecordReader and the InputFormat specified for the MapReduce job, along with the configured split size.
      The number of map tasks for a job depends on the split size: the bigger the configured split size, the fewer the map tasks. This is because each split would then consist of more than one block, so fewer map tasks are needed to process the data.
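      The arithmetic behind this is roughly ceiling division of the file size by the split size (the class and method names here are made up for illustration):

```java
// Illustrative arithmetic: the number of map tasks is roughly the file size
// divided by the split size, rounded up.
public class MapTaskCount {
    static long numMapTasks(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println(numMapTasks(1024 * mb, 128 * mb)); // 1 GB / 128 MB -> 8
        System.out.println(numMapTasks(1024 * mb, 256 * mb)); // doubling the split size -> 4
    }
}
```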
      Learn more about InputSplit vs Block in Hadoop.

    • #4702
      DataFlair Team
      Spectator

      Block 
      Normally, data is represented as a file on a file system (Linux, DOS, etc.).
      When the same data is pushed to an HDFS cluster, Hadoop manages the file by dividing it into blocks, which have a default size of 128 MB.

      At any given point in time, a single block belongs to one file only.
      These blocks are replicated to other DataNodes based on the replication factor, which is set to 3 by default (meaning the same block is present in three different places in the HDFS cluster).

      Basically, a block is how data is stored in HDFS: HDFS reads and writes data one block at a time.
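
      Both defaults mentioned above are controlled by standard HDFS properties; an illustrative hdfs-site.xml fragment (values shown are the defaults, adjust per cluster):

```xml
<!-- hdfs-site.xml: illustrative values for the defaults discussed above -->
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value> <!-- 128 MB -->
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```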

      InputSplit

      Since the data is stored as blocks on HDFS, there is always a chance that part of a record spans two blocks. To standardize what is given as input to a mapper, a logical representation of the data is used: the InputSplit.

      In other words, an InputSplit is the unit of work handed to a mapper: the framework assigns one split per map task, and the RecordReader then iterates over that split and presents it to the mapper one record at a time (with TextInputFormat, one record is one line).

      How the data is split into records depends on the InputFormat. The default is TextInputFormat (a subclass of FileInputFormat), which uses the line feed to delimit records. So every line is treated as one record, even when that line spans two different blocks.

