Data node block size in HDFS: why 64 MB?

    • #5758
      DataFlair Team
      Spectator

      The default data block size in HDFS/Hadoop is 64 MB, whereas a typical disk file system block is only 4 KB. Why is the block size so large in HDFS/Hadoop?

    • #5759
      DataFlair Team
      Spectator

      A block is the smallest unit of data that the file system stores, and HDFS stores each file as a set of blocks. In Hadoop 1.x the default block size was 64 MB, and it can be configured as per the requirement. Files were split into 64 MB blocks, which HDFS then distributed across multiple nodes of the cluster.
      All the blocks of a file are of the same size except the last block, which can be the same size or smaller.
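
      For illustration, here is a minimal Java sketch of writing a single file with an explicit block size through the HDFS FileSystem API; the path, replication factor, buffer size and block size are example values chosen for this sketch, not anything prescribed by HDFS.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FSDataOutputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class BlockSizeExample {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();       // picks up core-site.xml / hdfs-site.xml
              FileSystem fs = FileSystem.get(conf);

              long blockSize = 128L * 1024 * 1024;            // 128 MB, overrides the cluster default for this file
              short replication = 3;                          // example replication factor
              int bufferSize = 4096;                          // example I/O buffer size in bytes

              Path path = new Path("/user/example/data.txt"); // hypothetical path
              // create(path, overwrite, bufferSize, replication, blockSize)
              try (FSDataOutputStream out =
                       fs.create(path, true, bufferSize, replication, blockSize)) {
                  out.writeUTF("hello HDFS");
              }
          }
      }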

      Why is the block size 64 MB?
      HDFS stores huge data sets, i.e. terabytes and petabytes of data. If HDFS used a 4 KB block size like the Linux file system, a single large file would produce an enormous number of blocks and therefore an enormous amount of metadata. Managing that many blocks and their metadata would create huge overhead on the NameNode, which is something we don't want.
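
      As a rough, illustrative calculation (the 1 TB file size and the ~150 bytes of NameNode heap per block are assumed ballpark figures, not HDFS constants):

      public class BlockCountSketch {
          public static void main(String[] args) {
              long fileSize = 1L << 40;                  // 1 TB (binary), example file size
              long smallBlock = 4L * 1024;               // 4 KB block
              long largeBlock = 64L * 1024 * 1024;       // 64 MB block

              long blocksSmall = fileSize / smallBlock;  // 268,435,456 blocks
              long blocksLarge = fileSize / largeBlock;  // 16,384 blocks

              // ~150 bytes of NameNode heap per block is a commonly quoted rough estimate
              System.out.printf("4 KB blocks : %,d (~%,d MB of NameNode metadata)%n",
                      blocksSmall, blocksSmall * 150 / (1024 * 1024));
              System.out.printf("64 MB blocks: %,d (~%,d MB of NameNode metadata)%n",
                      blocksLarge, blocksLarge * 150 / (1024 * 1024));
          }
      }

      With 4 KB blocks a single 1 TB file would need hundreds of millions of block entries in the NameNode's memory; with 64 MB blocks it needs only a few thousand.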

      Hadoop 2.x has a default block size of 128 MB.

      Reasons for the increase in block size from 64 MB to 128 MB are as follows:

      To improve NameNode performance.
      To improve the performance of MapReduce jobs, since the number of mappers depends directly on the block size.
      If we are managing about 1 petabyte of data with a 64 MB block size, well over 15 million blocks are created, which is difficult for the NameNode to manage; increasing the block size to 128 MB halves that number (see the sketch below).
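
      A back-of-the-envelope version of that calculation, using binary units (1 PB taken as 2^50 bytes); with the default FileInputFormat, each block typically becomes one input split and hence one map task, so the same division also approximates the number of mappers:

      public class PetabyteBlocks {
          public static void main(String[] args) {
              long data = 1L << 50;                           // ~1 PB of data (binary units)
              long blk64 = 64L * 1024 * 1024;
              long blk128 = 128L * 1024 * 1024;

              System.out.println("blocks @ 64 MB : " + data / blk64);   // 16,777,216
              System.out.println("blocks @ 128 MB: " + data / blk128);  //  8,388,608
              // Doubling the block size roughly halves both the NameNode's block
              // count and the number of map tasks needed to scan the data.
          }
      }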

    • #5761
      DataFlair Team
      Spectator

      The block size in HDFS can be configured, but the default is 64 MB, and 128 MB in Hadoop version 2.
      If the block size were 4 KB, as in Unix file systems, there would be far more blocks and far too many mappers to process them, which would degrade performance.
      Importantly, Hadoop is built to handle Big Data, so the block size should be large. A larger block size also reduces the load on the NameNode, which holds the metadata: the more blocks, the more overhead on the NameNode.
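
      For a cluster-wide default (rather than the per-file override shown earlier), the block size is typically set through the dfs.blocksize property in hdfs-site.xml; the 128 MB value below is only an example:

      <!-- hdfs-site.xml (example value, not a recommendation) -->
      <configuration>
        <property>
          <name>dfs.blocksize</name>
          <value>134217728</value> <!-- 128 MB in bytes; Hadoop 2.x also accepts suffixes such as 128m -->
        </property>
      </configuration>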

