Data Block in HDFS | HDFS Blocks & Data Block Size

1. HDFS Data Block Tutorial: Objective

In this tutorial on data blocks in Hadoop HDFS, we will learn what a data block in HDFS is, what the default data block size in Hadoop HDFS is, why the Hadoop block size is 128 MB, and the various advantages of Hadoop HDFS blocks.

Hadoop HDFS Blocks

2. What is a Data Block?

In Hadoop, HDFS splits huge files into small chunks known as data blocks. A data block is the smallest unit of data in the filesystem. Clients and administrators have no control over the blocks themselves, such as block location; the Namenode decides all such things.
HDFS stores each file as a sequence of data blocks. However, the data block size in HDFS is very large. The default size of an HDFS block is 128 MB, which you can configure as per your requirements. All blocks of a file are the same size except the last block, which can be either the same size or smaller. Files are split into 128 MB blocks and then stored in the Hadoop filesystem, and HDFS distributes these blocks across multiple nodes in the cluster.
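
As a minimal sketch of how this looks from code (using Hadoop's Java FileSystem API; the path /user/data/sample.txt is a hypothetical example), you can read both the cluster's default block size and the block size a particular file was written with:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);

    // Default block size for newly created files (dfs.blocksize, 128 MB by default)
    System.out.println("Default block size: "
        + fs.getDefaultBlockSize(new Path("/")) + " bytes");

    // Block size that an existing (hypothetical) file was written with
    FileStatus status = fs.getFileStatus(new Path("/user/data/sample.txt"));
    System.out.println("File block size: " + status.getBlockSize() + " bytes");
  }
}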

Data Blocks in Hadoop HDFS

Now, taking the example above where the file size is 518 MB, suppose we are using the default data block size of 128 MB. Then 5 data blocks are created: the first four blocks are 128 MB each, but the last data block is only 6 MB. This example makes it clear that a file stored in HDFS need not be an exact multiple of the configured block size (128 MB, 256 MB, etc.); the final block of a file uses only as much space as it needs.
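
The arithmetic behind this example can be written out as a short sketch in plain Java (no Hadoop dependency; the 518 MB figure comes from the example above):

public class BlockMath {
  public static void main(String[] args) {
    long blockSize = 128L * 1024 * 1024;   // default block size, 128 MB
    long fileSize  = 518L * 1024 * 1024;   // file from the example, 518 MB

    long fullBlocks = fileSize / blockSize;   // 4 full blocks of 128 MB
    long lastBlock  = fileSize % blockSize;   // 6 MB remainder

    long totalBlocks = fullBlocks + (lastBlock > 0 ? 1 : 0);
    System.out.println(totalBlocks + " blocks, last block = "
        + lastBlock / (1024 * 1024) + " MB");   // prints: 5 blocks, last block = 6 MB
  }
}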

3. Why is HDFS Data Block size 128 MB in Hadoop?

Many of us have this question in mind: “Why is the HDFS block size so large?” Let us understand this.
HDFS stores huge data sets, i.e. terabytes and petabytes of data. If HDFS used a 4 KB block size like the Linux filesystem, we would have far too many data blocks in Hadoop HDFS, and therefore far too much metadata. Managing this huge number of blocks and all their metadata would create enormous overhead and traffic, which is something we don't want.
On the other hand, the block size can't be so large that the system ends up waiting a very long time for one last unit of data processing to finish its work.
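
A rough back-of-the-envelope calculation makes the difference concrete (the 1 TB file size here is purely illustrative):

public class MetadataOverhead {
  public static void main(String[] args) {
    long fileSize = 1024L * 1024 * 1024 * 1024;   // 1 TB, illustrative

    long smallBlocks = fileSize / (4L * 1024);            // 4 KB blocks
    long largeBlocks = fileSize / (128L * 1024 * 1024);   // 128 MB blocks

    // ~268 million block entries versus 8,192 -- every one of which
    // the Namenode must track in memory
    System.out.println("4 KB blocks:   " + smallBlocks);   // 268435456
    System.out.println("128 MB blocks: " + largeBlocks);   // 8192
  }
}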

4. Advantages of Hadoop Data Blocks

Below are the main advantages of HDFS Blocks in Hadoop:

a. Simplicity of storage management

As data blocks have a fixed size, it is very easy to calculate how many blocks can be stored on a given disk.

b. Ability to store very large files

HDFS can store very large files, even files larger than any single disk in the cluster, because each file is broken into HDFS blocks that are distributed across various nodes.

c. Fault tolerance and High Availability of HDFS

Blocks are easy to replicate between datanodes, which provides the fault tolerance and high availability of HDFS.
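
As a small illustration, the replication factor of an existing file can be changed through Hadoop's Java FileSystem API; a minimal sketch (the path is hypothetical, and 3 is HDFS's default replication factor):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Ask HDFS to keep 3 copies of every block of this (hypothetical) file;
    // the Namenode schedules the replicas across different datanodes.
    boolean ok = fs.setReplication(new Path("/user/data/sample.txt"), (short) 3);
    System.out.println("Replication updated: " + ok);
  }
}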

d. Simple Storage mechanism for datanodes

HDFS blocks simplify storage for the datanodes. The Namenode maintains the metadata of all blocks, so a datanode does not need to concern itself with block metadata such as file permissions.
This was all about HDFS data blocks. We hope you liked our HDFS Data Block tutorial.

If you liked this article on HDFS blocks, or if you have any queries regarding it, just drop a comment in the comment section and we will get back to you.

9 Responses

  1. Sasmita Swain says:

    Thank you for your post. Nice tutorial, it is very helpful.

    • Data Flair says:

      We are glad our tutorial helped you; that is such a pleasure for us. We are regularly updating our content with more informative articles on HDFS for readers like you. Keep reading our blogs and keep sharing your experience with us.

  2. Gopal Krishna says:

    Very nicely explained HDFS blocks, but I have one doubt: in your example you mentioned a file of 518 MB, which creates 5 data blocks in HDFS, and the last one occupies only 6 MB, leaving 122 MB of free space. Would this space be filled up when we write the next file of, let's say, the same 518 MB size, or would this space be wasted?

    • Amith V Kowndinya says:

      The space won't be wasted; actually, HDFS will create a smaller block.
      For example: if we want to write a file of size 10 MB, then a 10 MB block will be allocated rather than a 128 MB one.

  3. DPT says:

    Let's say the block size is 64 MB and is updated to 128 MB; new data written after the update will be stored with the 128 MB block size. What about the old data? Will it stay the same, or can we update it to 128 MB as well?

    • Rahul says:

      Yes, when you update the block size (from 64 MB to 128 MB) in the configuration file (hdfs-site.xml), newer data will be created with the new block size, i.e. 128 MB.
      The old data will remain in 64 MB blocks, but yes, you can rewrite it with the 128 MB block size: run a copy command (or distcp), and make sure to delete the older data afterwards.

  4. Dipak says:

    I have a question here: as shown in the above example the block size is 128 MB and the last block of the file is 6 MB. My question is, what will happen with the remaining space in that data block? Will it be reused, and if yes, how does that work?

    • DataFlair support says:

      If the data size is less than the block size, then a smaller block is allocated. As in the given example, since the remaining data is merely 6 MB (which is less than the block size), a block of size 6 MB is allocated.

  5. Rohit says:

    I have a question: what is the block scanner and how does it work?
