Data Block in HDFS | HDFS Blocks & Data Block Size
1. HDFS Data Block Tutorial: Objective
In this tutorial on data blocks in Hadoop HDFS, we will learn what a data block is, what the default HDFS block size is, why that default is 128 MB, and the main advantages of HDFS blocks.
2. What is a Data Block?
In Hadoop, HDFS splits huge files into chunks known as data blocks. A data block is the smallest unit of data that HDFS stores. We (client and admin) do not control details such as where a block is placed; the namenode decides all such things.
HDFS stores each file as a sequence of data blocks. Compared with an ordinary filesystem, the HDFS block size is very large: the default is 128 MB, which you can configure as per your requirement. All blocks of a file are the same size except the last block, which can be either the same size or smaller. Files are split into 128 MB blocks and then stored in the Hadoop file system. Hadoop is responsible for distributing the data blocks across multiple nodes.
Suppose a file is 518 MB and we are using the default block size of 128 MB. Then 5 data blocks are created: the first four blocks are 128 MB each, and the last block is only 6 MB. This example makes it clear that a file stored in HDFS does not need to be an exact multiple of the configured block size (128 MB, 256 MB, etc.); the final block of a file uses only as much space as it needs.
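The arithmetic in this example can be checked with a few lines of Python. This is only an illustrative sketch: the function name split_into_blocks is made up for this tutorial and is not part of any Hadoop API, and sizes are handled as whole megabytes for simplicity.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the block sizes (in MB) that a file of the given size would occupy.

    Hypothetical helper for illustration only; it mimics the rule that every
    block is full-sized except possibly the last one.
    """
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("file size must be >= 0 and block size must be > 0")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        # The final block holds only the leftover data.
        sizes.append(remainder)
    return sizes


print(split_into_blocks(518))  # [128, 128, 128, 128, 6]
```

With a 256 MB block size the same 518 MB file would need three blocks, the last one again holding just the 6 MB remainder.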
3. Why is HDFS Data Block size 128 MB in Hadoop?
Many of us ask, “Why is the HDFS block size so large?” Let us understand this.
HDFS holds huge data sets, i.e. terabytes and petabytes of data. If HDFS used a 4 KB block size, as a Linux file system typically does, there would be far too many data blocks and therefore too much metadata. Managing this huge number of blocks and their metadata would create heavy overhead and traffic, which is something we don’t want.
On the other hand, the block size can’t be so large that the system waits a very long time for the last unit of data processing to finish its work.
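To get a feel for the block-count argument, the short Python sketch below compares how many blocks 1 TB of data would need at a 4 KB block size and at a 128 MB block size. The helper name block_count and the 1 TB figure are chosen only for illustration, and binary units (1 KB = 1024 bytes) are assumed.

```python
KB = 1024
MB = 1024 * KB
TB = 1024 * 1024 * MB  # binary units assumed: 1 TB = 1024 GB


def block_count(data_bytes, block_bytes):
    """Number of blocks needed to hold data_bytes (hypothetical helper)."""
    # Integer ceiling division: a partial final block still counts as a block.
    return -(-data_bytes // block_bytes)


small = block_count(TB, 4 * KB)
large = block_count(TB, 128 * MB)

print(small)           # 268435456 blocks at 4 KB
print(large)           # 8192 blocks at 128 MB
print(small // large)  # 32768 times more blocks to track at 4 KB
```

Every one of those blocks needs its own metadata entry, so the smaller block size multiplies the bookkeeping by the same factor.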
4. Advantages of Hadoop Data Blocks
Below are the main advantages of HDFS Blocks in Hadoop:
a. Simplicity of storage management
Because data blocks have a fixed size, it is very easy to calculate how many of them can be stored on a disk.
b. Ability to store very large files
HDFS can store very large files, even files larger than a single disk, because each file is broken into HDFS blocks that are distributed across various nodes.
c. Fault tolerance and High Availability of HDFS
d. Simple Storage mechanism for datanodes
HDFS blocks simplify storage on the datanodes. The namenode maintains the metadata of all the blocks, so a datanode doesn’t need to concern itself with metadata such as file permissions.
This was all about data blocks. We hope you liked our HDFS Data Block tutorial.