Data Block in HDFS – HDFS Blocks & Data Block Size
Have you ever thought about how the Hadoop Distributed File system stores files of large size?
Hadoop is known for its reliable storage. Hadoop HDFS can store data of any size and format.
HDFS in Hadoop divides each file into fixed-size chunks called data blocks. These data blocks offer several advantages to Hadoop HDFS. Let us study these data blocks in detail.
In this article, we will study data blocks in Hadoop HDFS. The article discusses:
- What is an HDFS data block and what is its size?
- Blocks created for a file with an example.
- Why are blocks in HDFS huge?
- Advantages of Hadoop Data Blocks
Let us first begin with an introduction to the data block and its default size.
What is a data block in HDFS?
Files in HDFS are broken into block-sized chunks called data blocks. These blocks are stored as independent units.
The size of these HDFS data blocks is 128 MB by default. We can configure the block size as per our requirement by changing the dfs.blocksize property (named dfs.block.size in older releases) in hdfs-site.xml.
Hadoop distributes these blocks across different slave machines (DataNodes), and the master machine (NameNode) stores the metadata about block locations.
All the blocks of a file are of the same size except the last one (if the file size is not a multiple of the block size). See the example below to understand this fact.
Suppose we have a file of size 612 MB, and we are using the default block configuration (128 MB). Therefore, five blocks are created: the first four blocks are 128 MB each, and the fifth block is 100 MB (128*4+100=612).
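The arithmetic in this example can be checked with a short Python sketch. It only models the size calculation and is not how HDFS itself is implemented; the function name split_into_blocks is a hypothetical helper introduced for illustration.

```python
MB = 1024 * 1024  # bytes in one megabyte


def split_into_blocks(file_size, block_size=128 * MB):
    """Return the sizes (in bytes) of the blocks needed for a file.

    Illustrative model only: every block is full except possibly the last.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block holds only the leftover bytes, not a full block.
        sizes.append(remainder)
    return sizes


if __name__ == "__main__":
    blocks = split_into_blocks(612 * MB)
    print([size // MB for size in blocks])  # [128, 128, 128, 128, 100]
```

Calling split_into_blocks(1 * MB) returns a single 1 MB entry, which matches the point that a small file does not take up a full block of storage.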
From the above example, we can conclude that:
- A file in HDFS that is smaller than a single block does not occupy a full block's worth of space on the underlying storage.
- Each file stored in HDFS doesn’t need to be an exact multiple of the configured block size.
Now let’s see the reasons behind the large size of the data blocks in HDFS.
Why are blocks in HDFS huge?
The default size of the HDFS data block is 128 MB. The reasons for the large size of blocks are:
- To minimize the cost of seeks: With large blocks, the time taken to transfer the data from disk is significantly longer than the time taken to seek to the start of the block. As a result, a large file made of multiple blocks is transferred at close to the disk transfer rate.
- To limit metadata: If blocks are small, there will be too many blocks in Hadoop HDFS and thus too much metadata for the master machine to store. Managing such a huge number of blocks and their metadata creates overhead and extra traffic in the network.
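Both reasons can be illustrated with rough numbers. The sketch below assumes a 10 ms seek time and a 100 MB/s disk transfer rate; these figures are illustrative assumptions, not measurements from the article, and the helper names seek_fraction and block_count are hypothetical.

```python
import math


def seek_fraction(block_mb, seek_ms=10.0, transfer_mb_per_s=100.0):
    """Fraction of the total read time for one block that is spent seeking.

    seek_ms and transfer_mb_per_s are assumed, illustrative disk figures.
    """
    seek_s = seek_ms / 1000.0
    transfer_s = block_mb / transfer_mb_per_s
    return seek_s / (seek_s + transfer_s)


def block_count(data_mb, block_mb):
    """Number of blocks (and so block metadata entries) needed for data_mb of data."""
    return math.ceil(data_mb / block_mb)


if __name__ == "__main__":
    one_tb_in_mb = 1024 * 1024
    for block_mb in (1, 128):
        share = seek_fraction(block_mb)
        count = block_count(one_tb_in_mb, block_mb)
        print(f"{block_mb} MB blocks: seek share {share:.2%}, {count} blocks per TB")
```

Under these assumptions, 1 MB blocks spend about half of each read seeking and need over a million blocks per terabyte, while 128 MB blocks spend under 1% of the time seeking and need 8192 blocks.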
Advantages of Hadoop Data Blocks
1. No limitation on the file size
A file can be larger than any single disk in the network, because its blocks do not have to be stored on the same disk.
2. Simplicity of storage subsystem
Since blocks are of fixed size, we can easily calculate the number of blocks that can be stored on a given disk. This simplifies the storage subsystem.
3. Fit well with replication for providing Fault Tolerance and High Availability
4. Eliminating metadata concerns
Since blocks are just chunks of data to be stored, we don’t need to store file metadata (such as permission information) with the blocks. Another system can handle metadata separately.
We can conclude that HDFS data blocks are block-sized chunks with a default size of 128 MB. We can configure this size as per our requirements. Files smaller than the block size do not occupy a full block of storage. HDFS data blocks are large in order to reduce the cost of seeks and the metadata overhead.
The article also listed the advantages of data blocks in HDFS.
You can also check the number of data blocks for a file, or the block locations, using the hdfs fsck command.
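As a minimal sketch, the fsck report can be requested from Python by calling the hdfs command-line tool with the -files, -blocks and -locations options. This assumes the hdfs CLI is on the PATH and a cluster is reachable; the helper names and the path /user/demo/sample.txt are hypothetical.

```python
import subprocess


def fsck_command(path):
    """Build the argument list for an fsck report on one HDFS path.

    -files, -blocks and -locations ask fsck to print the files checked,
    their blocks, and the machines holding each block.
    """
    return ["hdfs", "fsck", path, "-files", "-blocks", "-locations"]


def run_fsck(path):
    """Run fsck and return its text report.

    Assumes the hdfs CLI is installed and can reach the cluster.
    """
    result = subprocess.run(fsck_command(path), capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    # Hypothetical path, used only for illustration; nothing is executed here.
    print(" ".join(fsck_command("/user/demo/sample.txt")))
```

Running the script only prints the command line; call run_fsck on a real path to get the actual report.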