Data Blocks in Hadoop HDFS – Hadoop Distributed File System

1. Objective

In this tutorial on data blocks in Hadoop HDFS, we will learn what a block in HDFS is, what the default data block size is, why the Hadoop block size is 128 MB, and the advantages of HDFS blocks.

2. Introduction to Data Blocks in Hadoop HDFS

Let us first understand what a block in HDFS is.

In Hadoop, HDFS splits huge files into smaller chunks known as blocks. A block is the smallest unit of data that the filesystem stores. Neither the client nor the administrator controls block-level details such as block location; the NameNode decides these.

HDFS stores each file as a sequence of blocks, and the block size in HDFS is much larger than in an ordinary filesystem. The default HDFS block size is 128 MB, which you can change to suit your requirements (the setting is the dfs.blocksize property). All blocks of a file are the same size except the last one, which can be the same size or smaller. With the default setting, files are split into 128 MB blocks and then stored in the Hadoop file system. Hadoop then distributes these blocks across multiple nodes of the cluster.

[Figure: Data Blocks in Hadoop HDFS]

Consider a file of 518 MB stored with the default block size of 128 MB. HDFS creates 5 blocks: the first four are 128 MB each, and the last is only 6 MB. This example makes it clear that a file stored in HDFS does not have to be an exact multiple of the configured block size (128 MB, 256 MB, etc.); the final block of a file uses only as much space as it needs.
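The arithmetic in this example can be checked with a short Python sketch. The function name split_into_blocks is hypothetical and exists only for illustration; it is not part of any Hadoop API, and it models block sizes only, not how HDFS actually writes data.

```python
MB = 1024 * 1024  # bytes in one megabyte (binary)


def split_into_blocks(file_size, block_size=128 * MB):
    """Return the sizes (in bytes) of the blocks a file would occupy.

    Hypothetical helper for illustration only; not a Hadoop API.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block only takes as much space as the leftover data.
        sizes.append(remainder)
    return sizes


blocks = split_into_blocks(518 * MB)
print(len(blocks))                # 5
print([b // MB for b in blocks])  # [128, 128, 128, 128, 6]
```

Passing a different block_size, for example 256 * MB, shows how the same file would be divided under another configuration.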

3. Why is HDFS Block size 128 MB in Hadoop?

Many of us wonder why the block size in HDFS is so large. Let us understand this.

HDFS holds huge data sets, i.e. terabytes and petabytes of data. If HDFS used a 4 KB block size, as a typical Linux file system does, there would be far too many data blocks in HDFS and therefore far too much metadata. Managing this huge number of blocks and their metadata would create significant overhead and traffic, which we want to avoid.

On the other hand, the block size can't be so large that the system waits a very long time for the last unit of data processing to finish its work.
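To get a feel for the difference, the following Python sketch counts the blocks needed for a single 1 TB file at each block size. The helper block_count is a hypothetical name used for illustration; the sketch counts blocks only and makes no claim about how much memory each block's metadata consumes.

```python
KB = 1024
MB = 1024 * KB
TB = 1024 * 1024 * MB


def block_count(file_size, block_size):
    """Number of blocks needed to hold file_size bytes (ceiling division)."""
    return -(-file_size // block_size)


small = block_count(1 * TB, 4 * KB)    # 4 KB blocks
large = block_count(1 * TB, 128 * MB)  # 128 MB blocks

print(small)           # 268435456
print(large)           # 8192
print(small // large)  # 32768
```

Under these assumptions, the 4 KB layout needs 32,768 times as many block entries to track as the 128 MB layout for the same file.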

4. Advantages of Hadoop HDFS blocks

Below are the main advantages of HDFS Blocks in Hadoop:

a. Simplicity of storage management

Because HDFS blocks have a fixed size, it is easy to calculate the number of blocks that can be stored on a disk.
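As a minimal sketch of that calculation, the Python snippet below divides a disk's capacity by the block size. The function blocks_per_disk and the 4 TB disk are assumptions chosen for illustration, and the result is a count of full-size blocks; final blocks of files may be smaller, so a real disk can hold more blocks than this.

```python
MB = 1024 * 1024
TB = 1024 * 1024 * MB


def blocks_per_disk(disk_capacity, block_size=128 * MB):
    """Number of full-size blocks that fit in disk_capacity bytes."""
    return disk_capacity // block_size


print(blocks_per_disk(4 * TB))  # 32768
```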

b. Ability to store very large files

HDFS can store very large files, even files larger than a single disk, because each file is broken into blocks that are distributed across various nodes.
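The idea can be illustrated with a toy Python sketch that spreads a file's blocks over several nodes. Everything here is an assumption for illustration: the function assign_round_robin, the node names, and the round-robin order are hypothetical, and this is not the placement policy that the NameNode actually applies.

```python
from collections import defaultdict

MB = 1024 * 1024
GB = 1024 * MB


def assign_round_robin(file_size, node_names, block_size=128 * MB):
    """Map each node name to the total bytes of the file's blocks it holds.

    Toy round-robin placement for illustration; not HDFS's real policy.
    """
    if not node_names:
        raise ValueError("at least one node is required")
    per_node = defaultdict(int)
    offset = 0
    index = 0
    while offset < file_size:
        # The last block may be smaller than block_size.
        size = min(block_size, file_size - offset)
        per_node[node_names[index % len(node_names)]] += size
        offset += size
        index += 1
    return dict(per_node)


nodes = ["node-a", "node-b", "node-c"]  # hypothetical node names
usage = assign_round_robin(3 * GB, nodes)
print({name: used // MB for name, used in usage.items()})
# {'node-a': 1024, 'node-b': 1024, 'node-c': 1024}
```

In this sketch a 3 GB file places only 1 GB on each node, so it would fit on three nodes that each have, say, a 2 GB disk, even though no single disk could hold the whole file.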

c. Fault tolerance and High Availability of HDFS

Blocks are easy to replicate between DataNodes, which gives HDFS its fault tolerance and high availability.

d. Simple Storage mechanism for datanodes

HDFS blocks simplify storage on the DataNodes. The NameNode maintains the metadata of all the blocks, so a DataNode does not need to be concerned with block metadata such as file permissions.
