A data block in HDFS is a contiguous location on the hard drive where data is stored. In general, a file system stores data as a collection of blocks. In a similar way, HDFS splits each file into blocks and distributes them across the Hadoop cluster.
Block Size
Hadoop does not impose any rule that binds the user to a particular block size. The right choice usually depends on the input data. To maximize throughput for a very large input file, large blocks (for example 128 MB or even 256 MB) work best. For smaller files, on the other hand, a smaller block size is better.
In short: large blocks for large files, small blocks for small files. In industry, files come in many different sizes, and files with different block sizes can coexist on the same file system. This is handled by setting the "dfs.block.size" parameter when a file is written (the property is named dfs.blocksize in Hadoop 2 and later, where the old name is deprecated). Setting it overrides the default block size configured in hdfs-site.xml.
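As a rough illustration of a per-file override, the sketch below builds (but does not run) an `hdfs dfs -put` command that passes the block size as a generic `-D` option. The helper name `build_put_command` and the file paths are hypothetical, and the property is written with the newer name `dfs.blocksize`; on older releases the name used in the text, `dfs.block.size`, would take its place.

```python
import shlex


def build_put_command(local_path, hdfs_path, block_size_bytes):
    """Build the argument list for an upload with a per-file block size.

    Hypothetical helper for illustration only: it assembles the command
    and leaves running it to the caller.
    """
    if block_size_bytes <= 0:
        raise ValueError("block size must be positive")
    return [
        "hdfs", "dfs",
        "-D", f"dfs.blocksize={block_size_bytes}",  # overrides the hdfs-site.xml default for this write
        "-put", local_path, hdfs_path,
    ]


# Example paths are made up; 256 MB is expressed in bytes.
cmd = build_put_command("big_input.csv", "/data/big_input.csv", 256 * 1024 * 1024)
print(shlex.join(cmd))
```

The override applies only to files written by that command; files already in HDFS keep the block size they were written with.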
There is no exact formula for determining the data block size.
The total time to read data from a disk consists of 'seek time', the time to locate the first block of the file, and 'transfer time', the time it takes to read contiguous blocks of data. When the system handles hundreds of terabytes or petabytes of data, the time it takes to read from disk matters. Little can be done to reduce seek time. However, if the block size is large, a significant amount of data can be read per seek. This doesn't mean that the larger the block size, the better.
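The seek/transfer trade-off can be put into numbers with a back-of-the-envelope model. The sketch below assumes one seek per block, a 10 ms seek time, and a 100 MB/s transfer rate; these figures and the function name `read_time_seconds` are illustrative assumptions, not measurements from any particular cluster.

```python
MB = 1024 * 1024
GB = 1024 * MB


def read_time_seconds(file_bytes, block_bytes, seek_s=0.010, transfer_bytes_per_s=100e6):
    """Return (total seek time, total transfer time) for reading one file.

    Simplified model: one seek per block plus a sequential transfer of
    every byte. The default seek and transfer figures are assumptions.
    """
    blocks = -(-file_bytes // block_bytes)  # ceiling division
    seek_total = blocks * seek_s
    transfer_total = file_bytes / transfer_bytes_per_s
    return seek_total, transfer_total


# Compare the share of time spent seeking for a 10 GB file.
for block_mb in (4, 64, 128, 256):
    seek, transfer = read_time_seconds(10 * GB, block_mb * MB)
    share = seek / (seek + transfer)
    print(f"{block_mb:>4} MB blocks: seek {seek:6.2f} s, "
          f"transfer {transfer:6.2f} s, seek share {share:.1%}")
```

Under these assumptions the transfer time is the same for every block size, while the seek total shrinks as blocks grow, which is why the seek share becomes negligible well before the block size reaches hundreds of megabytes.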
Each block is processed by one Mapper, so if there are too few blocks, some nodes in the cluster may sit idle. One therefore needs to strike a balance. 128 MB, the default in Hadoop 2 and later, has been found to be a good compromise in practice. However, some applications may need a larger or smaller block size.
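The other side of the balance, cluster utilization, can be sketched the same way. The example assumes one map task per block and at most one task per node at a time, which is a simplification; the function names and the 2 GB file on a 20-node cluster are hypothetical.

```python
MB = 1024 * 1024
GB = 1024 * MB


def map_task_count(file_bytes, block_bytes):
    """Number of map tasks, assuming one Mapper per block."""
    return -(-file_bytes // block_bytes)  # ceiling division


def idle_nodes(file_bytes, block_bytes, node_count):
    """Nodes left without work, assuming one task per node at a time."""
    return max(0, node_count - map_task_count(file_bytes, block_bytes))


# A 2 GB input on a hypothetical 20-node cluster.
for block_mb in (64, 128, 256, 512):
    tasks = map_task_count(2 * GB, block_mb * MB)
    idle = idle_nodes(2 * GB, block_mb * MB, node_count=20)
    print(f"{block_mb:>4} MB blocks: {tasks:>3} map tasks, {idle:>2} idle nodes")
```

In this model, doubling the block size halves the number of map tasks, so a block size that is too large for the input leaves part of the cluster unused.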