Ideally, what should the block size be in a Hadoop cluster?

This topic has 2 replies, 1 voice, and was last updated 5 years, 7 months ago by DataFlair Team.

- September 20, 2018 at 11:26 am #4637
  
  DataFlair Team
  Spectator
  
  What should the block size be in a Hadoop cluster?
- September 20, 2018 at 11:26 am #4638
  
  DataFlair Team
  Spectator
  
  A data block in HDFS is a contiguous location on the hard drive where data is stored. In general, a file system stores data as a collection of blocks. In a similar way, HDFS splits each file into blocks and distributes them across the Hadoop cluster.
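
To make the idea of splitting a file into blocks concrete, here is a minimal sketch in Python. It only does the arithmetic and is not how HDFS itself is implemented; the function name `split_into_blocks` and the 300 MB example file are assumptions made for illustration.

```python
def split_into_blocks(file_size, block_size):
    """Return a list of (offset, length) pairs, one per block.

    Illustrative arithmetic only: every block is full-sized except
    possibly the last one, which holds whatever bytes remain.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


MB = 1024 * 1024

# Hypothetical 300 MB file stored with a 128 MB block size.
for index, (offset, length) in enumerate(split_into_blocks(300 * MB, 128 * MB)):
    print(f"block {index}: offset={offset // MB} MB, length={length // MB} MB")
```

With these numbers the file ends up as two full 128 MB blocks and one final 44 MB block.
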
  Block Size
  Hadoop does not bind the user to a particular block size. Usually, the right size depends on the input data. If you want to maximize throughput for a very large input file, very large blocks (128 MB or even 256 MB) work best. For smaller files, on the other hand, a smaller block size is better.
  
  So larger files call for larger blocks and smaller files for smaller blocks. In industry, files come in many different sizes, and files with different block sizes can coexist on the same file system. To handle this, set the "dfs.block.size" parameter (named "dfs.blocksize" in Hadoop 2 and later) when the file is written. It overrides the default block size configured in hdfs-site.xml.
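
As a rough sketch of a per-file override, the Python snippet below only builds the command line for uploading a file with a custom block size; it does not run anything. It assumes the `hdfs dfs` shell with the generic `-D` option and the newer property name `dfs.blocksize`. The helper `build_put_command` and both file paths are hypothetical.

```python
import shlex


def build_put_command(local_path, hdfs_path, block_size_bytes):
    """Build (but do not execute) an upload command with a per-file block size.

    Assumes the `hdfs dfs` shell and the `dfs.blocksize` property name;
    older releases used `dfs.block.size` instead.
    """
    if block_size_bytes <= 0:
        raise ValueError("block_size_bytes must be positive")
    return [
        "hdfs", "dfs",
        "-D", f"dfs.blocksize={block_size_bytes}",
        "-put", local_path, hdfs_path,
    ]


MB = 1024 * 1024

# Hypothetical paths: upload one large file with 256 MB blocks.
command = build_put_command("big_input.csv", "/data/big_input.csv", 256 * MB)
print(shlex.join(command))
```

Files already in the cluster keep the block size they were written with; the override applies only to the file being written by this command.
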
  
  Follow the link to learn more about Data Block in Hadoop
- September 20, 2018 at 11:27 am #4639
  
  DataFlair Team
  Spectator
  
  There is no single correct formula for determining the data block size.
  The total time to read data from a disk consists of the seek time, which is the time to locate the first block of the file, and the transfer time, which is the time to read the contiguous blocks of data. When the system handles hundreds of terabytes or petabytes of data, the time it takes to read from disk matters. Little can be done to reduce seek time. However, if the block size is large, a significant amount of data can be read in one seek. This does not mean that a larger block size is always better.
  Each block is processed by one Mapper, so with fewer blocks some nodes in the cluster may go unused. One therefore needs to strike a balance. 128 MB has been found to work well in general. However, some applications may need a larger or smaller block size.
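
The trade-off can be put into rough numbers. The sketch below is a back-of-the-envelope calculation, not a benchmark: the 10 ms seek time, 100 MB/s transfer rate, 10 GB file, and 40-node cluster with one Mapper per node at a time are all assumed values chosen for illustration.

```python
import math

# Assumed disk characteristics (illustrative, not measured).
SEEK_TIME_S = 0.010          # time to locate the start of a block
TRANSFER_RATE_MB_S = 100.0   # sustained sequential read rate

# Assumed workload and cluster (hypothetical).
FILE_SIZE_MB = 10 * 1024     # one 10 GB input file
NODES = 40                   # each node runs one Mapper at a time


def seek_overhead(block_size_mb):
    """Fraction of the per-block read time that is spent seeking."""
    transfer_time_s = block_size_mb / TRANSFER_RATE_MB_S
    return SEEK_TIME_S / (SEEK_TIME_S + transfer_time_s)


def mapper_count(file_size_mb, block_size_mb):
    """Number of blocks, and so Mappers, assuming one Mapper per block."""
    return math.ceil(file_size_mb / block_size_mb)


for block_size_mb in (1, 64, 128, 1024):
    mappers = mapper_count(FILE_SIZE_MB, block_size_mb)
    busy_nodes = min(mappers, NODES)
    print(
        f"{block_size_mb:>5} MB blocks: "
        f"seek overhead {seek_overhead(block_size_mb):.2%}, "
        f"{mappers} mappers, {busy_nodes}/{NODES} nodes busy"
    )
```

Under these assumptions, 1 MB blocks spend about half of each read seeking, while 1024 MB blocks make seeking negligible but yield only 10 Mappers for 40 nodes. A size around 128 MB keeps the seek overhead under 1% and still produces enough blocks to occupy every node.
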
  