Hadoop block size in production

    • #5642
      DataFlair Team
      Spectator

      On a 100-node Hadoop cluster with the following configuration, what should the block size ideally be?

      64 GB RAM
      16-core processor
      40 TB hard disk per node
    • #5643
      DataFlair Team
      Spectator

      The data block size depends on the amount of input data to be processed as well as the number of tasks that can run in parallel to process it.
      If the input data consists of large files, a block size of 128 MB or 256 MB can be used.
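
      As a rough illustration (a minimal sketch, not part of the original reply), the block size can be set cluster-wide through the dfs.blocksize property or overridden per file when the file is created. The output path and sizes below are hypothetical example values.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FSDataOutputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class BlockSizeExample {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              // Cluster-wide default block size of 256 MB (example value).
              conf.setLong("dfs.blocksize", 256L * 1024 * 1024);

              FileSystem fs = FileSystem.get(conf);

              // The block size can also be overridden per file at create time.
              // The output path here is a hypothetical example.
              Path out = new Path("/data/example/large-input.txt");
              short replication = 3;
              long blockSize = 128L * 1024 * 1024; // 128 MB for this file only
              int bufferSize = 4096;
              FSDataOutputStream stream =
                      fs.create(out, true, bufferSize, replication, blockSize);
              stream.writeUTF("sample record");
              stream.close();
          }
      }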

      Follow the link to learn more about Data Block in Hadoop

    • #5645
      DataFlair Team
      Spectator

      Ideally the data block size is 64 MB, 128 MB, or even 256 MB in some cases. It can be increased or decreased as per the requirement.
      The block size basically depends on the size of the original file: the larger the file, the larger the block size should be, so that the file is divided into fewer, larger blocks and processing is faster.
      The larger the block size, the more data is processed at one time. If we take a small block size for a very large file, we end up with a large number of small blocks, less data is processed at a time, and the whole job takes longer to complete.
      So, while deciding the block size, we can follow the principle:
      “A small number of large files is better than a large number of small files.”

      Now, for the given configuration:
      100 nodes
      64 GB RAM
      16-core processor
      40 TB hard disk per node

      If we take the example below:
      Suppose we have B blocks.
      Case 1: block size of 128 MB, with replication factor 3:
      Total storage required = B * 3 * 128 MB
      Available capacity = 100 nodes * 40 TB = 100 * 40 * 1024 * 1024 MB
      So B * 3 * 128 = 100 * 40 * 1024 * 1024, which gives B ≈ 10,922,667

      Case 2: block size of 256 MB, with replication factor 3:
      Total storage required = B * 3 * 256 MB
      Available capacity = 100 nodes * 40 TB = 100 * 40 * 1024 * 1024 MB
      So B * 3 * 256 = 100 * 40 * 1024 * 1024, which gives B ≈ 5,461,333

      The value of B in case 2 is half that of case 1, so the same amount of data is covered by fewer, larger blocks (less NameNode metadata and fewer map tasks, each doing more work); hence a block size of 256 MB is the better fit for this configuration.
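
      The same arithmetic can be reproduced with a short sketch (my own illustration, not part of the original answer); the capacity, replication, and block-size figures are taken from the two cases above.

      public class BlockCountEstimate {
          public static void main(String[] args) {
              long nodes = 100;
              long diskPerNodeMB = 40L * 1024 * 1024;       // 40 TB per node, in MB
              long totalCapacityMB = nodes * diskPerNodeMB; // raw capacity of the cluster
              int replication = 3;

              for (long blockSizeMB : new long[] {128, 256}) {
                  long blocks = totalCapacityMB / (replication * blockSizeMB);
                  System.out.printf("block size %d MB -> about %,d blocks%n",
                          blockSizeMB, blocks);
              }
              // Prints about 10,922,666 blocks for 128 MB and 5,461,333 for 256 MB
              // (integer division rounds down), matching the values of B above.
          }
      }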

      Follow the link to learn more about Data Block in Hadoop

    • #5647
      DataFlair Team
      Spectator

      Block size depends on the dataset size and the machine configuration available in your project.

      Ideally, we can go with the approach below:
      1. Start with 128 MB, observe, and then adjust according to the results. If you have small files, use 64 MB; if you have large files, use 128 MB or 256 MB.
      2. The larger the files, the larger the block size.
      3. Keep a minimum replication factor of 3.

      It is also good to merge small files, either using HAR (Hadoop Archives) or by rewriting them into bigger files.

      If we have a very small block size, there are more blocks per file, which increases the metadata stored in NameNode memory.
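
      As a rough illustration of that cost (my own sketch, using the commonly cited rule of thumb of about 150 bytes of NameNode heap per block object; the exact figure varies by Hadoop version), the block counts from the earlier reply translate into heap usage roughly as follows.

      public class NamenodeHeapEstimate {
          public static void main(String[] args) {
              // Rule-of-thumb cost per block object in NameNode heap (approximate).
              long bytesPerBlockObject = 150;

              long blocks128 = 10_922_667L; // B for 128 MB blocks, from the reply above
              long blocks256 = 5_461_333L;  // B for 256 MB blocks, from the reply above

              System.out.printf("128 MB blocks -> about %d MB of NameNode heap%n",
                      blocks128 * bytesPerBlockObject / (1024 * 1024));
              System.out.printf("256 MB blocks -> about %d MB of NameNode heap%n",
                      blocks256 * bytesPerBlockObject / (1024 * 1024));
              // Prints roughly 1562 MB and 781 MB respectively.
          }
      }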

      The block size also affects MapReduce jobs, since a mapper takes only a single split at a time (in the normal case, 1 split = 1 block). If the block size is very small, more reads and more map tasks are needed, which hurts performance; if the block size is very large, it limits the parallelism of the MR job.
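
      To make the split/block relation concrete, here is a minimal sketch (an illustration under assumed settings, not from the original post) using the standard FileInputFormat split-size knobs; the input path and sizes are hypothetical.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

      public class SplitSizeExample {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              Job job = Job.getInstance(conf, "split-size-demo");

              // With FileInputFormat the split size works out to
              // max(minSplitSize, min(maxSplitSize, blockSize)),
              // so by default one block becomes one split and one map task.
              FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
              FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);

              job.setInputFormatClass(TextInputFormat.class);
              // Hypothetical input directory.
              FileInputFormat.addInputPath(job, new Path("/data/example/input"));
          }
      }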
