Hadoop block size in production

    • #5642
      DataFlair Team
      Spectator

      On a 100-node Hadoop cluster with the following configuration, what should the block size ideally be?

      64 GB RAM
      16-core processor
      40 TB hard disk per node
    • #5643
      DataFlair Team
      Spectator

      The data block size depends on the amount of input data to be processed as well as the number of tasks that can run in parallel to process it.
      If the input data consists of large files, a block size of 128 MB or 256 MB can be used.
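
      As a rough illustration (a minimal sketch, not part of the original reply), the block size can be set cluster-wide through the dfs.blocksize property or overridden per file when the file is created. The output path and sizes below are hypothetical example values.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FSDataOutputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class BlockSizeExample {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              // Cluster-wide default block size of 256 MB (example value).
              conf.setLong("dfs.blocksize", 256L * 1024 * 1024);

              FileSystem fs = FileSystem.get(conf);

              // The block size can also be overridden per file at create time.
              // The output path here is a hypothetical example.
              Path out = new Path("/data/example/large-input.txt");
              short replication = 3;
              long blockSize = 128L * 1024 * 1024; // 128 MB for this file only
              int bufferSize = 4096;
              FSDataOutputStream stream =
                      fs.create(out, true, bufferSize, replication, blockSize);
              stream.writeUTF("sample record");
              stream.close();
          }
      }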

      Follow the link to learn more about Data Block in Hadoop

    • #5645
      DataFlair Team
      Spectator

      Ideally the data block size is 64 MB, 128 MB, or even 256 MB in some cases. It can be increased or decreased as per the requirement.
      The block size basically depends on the size of the original file: the larger the file, the larger the block size should be, so that the file is divided into fewer, larger blocks and processing is faster.
      The larger the block size, the more data is processed at one time. If we take a small block size for a very large file, we end up with a large number of small blocks, less data is processed at a time, and the whole job takes longer to complete.
      So, while deciding the block size, we can follow the principle:
      “A small number of large files is better than a large number of small files.”

      Now, for the given configuration:
      100 nodes
      64 GB RAM
      16-core processor
      40 TB hard disk per node

      If we take the example below:
      Suppose we have B blocks.
      Case 1: block size of 128 MB, with replication factor 3:
      Total storage required = B * 3 * 128 MB
      Available capacity = 100 nodes * 40 TB = 100 * 40 * 1024 * 1024 MB
      So B * 3 * 128 = 100 * 40 * 1024 * 1024, which gives B ≈ 10,922,667

      Case 2: block size of 256 MB, with replication factor 3:
      Total storage required = B * 3 * 256 MB
      Available capacity = 100 nodes * 40 TB = 100 * 40 * 1024 * 1024 MB
      So B * 3 * 256 = 100 * 40 * 1024 * 1024, which gives B ≈ 5,461,333

      The value of B in case 2 is half that of case 1, so the same amount of data is covered by fewer, larger blocks (less NameNode metadata and fewer map tasks, each doing more work); hence a block size of 256 MB is the better fit for this configuration.
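
      The same arithmetic can be reproduced with a short sketch (my own illustration, not part of the original answer); the capacity, replication, and block-size figures are taken from the two cases above.

      public class BlockCountEstimate {
          public static void main(String[] args) {
              long nodes = 100;
              long diskPerNodeMB = 40L * 1024 * 1024;       // 40 TB per node, in MB
              long totalCapacityMB = nodes * diskPerNodeMB; // raw capacity of the cluster
              int replication = 3;

              for (long blockSizeMB : new long[] {128, 256}) {
                  long blocks = totalCapacityMB / (replication * blockSizeMB);
                  System.out.printf("block size %d MB -> about %,d blocks%n",
                          blockSizeMB, blocks);
              }
              // Prints about 10,922,666 blocks for 128 MB and 5,461,333 for 256 MB
              // (integer division rounds down), matching the values of B above.
          }
      }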

      Follow the link to learn more about Data Block in Hadoop

    • #5647
      DataFlair Team
      Spectator

      Block size depends on the dataset size and the machine configuration available in your project.

      Ideally, we can go with the approach below:
      1. Start with 128 MB, observe, and then adjust according to the results. If you have small files, use 64 MB; if you have large files, use 128 MB or 256 MB.
      2. The larger the files, the larger the block size.
      3. Keep a minimum replication factor of 3.

      It is also good to merge small files, either using HAR (Hadoop Archives) or by rewriting them into bigger files.

      If we have a very small block size, there are more blocks per file, which increases the metadata stored in NameNode memory.
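
      As a rough illustration of that cost (my own sketch, using the commonly cited rule of thumb of about 150 bytes of NameNode heap per block object; the exact figure varies by Hadoop version), the block counts from the earlier reply translate into heap usage roughly as follows.

      public class NamenodeHeapEstimate {
          public static void main(String[] args) {
              // Rule-of-thumb cost per block object in NameNode heap (approximate).
              long bytesPerBlockObject = 150;

              long blocks128 = 10_922_667L; // B for 128 MB blocks, from the reply above
              long blocks256 = 5_461_333L;  // B for 256 MB blocks, from the reply above

              System.out.printf("128 MB blocks -> about %d MB of NameNode heap%n",
                      blocks128 * bytesPerBlockObject / (1024 * 1024));
              System.out.printf("256 MB blocks -> about %d MB of NameNode heap%n",
                      blocks256 * bytesPerBlockObject / (1024 * 1024));
              // Prints roughly 1562 MB and 781 MB respectively.
          }
      }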

      The block size also affects MapReduce jobs, since a mapper takes only a single split at a time (in the normal case, 1 split = 1 block). If the block size is very small, more reads and more map tasks are needed, which hurts performance; if the block size is very large, it limits the parallelism of the MR job.
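
      To make the split/block relation concrete, here is a minimal sketch (an illustration under assumed settings, not from the original post) using the standard FileInputFormat split-size knobs; the input path and sizes are hypothetical.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

      public class SplitSizeExample {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              Job job = Job.getInstance(conf, "split-size-demo");

              // With FileInputFormat the split size works out to
              // max(minSplitSize, min(maxSplitSize, blockSize)),
              // so by default one block becomes one split and one map task.
              FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
              FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);

              job.setInputFormatClass(TextInputFormat.class);
              // Hypothetical input directory.
              FileInputFormat.addInputPath(job, new Path("/data/example/input"));
          }
      }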
