What should be the Block size to get maximum performance from Hadoop cluster?


Viewing 4 reply threads
  • Author
    Posts
    • #6358
      DataFlair Team
      Spectator

      Ideally, what should be the block size to get maximum performance from a Hadoop cluster?
      What are the effects when I increase/decrease the block size?
      What factors should I consider when I change the block size?
      Is there any rule of thumb to start with?

    • #6360
      DataFlair Team
      Spectator

      Block
      A block is a contiguous location on the hard drive where data is stored. In general, a filesystem stores data as a collection of blocks. In a similar way, HDFS stores each file as blocks and distributes them across the Hadoop cluster.
      Block Size
      Hadoop does not impose a rule that binds the user to a particular block size. Usually, it depends on the input data. If you want to maximize throughput for a very large input file, using very large blocks (perhaps 128 MB or even 256 MB) is best. On the other hand, for smaller files, using a smaller block size is better.

      So we are talking about large blocks for large files and small blocks for small files. In industry we get files of different sizes, and we can have files with different block sizes on the same filesystem. To handle that situation, the "dfs.block.size" parameter (dfs.blocksize in Hadoop 2.x) can be set when the file is written; it overrides the default block size configured in hdfs-site.xml.
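
      As a rough sketch of how this can be done through the Java FileSystem API (the class name, file path, and sizes below are illustrative, not from the original post), a per-file block size is passed at create time:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FSDataOutputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class WriteWithCustomBlockSize {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);

          long blockSize = 256L * 1024 * 1024;   // 256 MB for this one large file (illustrative)
          int bufferSize = 4096;                 // I/O buffer size in bytes
          short replication = fs.getDefaultReplication(new Path("/"));

          // create(path, overwrite, bufferSize, replication, blockSize) overrides the
          // cluster-wide default block size for this file only
          try (FSDataOutputStream out = fs.create(
              new Path("/data/large-input.txt"), true, bufferSize, replication, blockSize)) {
            out.writeBytes("sample record\n");
          }
        }
      }

      The same kind of per-file override is also possible from the command line by passing the block-size property as a generic -D option when copying a file into HDFS.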

      There are a few points to keep in mind when deciding on the block size; each option has its own pros and cons:

      • Most obviously, a file will have fewer blocks if the block size is larger. This can make it possible for a client to read/write more data without interacting with the Namenode, which saves time.
      • Larger blocks also reduce the amount of metadata the Namenode has to keep, reducing Namenode load (a rough illustration follows this list).
      • With fewer blocks, the file may be stored on fewer nodes in total; this can reduce total throughput for parallel access.
      • Fewer, larger blocks also mean longer-running tasks, which in turn may not achieve maximum parallelism.
      • Also, if a failure occurs while a larger block is being processed, more work has to be redone.
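
      As a rough illustration (the file size is only an example): a 1 GB file stored with 64 MB blocks occupies 16 blocks, but only 8 blocks at 128 MB, so the Namenode tracks half as many block entries and the client needs fewer block-location lookups to read the whole file.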

      To learn more about the Block follow: Data Blocks

    • #6362
      DataFlair Team
      Spectator

      Hadoop Distributed File System (HDFS) is a highly reliable storage system. It is the filesystem of Hadoop, designed for storing very large files on a cluster of commodity hardware.

      • In Hadoop, HDFS splits huge files into chunks known as blocks; a block is the smallest unit of data in the filesystem. We (client and admin) do not have any control over the blocks, such as block location; the Namenode decides all such things.
      • The default block size for newly created files is 64 MB (in Hadoop 1.x), but it is generally recommended to start at 128 MB instead.
      • The block size affects sequential read and write sizes; it also has a direct impact on map task performance because, by default, input splits are calculated from block boundaries.
      • Each task involves a JVM start-up, assignment, and scheduling. In other words, if you have a small block size, each map task has very little to do: the framework schedules the task, finds a machine, starts the JVM, processes a very small amount of data, and exits. Each task finishes faster, but many more tasks get added.
      • So, to strike a balance between the number of tasks and the amount of data each task processes, while still getting the benefits of parallelism, a 128 MB block size is recommended as a starting point.
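
      For example (the input size is chosen only for illustration): a 1 TB input stored in 64 MB blocks produces 16,384 blocks, and hence roughly 16,384 map tasks with the default split calculation; at 128 MB the same input produces about 8,192 tasks, halving the scheduling and JVM start-up overhead while each task still processes a reasonable amount of data.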

      When the block size is smaller:

      • The more tasks you get, the more scheduling activity occurs.
      • You don't want jobs with hundreds of thousands of tasks.

      Follow the link to learn more about HDFS and HDFS Blocks

    • #6363
      DataFlair Team
      Spectator

      The default block size was changed from 64 MB to 128 MB in Hadoop 2.x.
      We should understand the impact a lower or higher block size has on performance, and then decide the block size accordingly.

      • When the block size is small, seek overhead increases: the data is divided into a larger number of blocks, and more blocks mean more seeks to read/write the data. A large number of blocks also increases overhead on the Namenode, since it requires more memory to store the metadata.
      • When the block size is larger, parallel processing takes a hit, and the overall job can take a very long time because the data in a single block may take a long time to process.

      So we should choose a moderate block size of 128 MB, analyze and observe the performance of the cluster, and then increase or decrease the block size depending on our observations.

      For more detail follow: Data Blocks in Hadoop

    • #6364
      DataFlair Team
      Spectator

      As we know, the default block size in Hadoop is 64 MB (1.x) or 128 MB (2.x). We can change it in hdfs-site.xml by setting the dfs.blocksize property.
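
      For example, a minimal hdfs-site.xml entry for a 128 MB default (the value is expressed in bytes; 128 MB is only an example):

      <property>
        <name>dfs.blocksize</name>
        <value>134217728</value> <!-- 128 MB = 128 * 1024 * 1024 bytes -->
      </property>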

      So let's list the factors that decide the block size:
      1. Size of the input files.
      2. Number of nodes (size of the cluster).
      3. Map task performance.
      4. Namenode memory management.

      Generally the recommended size is 128 MB, as it is a moderate one.

      Now let's consider what happens if the block size is smaller:

      1) Too small a block size means too many splits, which generates too many tasks, possibly beyond the cluster's capacity.

      2) It increases the Namenode's metadata storage needs, since the Namenode keeps the metadata of each object (roughly 150 bytes each) in memory; see the worked example after this list.

      3) The small-files problem.
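
      As a rough back-of-the-envelope illustration (the figures are only an example, based on the ~150 bytes per object mentioned above): storing 1 GB of data as 1,024 files of 1 MB each creates 1,024 file objects plus 1,024 block objects, i.e. about 2,048 objects × 150 bytes ≈ 300 KB of Namenode memory; storing the same 1 GB as one file with 128 MB blocks creates 1 file object plus 8 block objects, i.e. only about 1.35 KB.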

      And what if the block size is larger:

      1) Fewer blocks mean fewer splits and hence fewer map tasks, so the cluster may be under-utilised and the parallel, distributed nature of Hadoop is not fully exploited.

      2) Less parallel processing.

      To learn more about the Block follow: Data Blocks
