In HDFS, what should be the ideal block size to get the best performance?

    • #5747
      DataFlair Team
      Spectator

      We know that when we copy any data into Hadoop, it is divided into multiple blocks, and the block size is configurable.
      What should be the ideal block size to get the best performance from Hadoop (from HDFS as well as MapReduce)?

    • #5749
      DataFlair Team
      Spectator

      There is no hard and fast rule to keep 128 MB (the default), 64 MB, or 256 MB. The choice depends on the data set being handled.

      For a medium-sized data set (“medium data” refers to datasets that are too large to fit on a single machine but do not require enormous clusters of thousands of them: high terabytes, not petabytes), the recommended block size is the 128 MB default.
      For a large data set, i.e. a huge amount of data requiring an enormous cluster of thousands of machines, a block size of 256 MB is recommended.

      Ultimately, this setting depends on the point at which your Hadoop cluster gives the best performance for your workload.
      HDFS blocks are large by default so that the time spent transferring a block is much larger than the time spent seeking to its start; a large file consisting of many blocks can then be read at close to the disk transfer rate, with seek time contributing very little (see the rough calculation below).
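
      As a back-of-the-envelope sketch (the 10 ms seek time and 100 MB/s transfer rate below are illustrative assumptions, not measurements from any particular cluster), you can estimate a block size that keeps seek overhead to about 1% of the transfer time:

        public class BlockSizeEstimate {
            public static void main(String[] args) {
                // Illustrative assumptions: ~10 ms average seek time and
                // ~100 MB/s sustained disk transfer rate.
                double seekTimeMs = 10.0;
                double transferRateMBperSec = 100.0;

                // For the seek to cost only ~1% of the time spent transferring
                // a block, the transfer must take ~100x the seek time:
                // 10 ms * 100 = 1 s, i.e. about 100 MB per block.
                double targetSeekFraction = 0.01;
                double transferTimeSec = (seekTimeMs / 1000.0) / targetSeekFraction;
                double blockSizeMB = transferTimeSec * transferRateMBperSec;

                System.out.printf("Suggested block size: ~%.0f MB%n", blockSizeMB);
            }
        }

      This works out to roughly 100 MB, which is why 128 MB (the nearest power of two) is a common default.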

      Follow the link to learn more about: Data Blocks in Hadoop

    • #5750
      DataFlair Team
      Spectator

      The default block size in Hadoop is 64 MB (Hadoop 1.x) or 128 MB (Hadoop 2.x and later), but it is configurable in hdfs-site.xml via the dfs.blocksize property, as sketched below.
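
      A minimal sketch of overriding the default through the client Configuration API (the 256 MB value is only an example); the same value can instead be set as the dfs.blocksize property in hdfs-site.xml:

        import org.apache.hadoop.conf.Configuration;

        public class BlockSizeConfig {
            public static void main(String[] args) {
                Configuration conf = new Configuration();

                // Equivalent to setting the dfs.blocksize property in
                // hdfs-site.xml; here we ask for 256 MB instead of the
                // 128 MB default. Only files written after the change are
                // affected; existing files keep their original blocks.
                conf.setLong("dfs.blocksize", 256L * 1024 * 1024);

                System.out.println("dfs.blocksize = "
                        + conf.getLong("dfs.blocksize", -1) + " bytes");
            }
        }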

      The ideal block size depends on several factors: cluster size, average input file size, and the map task capacity of the cluster.
      The general recommendation, though, is to start with a block size of 128 MB.

      If the block size is too small, several issues arise:
      1) Too small a block size results in a large number of unnecessary splits, and therefore a large number of tasks, which may exceed the capacity of the cluster.

      2) A small block size is also a problem for the NameNode, since it keeps the metadata of every block in memory (with a persistent copy on disk as well). Each block occupies roughly 150 bytes of memory, so with very small blocks the NameNode can run out of memory.

      If the block size is too large, there are other issues:
      1) The cluster would be underutilized: with a large block size there are fewer splits and hence fewer map tasks, which slows down the job.
      2) A large block size therefore decreases parallelism (the sketch below illustrates both effects).
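
      To make both failure modes concrete, here is a small back-of-the-envelope sketch (the 100 TB data set is hypothetical, and the 150-bytes-per-block figure is the rough estimate mentioned above, not an exact measurement):

        public class BlockCountSketch {
            // Rough block count and NameNode metadata memory for a data set,
            // using the ~150 bytes-per-block estimate quoted above.
            static void report(long dataSetBytes, long blockSizeBytes) {
                long blocks = (dataSetBytes + blockSizeBytes - 1) / blockSizeBytes; // ceiling division
                long metadataBytes = blocks * 150L;
                System.out.printf("block size %4d MB -> %,12d blocks (~map tasks), ~%,d MB of NameNode heap%n",
                        blockSizeBytes / (1024 * 1024), blocks, metadataBytes / (1024 * 1024));
            }

            public static void main(String[] args) {
                long oneTB = 1024L * 1024 * 1024 * 1024;
                long dataSet = 100 * oneTB;            // hypothetical 100 TB data set

                report(dataSet, 4L * 1024 * 1024);     // tiny blocks: millions of tasks, heavy metadata
                report(dataSet, 128L * 1024 * 1024);   // default
                report(dataSet, 1024L * 1024 * 1024);  // very large blocks: few splits, less parallelism
            }
        }

      With 4 MB blocks the 100 TB data set needs tens of millions of blocks and several GB of NameNode heap just for metadata, while with 1 GB blocks it yields only around a hundred thousand splits, limiting how many map tasks can run in parallel.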

      Follow the link to learn more about: Data Blocks in Hadoop

    • #5751
      DataFlair Team
      Spectator

      It depends on the following factors:
      1. Size of the data
      2. Configuration of the machines
      The default block size is 64 MB (128 MB in Hadoop 2.x and later), but 128 MB is generally recommended.

      If you have a very small block size, each map task has very little to do; individual map task run times get shorter, while the overhead of scheduling and running many more tasks gets higher.

    • #5753
      DataFlair Team
      Spectator

      The block size varies by use case. If the data set is very large, the block size should ideally be larger, i.e. 128 MB or 256 MB; since one mapper processes one block, fewer and larger blocks improve performance for big files.
      If the source data is small, a block size of 64 MB is preferable: a large number of small files already hurts performance, so it is better to keep the block size small in that case.
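
      The block size can also be chosen per file at write time rather than cluster-wide. A minimal sketch using the FileSystem.create overload that takes an explicit block size (the path, replication factor, and sizes here are only examples):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class PerFileBlockSize {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(conf);

                Path out = new Path("/data/example/large-input.dat"); // example path
                long blockSize = 256L * 1024 * 1024;                  // 256 MB for a large file
                int bufferSize = 4096;
                short replication = 3;

                // The create() overload with an explicit blockSize lets a single
                // large file use bigger blocks than the cluster default.
                try (FSDataOutputStream outStream =
                         fs.create(out, true, bufferSize, replication, blockSize)) {
                    outStream.writeUTF("example payload");
                }
            }
        }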

      Follow the link to learn more about: Data Blocks in Hadoop
