In HDFS, what should be the ideal block size to get the best performance?

    • #5747
      DataFlair Team
      Spectator

      We know that when we copy any data into Hadoop, it is divided into multiple blocks, and the block size is configurable.
      What should be the ideal block size to get the best performance from Hadoop (from HDFS as well as MapReduce)?

    • #5749
      DataFlair Team
      Spectator

      There is no hard and fast rule to keep 128 MB (the default), 64 MB, or 256 MB. The choice depends on the data set being handled.

      For a medium-sized data set (“medium data” refers to datasets that are too large to fit on a single machine but do not require enormous clusters of thousands of them: high terabytes, not petabytes), the recommended block size is the 128 MB default.
      For a large data set, i.e. a huge amount of data requiring an enormous cluster of thousands of machines, a block size of 256 MB is recommended.

      Ultimately, this setting depends on the point at which your Hadoop cluster gives the best performance for your workload.
      HDFS blocks are large by default so that the time spent transferring a block is much larger than the time spent seeking to its start; a large file consisting of many blocks can then be read at close to the disk transfer rate, with seek time contributing very little (see the rough calculation below).
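
      As a back-of-the-envelope sketch (the 10 ms seek time and 100 MB/s transfer rate below are illustrative assumptions, not measurements from any particular cluster), you can estimate a block size that keeps seek overhead to about 1% of the transfer time:

        public class BlockSizeEstimate {
            public static void main(String[] args) {
                // Illustrative assumptions: ~10 ms average seek time and
                // ~100 MB/s sustained disk transfer rate.
                double seekTimeMs = 10.0;
                double transferRateMBperSec = 100.0;

                // For the seek to cost only ~1% of the time spent transferring
                // a block, the transfer must take ~100x the seek time:
                // 10 ms * 100 = 1 s, i.e. about 100 MB per block.
                double targetSeekFraction = 0.01;
                double transferTimeSec = (seekTimeMs / 1000.0) / targetSeekFraction;
                double blockSizeMB = transferTimeSec * transferRateMBperSec;

                System.out.printf("Suggested block size: ~%.0f MB%n", blockSizeMB);
            }
        }

      This works out to roughly 100 MB, which is why 128 MB (the nearest power of two) is a common default.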

      Follow the link to learn more about: Data Blocks in Hadoop

    • #5750
      DataFlair Team
      Spectator

      The default block size in Hadoop is 64 MB (Hadoop 1.x) or 128 MB (Hadoop 2.x and later), but it is configurable in hdfs-site.xml via the dfs.blocksize property, as sketched below.
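
      A minimal sketch of overriding the default through the client Configuration API (the 256 MB value is only an example); the same value can instead be set as the dfs.blocksize property in hdfs-site.xml:

        import org.apache.hadoop.conf.Configuration;

        public class BlockSizeConfig {
            public static void main(String[] args) {
                Configuration conf = new Configuration();

                // Equivalent to setting the dfs.blocksize property in
                // hdfs-site.xml; here we ask for 256 MB instead of the
                // 128 MB default. Only files written after the change are
                // affected; existing files keep their original blocks.
                conf.setLong("dfs.blocksize", 256L * 1024 * 1024);

                System.out.println("dfs.blocksize = "
                        + conf.getLong("dfs.blocksize", -1) + " bytes");
            }
        }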

      The ideal block size depends on several factors: cluster size, average input file size, and the map task capacity of the cluster.
      The general recommendation, though, is to start with a block size of 128 MB.

      If the block size is too small, several issues arise:
      1) Too small a block size results in a large number of unnecessary splits, and therefore a large number of tasks, which may exceed the capacity of the cluster.

      2) A small block size is also a problem for the NameNode, since it keeps the metadata of every block in memory (with a persistent copy on disk as well). Each block occupies roughly 150 bytes of memory, so with very small blocks the NameNode can run out of memory.

      If the block size is too large, there are other issues:
      1) The cluster would be underutilized: with a large block size there are fewer splits and hence fewer map tasks, which slows down the job.
      2) A large block size therefore decreases parallelism (the sketch below illustrates both effects).
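
      To make both failure modes concrete, here is a small back-of-the-envelope sketch (the 100 TB data set is hypothetical, and the 150-bytes-per-block figure is the rough estimate mentioned above, not an exact measurement):

        public class BlockCountSketch {
            // Rough block count and NameNode metadata memory for a data set,
            // using the ~150 bytes-per-block estimate quoted above.
            static void report(long dataSetBytes, long blockSizeBytes) {
                long blocks = (dataSetBytes + blockSizeBytes - 1) / blockSizeBytes; // ceiling division
                long metadataBytes = blocks * 150L;
                System.out.printf("block size %4d MB -> %,12d blocks (~map tasks), ~%,d MB of NameNode heap%n",
                        blockSizeBytes / (1024 * 1024), blocks, metadataBytes / (1024 * 1024));
            }

            public static void main(String[] args) {
                long oneTB = 1024L * 1024 * 1024 * 1024;
                long dataSet = 100 * oneTB;            // hypothetical 100 TB data set

                report(dataSet, 4L * 1024 * 1024);     // tiny blocks: millions of tasks, heavy metadata
                report(dataSet, 128L * 1024 * 1024);   // default
                report(dataSet, 1024L * 1024 * 1024);  // very large blocks: few splits, less parallelism
            }
        }

      With 4 MB blocks the 100 TB data set needs tens of millions of blocks and several GB of NameNode heap just for metadata, while with 1 GB blocks it yields only around a hundred thousand splits, limiting how many map tasks can run in parallel.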

      Follow the link to learn more about: Data Blocks in Hadoop

    • #5751
      DataFlair Team
      Spectator

      It depends on the following factors:
      1. Size of the data
      2. Configuration of the machines
      The default block size is 64 MB (128 MB in Hadoop 2.x and later), but 128 MB is generally recommended.

      If you have a very small block size, each map task has very little to do; individual map task run times get shorter, while the overhead of scheduling and running many more tasks gets higher.

    • #5753
      DataFlair Team
      Spectator

      The block size varies by use case. If the data set is very large, the block size should ideally be larger, i.e. 128 MB or 256 MB; since one mapper processes one block, fewer and larger blocks improve performance for big files.
      If the source data is small, a block size of 64 MB is preferable: a large number of small files already hurts performance, so it is better to keep the block size small in that case.
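
      The block size can also be chosen per file at write time rather than cluster-wide. A minimal sketch using the FileSystem.create overload that takes an explicit block size (the path, replication factor, and sizes here are only examples):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class PerFileBlockSize {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(conf);

                Path out = new Path("/data/example/large-input.dat"); // example path
                long blockSize = 256L * 1024 * 1024;                  // 256 MB for a large file
                int bufferSize = 4096;
                short replication = 3;

                // The create() overload with an explicit blockSize lets a single
                // large file use bigger blocks than the cluster default.
                try (FSDataOutputStream outStream =
                         fs.create(out, true, bufferSize, replication, blockSize)) {
                    outStream.writeUTF("example payload");
                }
            }
        }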

      Follow the link to learn more about: Data Blocks in Hadoop
