How are blocks distributed among all data nodes for a particular chunk of data?


    • #6098
      DataFlair Team
      Spectator

      On what factors are blocks distributed in a Hadoop cluster? When we write a file to HDFS it is divided into smaller blocks, which are then stored in a distributed manner. Suppose a file is divided into blocks B1, B2, B3, B4. How will these blocks be distributed across the cluster?

    • #6100
      DataFlair Team
      Spectator

      How blocks are stored in the Hadoop cluster depends on the following two configuration settings:

      1) The current block size configuration.
      2) The current replication factor configuration.

      1) Current block size configuration.

      We can change or set the block size by editing hdfs-site.xml and changing the value of the dfs.block.size property.

      Sample: here the block size is set to 128 MB (134217728 bytes).

      <property>
      <name>dfs.block.size</name>
      <value>134217728</value>
      <description>Block size</description>
      </property>

      2) Current Replication factor configuration.

      We can change or set the replication factor by editing hdfs-site.xml and changing the value of the dfs.replication property.

      Sample: here it is set to 3 replicas per block.

      <property>
      <name>dfs.replication</name>
      <value>3</value>
      </property>

      A file is split into blocks of the configured size, and only the last block may differ (it is typically smaller than the configured block size). Each block is then stored with the configured number of replicas in the Hadoop cluster.

      For example:

      Suppose we need to store a 300 MB file in HDFS. With the configuration above, it will be split and replicated as follows:

      Block (MB) * Replicas
      128 * 3
      128 * 3
      44 * 3
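
      As a rough worked illustration of the split above, here is a minimal Java sketch (the class name BlockSplitDemo is made up for this example and is not part of Hadoop) that divides a file size into HDFS-style blocks and prints each block with its replica count:

      // Minimal sketch: split a 300 MB file into blocks the way HDFS would,
      // assuming dfs.block.size = 128 MB and dfs.replication = 3.
      public class BlockSplitDemo {
          public static void main(String[] args) {
              long fileSizeMb = 300;   // size of the file to store
              long blockSizeMb = 128;  // dfs.block.size expressed in MB
              int replication = 3;     // dfs.replication

              long remaining = fileSizeMb;
              System.out.println("Block (MB) * Replicas");
              while (remaining > 0) {
                  // The last block is simply whatever is left over (here 44 MB).
                  long block = Math.min(blockSizeMb, remaining);
                  System.out.println(block + " * " + replication);
                  remaining -= block;
              }
          }
      }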

      Note: We cannot choose which replica goes to which node; block placement is handled entirely by the Hadoop framework.

    • #6101
      DataFlair Team
      Spectator

      In HDFS, files are stored in the form of blocks (a physical division of data), and the block size is defined in hdfs-site.xml (typically 64 MB in older versions; 128 MB is the default in Hadoop 2 and later). When a file is written to HDFS, it is divided into blocks of 64/128/256 MB, depending on the requirement defined in hdfs-site.xml, and these blocks are stored on different DataNodes.

      Once a block is written to a DataNode, replication takes place and replicas of the block are saved on other DataNodes to ensure data availability in case of failures. How many replicas are created depends on the replication factor defined in hdfs-site.xml (usually 3).

      The replicas are saved on different DataNodes, and in a rack-aware setup at least one copy of each block is saved on a different rack.
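
      If you want to verify where the blocks of a particular file actually landed, the HDFS Java client exposes this through FileSystem.getFileBlockLocations(). Below is a minimal sketch; the path /user/data/sample.txt is only a placeholder, and the connection settings are assumed to come from the cluster's default configuration files. The same information can also be printed from the command line with hdfs fsck <path> -files -blocks -locations.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.BlockLocation;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class BlockLocationsDemo {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();  // reads core-site.xml / hdfs-site.xml
              FileSystem fs = FileSystem.get(conf);

              Path file = new Path("/user/data/sample.txt");  // placeholder path
              FileStatus status = fs.getFileStatus(file);

              // One BlockLocation per block; getHosts() lists the DataNodes
              // that hold the replicas of that block.
              BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
              for (BlockLocation block : blocks) {
                  System.out.println("offset=" + block.getOffset()
                          + " length=" + block.getLength()
                          + " hosts=" + String.join(",", block.getHosts()));
              }
              fs.close();
          }
      }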

      Storage consumption:

      Assume the block size is 64 MB, the file size is 780 MB, and the replication factor is 3.

      Number of blocks: 780 / 64 = 13 (the last block holds only 12 MB; HDFS does not pad it to a full 64 MB, so no space is wasted).

      Total storage for the blocks and their replicas = (64 * 12 * 3) + (12 * 3) = 2340 MB.
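
      The same arithmetic can be written as a small Java sketch (the class name StorageDemo is made up for illustration):

      // Minimal sketch of the storage calculation above.
      public class StorageDemo {
          public static void main(String[] args) {
              long fileMb = 780, blockMb = 64;
              int replication = 3;

              long fullBlocks = fileMb / blockMb;   // 12 full 64 MB blocks
              long lastBlockMb = fileMb % blockMb;  // 12 MB left for the final block
              long numBlocks = fullBlocks + (lastBlockMb > 0 ? 1 : 0);  // 13 blocks

              long totalMb = (fullBlocks * blockMb + lastBlockMb) * replication;
              System.out.println(numBlocks + " blocks, " + totalMb + " MB stored");  // 13 blocks, 2340 MB
          }
      }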

      Though this consumes more space, it ensures that the data is protected and remains available even if a node goes down, which is what makes HDFS such a reliable file system.
