How are blocks distributed among all data nodes for a particular chunk of data?


    • #6098
      DataFlair Team
      Spectator

      On what factors are blocks distributed in a Hadoop cluster? When we write a file to HDFS it is divided into smaller blocks, which are then stored in a distributed manner. Suppose a file is divided into blocks B1, B2, B3, B4. How will these blocks be distributed across the cluster?

    • #6100
      DataFlair Team
      Spectator

      How blocks are stored in the Hadoop cluster depends on the following two configuration settings:

      1) The current block size configuration.
      2) The current replication factor configuration.

      1) Current block size configuration.

      We can change or set the block size by editing hdfs-site.xml and changing the value of the dfs.block.size property.

      Sample: here the block size is set to 128 MB (134217728 bytes).

      <property>
      <name>dfs.block.size</name>
      <value>134217728</value>
      <description>Block size</description>
      </property>

      2) Current Replication factor configuration.

      We can change or set the replication factor by editing hdfs-site.xml and changing the value of the dfs.replication property.

      Sample: here it is set to 3 replicas per block.

      <property>
      <name>dfs.replication</name>
      <value>3</value>
      </property>

      A file is split into blocks of the configured size, and only the last block may differ (it is typically smaller than the configured block size). Each block is then stored with the configured number of replicas in the Hadoop cluster.

      For example:

      Suppose we need to store a 300 MB file in HDFS. With the configuration above, it will be split and replicated as follows:

      Block (MB) * Replicas
      128 * 3
      128 * 3
      44 * 3
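
      As a rough worked illustration of the split above, here is a minimal Java sketch (the class name BlockSplitDemo is made up for this example and is not part of Hadoop) that divides a file size into HDFS-style blocks and prints each block with its replica count:

      // Minimal sketch: split a 300 MB file into blocks the way HDFS would,
      // assuming dfs.block.size = 128 MB and dfs.replication = 3.
      public class BlockSplitDemo {
          public static void main(String[] args) {
              long fileSizeMb = 300;   // size of the file to store
              long blockSizeMb = 128;  // dfs.block.size expressed in MB
              int replication = 3;     // dfs.replication

              long remaining = fileSizeMb;
              System.out.println("Block (MB) * Replicas");
              while (remaining > 0) {
                  // The last block is simply whatever is left over (here 44 MB).
                  long block = Math.min(blockSizeMb, remaining);
                  System.out.println(block + " * " + replication);
                  remaining -= block;
              }
          }
      }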

      Note: We cannot choose which replica goes to which node; block placement is handled entirely by the Hadoop framework.

    • #6101
      DataFlair Team
      Spectator

      In HDFS, files are stored in the form of blocks (a physical division of data), and the block size is defined in hdfs-site.xml (typically 64 MB in older versions; 128 MB is the default in Hadoop 2 and later). When a file is written to HDFS, it is divided into blocks of 64/128/256 MB, depending on the requirement defined in hdfs-site.xml, and these blocks are stored on different DataNodes.

      Once a block is written to a DataNode, replication takes place and replicas of the block are saved on other DataNodes to ensure data availability in case of failures. How many replicas are created depends on the replication factor defined in hdfs-site.xml (usually 3).

      The replicas are saved on different DataNodes, and in a rack-aware setup at least one copy of each block is saved on a different rack.
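
      If you want to verify where the blocks of a particular file actually landed, the HDFS Java client exposes this through FileSystem.getFileBlockLocations(). Below is a minimal sketch; the path /user/data/sample.txt is only a placeholder, and the connection settings are assumed to come from the cluster's default configuration files. The same information can also be printed from the command line with hdfs fsck <path> -files -blocks -locations.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.BlockLocation;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class BlockLocationsDemo {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();  // reads core-site.xml / hdfs-site.xml
              FileSystem fs = FileSystem.get(conf);

              Path file = new Path("/user/data/sample.txt");  // placeholder path
              FileStatus status = fs.getFileStatus(file);

              // One BlockLocation per block; getHosts() lists the DataNodes
              // that hold the replicas of that block.
              BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
              for (BlockLocation block : blocks) {
                  System.out.println("offset=" + block.getOffset()
                          + " length=" + block.getLength()
                          + " hosts=" + String.join(",", block.getHosts()));
              }
              fs.close();
          }
      }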

      Storage consumption:

      Assume the block size is 64 MB, the file size is 780 MB, and the replication factor is 3.

      Number of blocks: 780 / 64 = 13 (the last block holds only 12 MB; HDFS does not pad it to a full 64 MB, so no space is wasted).

      Total storage for the blocks and their replicas = (64 * 12 * 3) + (12 * 3) = 2340 MB.
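
      The same arithmetic can be written as a small Java sketch (the class name StorageDemo is made up for illustration):

      // Minimal sketch of the storage calculation above.
      public class StorageDemo {
          public static void main(String[] args) {
              long fileMb = 780, blockMb = 64;
              int replication = 3;

              long fullBlocks = fileMb / blockMb;   // 12 full 64 MB blocks
              long lastBlockMb = fileMb % blockMb;  // 12 MB left for the final block
              long numBlocks = fullBlocks + (lastBlockMb > 0 ? 1 : 0);  // 13 blocks

              long totalMb = (fullBlocks * blockMb + lastBlockMb) * replication;
              System.out.println(numBlocks + " blocks, " + totalMb + " MB stored");  // 13 blocks, 2340 MB
          }
      }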

      Though this consumes more space, it ensures that the data is protected and remains available even if a node goes down, which is what makes HDFS such a reliable file system.
