HDFS Federation in Hadoop – Architecture and Benefits
1. HDFS Federation – Objective
This blog will take you through HDFS Federation in Hadoop. We start with an introduction to HDFS Federation and the motivation behind it. We then look at the current HDFS architecture and its limitations, the HDFS Federation architecture that overcomes them, and the advantages of Hadoop Federation.
2. What is Hadoop Federation?
The Hadoop Distributed File System (HDFS) is the file system of Hadoop. It is a highly reliable storage system designed for storing very large files.
HDFS architecture follows a master/slave topology in which the NameNode is the master and the DataNodes are the slaves. The NameNode stores metadata, i.e. the number of blocks, their locations, and their replicas. This metadata is kept in memory on the master for faster retrieval of data. The NameNode maintains and manages the slave nodes and assigns tasks to them.
HDFS Federation enhances the existing HDFS architecture. The prior HDFS architecture allows only a single namespace for the entire cluster, and a single NameNode manages that namespace. If the NameNode fails, the cluster as a whole is out of service. The cluster remains unavailable until the NameNode restarts or is brought up on a separate machine.
Hadoop Federation addresses the single-namespace limitation by adding support for multiple NameNodes and namespaces in HDFS. Because the NameNodes are independent of each other, the failure of one NameNode affects only its own namespace and does not stop DataNodes from serving the others.
3. Current HDFS Architecture
Hadoop HDFS has two main layers:
- Namespace – This layer manages files, directories, and blocks. It supports basic file system operations such as creating and deleting files.
- Block Storage – This layer has two parts:
a. Block management
It supports block-related operations such as creating and deleting blocks. It also manages the DataNodes in the cluster and takes care of replication.
b. Physical storage
It stores the blocks on the local file system and provides read and write access to them.
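To make the two layers concrete, here is a minimal sketch in Python of a single NameNode that holds both of them. It is a toy model, not Hadoop code: the class name `SingleNameNode`, its methods, and the round-robin placement are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SingleNameNode:
    """Toy model of a pre-federation NameNode: both layers live in one process."""

    # Namespace layer: file path -> ordered list of block IDs.
    namespace: dict = field(default_factory=dict)
    # Block management: block ID -> set of DataNodes holding a replica.
    block_map: dict = field(default_factory=dict)
    next_block_id: int = 0

    def create_file(self, path, num_blocks, datanodes, replication=3):
        """Create a file and place each block on up to `replication` DataNodes."""
        block_ids = []
        for _ in range(num_blocks):
            block_id = self.next_block_id
            self.next_block_id += 1
            # Round-robin placement, purely illustrative (no rack awareness).
            targets = {
                datanodes[(block_id + r) % len(datanodes)]
                for r in range(replication)
            }
            self.block_map[block_id] = targets
            block_ids.append(block_id)
        self.namespace[path] = block_ids
        return block_ids

    def delete_file(self, path):
        """Deleting a file removes its namespace entry and its blocks."""
        for block_id in self.namespace.pop(path):
            del self.block_map[block_id]

    def locate(self, path):
        """Return, for each block of the file, the DataNodes holding a replica."""
        return [sorted(self.block_map[b]) for b in self.namespace[path]]


nn = SingleNameNode()
nn.create_file("/logs/a", num_blocks=2, datanodes=["dn1", "dn2", "dn3", "dn4"])
print(nn.locate("/logs/a"))
```

Note that every operation touches both dictionaries inside the same object. That is the coupling between namespace and block storage that the next section lists as a limitation.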
This architecture works fine for smaller setups. For large organizations that need to manage huge amounts of data, however, it has some limitations. Hadoop Federation addresses those limitations.
4. Limitations of current HDFS Architecture
Below are some limitations of the current HDFS architecture that Hadoop HDFS Federation overcomes:
4.1. Tightly coupled block storage and Namespace
The namespace layer and the block storage layer are tightly coupled. This makes alternative implementations of the NameNode difficult, and it prevents other services from using the block storage directly.
4.2. Namespace Scalability
The namespace does not scale the way DataNodes do. An HDFS cluster scales horizontally by adding DataNodes, but we cannot add more namespaces to an existing cluster. The namespace can only be scaled vertically on a single NameNode, so it is bounded by the memory of that one machine.
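A rough back-of-the-envelope calculation shows why vertical scaling runs out. The sketch below assumes about 150 bytes of NameNode heap per namespace object (file or block). That figure is a commonly quoted rule of thumb, not a number from this article or a measured value, and the function names are hypothetical.

```python
BYTES_PER_OBJECT = 150  # assumed rule of thumb, not a measured value


def estimate_namenode_heap_gb(num_files, blocks_per_file=1.5,
                              bytes_per_object=BYTES_PER_OBJECT):
    """Rough heap (in GiB) needed to hold one namespace in a NameNode's memory."""
    # Count one object per file plus one per block; directories are ignored.
    num_objects = num_files * (1 + blocks_per_file)
    return num_objects * bytes_per_object / 1024**3


def heap_per_namenode_gb(num_files, num_namenodes, **kwargs):
    """Heap per NameNode if the files were split evenly across namespaces."""
    return estimate_namenode_heap_gb(num_files, **kwargs) / num_namenodes


for files in (10_000_000, 100_000_000, 1_000_000_000):
    print(f"{files:>13,} files: ~{estimate_namenode_heap_gb(files):.1f} GiB heap")
```

Under these assumptions the required heap grows linearly with the number of files, and all of it has to fit on one machine. Splitting the same files over several namespaces, as `heap_per_namenode_gb` does, divides the load instead.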
The performance of the entire file system depends on the throughput of a single NameNode. A NameNode at present supports about 60,000 concurrent tasks, while upcoming versions of MapReduce are expected to support more than 100,000 concurrent tasks, which will require more than one NameNode.
There is also no separation within the namespace, so there is no isolation among the tenant organizations that use the cluster.
5. HDFS Federation Architecture
Federation in Hadoop uses multiple independent NameNodes/namespaces to scale the name service horizontally. In the HDFS Federation architecture, the DataNodes sit at the bottom and are used as common storage for blocks by all the NameNodes. Each DataNode registers with all the NameNodes in the cluster. The DataNodes send periodic heartbeats and block reports to the NameNodes and handle commands from them.
Multiple NameNodes (NN1, NN2, …, NNn) manage multiple namespaces (NS1, NS2, …, NSn) respectively. Each namespace has its own block pool (NS1 has pool 1, and so on). The blocks of every pool are spread across the common DataNodes, so a single DataNode can hold blocks from pool 1, pool 2, and so on.
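The registration and reporting flow described above can be sketched as a small simulation. This is a simplified model under our own assumptions: the classes `NameNode` and `DataNode`, their methods, and the `BP-<namespace>` pool ID format are hypothetical and are not the Hadoop API.

```python
class NameNode:
    """One NameNode managing one namespace and its block pool."""

    def __init__(self, namespace_id):
        self.namespace_id = namespace_id
        self.block_pool_id = f"BP-{namespace_id}"  # hypothetical ID format
        self.live_datanodes = set()
        self.reported_blocks = {}  # DataNode name -> block IDs in this pool

    def register(self, datanode_name):
        self.live_datanodes.add(datanode_name)

    def receive_block_report(self, datanode_name, block_ids):
        self.reported_blocks[datanode_name] = set(block_ids)


class DataNode:
    """Common storage: holds blocks for every block pool in the cluster."""

    def __init__(self, name, namenodes):
        self.name = name
        self.namenodes = namenodes
        # One storage area per block pool, all on the same DataNode.
        self.pools = {nn.block_pool_id: set() for nn in namenodes}
        # Each DataNode registers with all the NameNodes.
        for nn in namenodes:
            nn.register(name)

    def store(self, block_pool_id, block_id):
        self.pools[block_pool_id].add(block_id)

    def send_block_reports(self):
        # Each NameNode only hears about the blocks in its own pool.
        for nn in self.namenodes:
            nn.receive_block_report(self.name, self.pools[nn.block_pool_id])


namenodes = [NameNode("NS1"), NameNode("NS2")]
datanodes = [DataNode("dn1", namenodes), DataNode("dn2", namenodes)]
datanodes[0].store("BP-NS1", 1)
datanodes[0].store("BP-NS2", 1)
datanodes[0].send_block_reports()
print(namenodes[0].live_datanodes, namenodes[0].reported_blocks)
```

In this model both NameNodes know about both DataNodes, yet neither NameNode needs to know that the other exists.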
5.1. Block pool
A block pool is the set of blocks that belongs to a single namespace. The HDFS Federation architecture contains a collection of such pools, and each block pool is managed independently of the others. This allows a namespace to create block IDs for new blocks without coordinating with other namespaces. Every DataNode stores blocks from all the block pools.
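The following sketch illustrates why no coordination is needed. We assume here that a block is identified by the pair (block pool ID, block ID) and that each pool hands out IDs from its own counter. The `BlockPool` class and the counter scheme are illustrative assumptions, not the way HDFS actually generates block IDs.

```python
import itertools


class BlockPool:
    """A block pool that allocates block IDs on its own, with no shared state."""

    def __init__(self, pool_id):
        self.pool_id = pool_id
        self._counter = itertools.count(1)  # private to this pool

    def allocate_block(self):
        # The pool ID is part of the block's identity, so two pools
        # may issue the same numeric block ID without a collision.
        return (self.pool_id, next(self._counter))


pool_1 = BlockPool("pool-1")
pool_2 = BlockPool("pool-2")

a = pool_1.allocate_block()
b = pool_2.allocate_block()

# The same DataNode can store both blocks side by side.
datanode_storage = {a: b"data for NS1", b: b"data for NS2"}
print(a, b, len(datanode_storage))
```

Both pools issue the numeric ID 1, but the full identifiers differ, so the DataNode keeps two distinct blocks.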
5.2. Namespace volume
A namespace together with its block pool is called a namespace volume. An HDFS Federation cluster contains many namespace volumes, and each one works independently. When a NameNode or namespace is deleted, the corresponding block pool on the DataNodes is deleted as well.
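Deleting a namespace volume can be pictured as dropping one pool from every DataNode while leaving the others untouched. The sketch below uses a plain dictionary to stand in for DataNode storage. The function name `delete_namespace_volume` and the data layout are assumptions for illustration only.

```python
def delete_namespace_volume(datanode_storage, block_pool_id):
    """Drop one block pool from every DataNode; return the number of blocks removed."""
    removed = 0
    for pools in datanode_storage.values():
        # pop() with a default keeps this safe for DataNodes that
        # hold no blocks from the pool being deleted.
        removed += len(pools.pop(block_pool_id, set()))
    return removed


# DataNode name -> block pool ID -> block IDs stored for that pool.
storage = {
    "dn1": {"pool-1": {1, 2}, "pool-2": {1}},
    "dn2": {"pool-1": {2, 3}, "pool-2": {1, 4}},
}

removed = delete_namespace_volume(storage, "pool-1")
print(removed, storage)
```

After the call, only the blocks of pool 2 remain on the DataNodes, which matches the idea that a namespace volume is a self-contained unit.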
6. Benefits of HDFS Federation
Hadoop Federation overcomes the limitations of the prior HDFS architecture. It provides:
6.1. Isolation – A single NameNode offers no isolation in a multi-user environment. With HDFS Federation, different categories of applications and users can be isolated in different namespaces by using multiple NameNodes.
6.2. Namespace Scalability – With federation, the file system namespace scales horizontally by adding more NameNodes.
6.3. Performance – Read/write operation throughput is no longer limited by a single NameNode and can be improved by adding more NameNodes.
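To illustrate the isolation benefit, the sketch below routes each path to the namespace that owns it, using longest-prefix matching over a small mount table. Hadoop offers a client-side mount table (ViewFs) for this purpose, but the code here is not its implementation: the function `resolve_namespace`, the table contents, and the matching rule are assumptions for illustration.

```python
def resolve_namespace(path, mount_table):
    """Return the namespace whose mount point is the longest prefix of `path`."""
    best = None
    for mount_point in mount_table:
        prefix = mount_point.rstrip("/") + "/"
        # Match whole path components only, so "/database" is not under "/data".
        if path == mount_point or path.startswith(prefix):
            if best is None or len(mount_point) > len(best):
                best = mount_point
    if best is None:
        raise KeyError(f"no namespace mounted for {path}")
    return mount_table[best]


# Hypothetical layout: each tenant or application category gets its own namespace.
mount_table = {
    "/user": "NS1",
    "/data": "NS2",
    "/data/archive": "NS3",
}

for p in ("/user/alice/report.csv", "/data/raw/events", "/data/archive/2020/x"):
    print(p, "->", resolve_namespace(p, mount_table))
```

With such a table, heavy activity under `/data/archive` is handled by the NameNode for NS3 and does not compete with users working under `/user`.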
7. HDFS Federation – Conclusion
In conclusion, HDFS Federation overcomes the limitations of the single-NameNode HDFS architecture. The prior HDFS architecture allows only a single namespace for the entire cluster, while Hadoop Federation uses multiple independent NameNodes/namespaces to scale the name service horizontally. It also separates the namespace layer from the storage layer. Hence HDFS Federation provides isolation, scalability, and better performance while keeping the design simple.
If you like this post on HDFS Federation or have any queries, kindly let us know by leaving a comment in the section below. We will be glad to answer them.