Introduction to HDFS Federation in Hadoop

1. Objective

This blog walks you through HDFS Federation in Hadoop. We will cover an introduction to HDFS Federation and the motivation behind it, the current HDFS architecture and its limitations, the architecture of HDFS Federation, and the advantages it brings.

2. What is HDFS Federation?

Hadoop Distributed File System (HDFS) is the storage layer of Hadoop, designed for storing very large files reliably.

HDFS architecture follows a master/slave topology in which the NameNode is the master and the DataNodes are the slaves. The NameNode stores metadata, i.e. the number of blocks, their locations, and their replicas. This metadata is kept in memory on the master for faster retrieval. The NameNode maintains and manages the slave nodes and assigns tasks to them.

HDFS Federation enhances the existing HDFS architecture. The prior HDFS architecture allows only a single namespace for the entire cluster, and a single NameNode manages that namespace. If the NameNode fails, the whole cluster is out of service and stays unavailable until the NameNode restarts or is brought up on a separate machine.

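Federation addresses the single-namespace limitation by adding support for multiple NameNodes/namespaces to HDFS. Because the NameNodes are independent, a failed NameNode affects only its own namespace rather than the whole cluster.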

3. Current HDFS Architecture

Hadoop HDFS has two main layers:

  1. Namespace – This layer manages files, directories, and blocks. It supports basic file system operations such as creating and deleting files.
  2. Block Storage – It has two parts:

a. Block management

It supports block-related operations such as the creation and deletion of blocks. It manages the datanodes in the cluster and takes care of replication management.

b. Physical storage

This stores the blocks on the local file system and provides read and write access to them.

The current HDFS architecture works fine for smaller setups. But for large organizations that need to handle huge amounts of data, it has some limitations. HDFS Federation addresses those limitations.

4. Limitations of current HDFS Architecture

Below are some limitations of the current HDFS architecture that are overcome by HDFS Federation:

4.1. Tightly coupled block storage and Namespace

The namespace layer and the storage layer are tightly coupled. This makes an alternate implementation of the namenode difficult, and it prevents other services from using the block storage directly.

4.2. Namespace Scalability

The namespace does not scale the way datanodes do. An HDFS cluster scales horizontally by adding datanodes, but we can't add more namespaces to an existing cluster. The namespace can only be scaled vertically on a single namenode.
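
To make the limit of vertical scaling concrete, here is a rough back-of-the-envelope sketch. It assumes about 150 bytes of namenode heap per file, directory, or block, which is a commonly quoted rule of thumb and not a figure from this post. The file counts and the even four-way split are also made up for illustration.

```python
# ASSUMPTION: rule-of-thumb heap cost per namespace object (file, directory, block).
BYTES_PER_OBJECT = 150


def estimate_heap_bytes(num_files, blocks_per_file=1, num_dirs=0):
    """Approximate heap needed to keep one namespace's metadata in memory."""
    objects = num_files + num_dirs + num_files * blocks_per_file
    return objects * BYTES_PER_OBJECT


def to_gib(num_bytes):
    """Convert bytes to GiB for easier reading."""
    return num_bytes / (1024 ** 3)


# Hypothetical workload: 100 million files with 2 blocks each.
single = estimate_heap_bytes(num_files=100_000_000, blocks_per_file=2)

# The same workload split evenly across 4 federated namenodes.
per_namenode = estimate_heap_bytes(num_files=25_000_000, blocks_per_file=2)

print(f"single namenode: {to_gib(single):.1f} GiB")
print(f"each of 4 federated namenodes: {to_gib(per_namenode):.1f} GiB")
```

With a single namenode, all of that metadata has to fit in one machine's memory. Splitting the namespace spreads the same load over several machines.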

4.3. Performance

The performance of the entire Hadoop cluster depends on the throughput of the namenode, because file system operations are limited by the throughput of a single namenode. At present a NameNode supports 60,000 concurrent tasks. Upcoming MapReduce will support more than 100,000 concurrent tasks, and this will need more namenodes.

4.4. Isolation

There is no separation of the namespace, so there is no isolation among the tenant organizations that use the cluster.

5. HDFS Federation Architecture

Federation uses multiple independent namenodes/namespaces to scale the name service horizontally. In the HDFS Federation architecture, the datanodes sit at the bottom and are used as common storage for blocks by all the namenodes. Each datanode registers with all the namenodes in the cluster. The datanodes send periodic heartbeats and block reports, and handle commands from the namenodes.
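
The registration and heartbeat flow described above can be sketched as a small in-memory model. This is only an illustration, not Hadoop's actual implementation. The class names, the `tick` counter, and the node ids are all hypothetical.

```python
class NameNode:
    """Toy namenode that tracks which datanodes have registered with it."""

    def __init__(self, namespace_id):
        self.namespace_id = namespace_id
        self.registered = set()      # ids of datanodes known to this namenode
        self.last_heartbeat = {}     # datanode id -> tick of the last heartbeat

    def register(self, datanode_id):
        self.registered.add(datanode_id)

    def heartbeat(self, datanode_id, tick):
        if datanode_id not in self.registered:
            raise ValueError(f"{datanode_id} is not registered with {self.namespace_id}")
        self.last_heartbeat[datanode_id] = tick
        return []  # commands for the datanode; always empty in this sketch


class DataNode:
    """Toy datanode that talks to every namenode in the cluster."""

    def __init__(self, datanode_id, namenodes):
        self.datanode_id = datanode_id
        self.namenodes = list(namenodes)

    def start(self):
        # Register with all namenodes, not just one.
        for namenode in self.namenodes:
            namenode.register(self.datanode_id)

    def send_heartbeats(self, tick):
        # Each namenode receives its own heartbeat from this datanode.
        for namenode in self.namenodes:
            namenode.heartbeat(self.datanode_id, tick)


namenodes = [NameNode("NS1"), NameNode("NS2")]
datanodes = [DataNode(f"dn{i}", namenodes) for i in range(1, 4)]

for datanode in datanodes:
    datanode.start()
    datanode.send_heartbeats(tick=1)

for namenode in namenodes:
    print(namenode.namespace_id, sorted(namenode.registered))
```

The namenodes never talk to each other in this model. Only the datanodes are shared.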

Multiple namenodes (NN1, NN2, …, NNn) manage multiple namespaces (NS1, NS2, …, NSn) respectively. Each namespace has its own block pool (NS1 has pool 1, and so on). The blocks of each pool are stored on the shared datanodes (datanode 1, datanode 2, and so on).
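
On the configuration side, a federated cluster lists its nameservices and gives each one an RPC address. The sketch below only builds the key/value pairs that would go into `hdfs-site.xml`. The property names `dfs.nameservices` and `dfs.namenode.rpc-address.<nameservice>` are the standard ones. The hostnames, the nameservice ids, and the helper function itself are assumptions for illustration, and the port is just a commonly used value.

```python
def federation_properties(namenode_hosts, rpc_port=8020):
    """Build hdfs-site.xml style properties for a set of nameservices.

    namenode_hosts maps a nameservice id to the host running its namenode.
    This helper is hypothetical and not part of any Hadoop API.
    """
    props = {"dfs.nameservices": ",".join(namenode_hosts)}
    for nameservice_id, host in namenode_hosts.items():
        props[f"dfs.namenode.rpc-address.{nameservice_id}"] = f"{host}:{rpc_port}"
    return props


# Hypothetical hosts for two independent namespaces.
props = federation_properties({
    "ns1": "nn1.example.com",
    "ns2": "nn2.example.com",
})

for key, value in props.items():
    print(f"{key} = {value}")
```

A real deployment would carry further per-nameservice settings, so treat this as a starting point and not a complete configuration.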

5.1. Block pool

A block pool is a set of blocks that belongs to a single namespace. An HDFS Federation cluster has a collection of block pools, and each block pool is managed independently of the others. This allows a namespace to create block IDs for new blocks without coordinating with the other namespaces. Every datanode stores blocks from all the block pools.
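
A minimal sketch of why independent block pools need no coordination: each pool hands out its own IDs, and a block is identified by the pair (pool id, block id). The pool ids `BP-1` and `BP-2`, the class names, and the simple counter are hypothetical simplifications, not how Hadoop actually generates ids.

```python
import itertools


class BlockPool:
    """Toy block pool that allocates block ids on its own."""

    def __init__(self, pool_id):
        self.pool_id = pool_id
        self._next_id = itertools.count(1)  # private counter, not shared with other pools

    def allocate_block(self):
        # The pool id qualifies the block id, so ids never clash across pools.
        return (self.pool_id, next(self._next_id))


class DataNodeStorage:
    """Toy datanode that stores blocks from every pool, grouped by pool id."""

    def __init__(self):
        self.blocks = {}  # pool id -> set of block ids

    def store(self, block):
        pool_id, block_id = block
        self.blocks.setdefault(pool_id, set()).add(block_id)


pool1 = BlockPool("BP-1")
pool2 = BlockPool("BP-2")
storage = DataNodeStorage()

# Both pools allocate ids 1 and 2 without talking to each other.
for pool in (pool1, pool2):
    for _ in range(2):
        storage.store(pool.allocate_block())

print(storage.blocks)
```

Both pools produce the same numeric ids, yet the datanode keeps them apart because it files each block under its pool.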

5.2. Namespace volume

A namespace together with its block pool is called a namespace volume. An HDFS Federation cluster has many namespace volumes, and each one works independently. When a namenode or namespace is deleted, the corresponding block pool on the datanodes is deleted as well.
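
The deletion behavior can be modeled in a few lines. This is a sketch under simplifying assumptions: the classes `VolumeDataNode` and `FederatedCluster` and all the ids are hypothetical, and real datanodes clean up block pools through their own mechanisms.

```python
class VolumeDataNode:
    """Toy datanode holding blocks grouped by block pool."""

    def __init__(self, name):
        self.name = name
        self.pools = {}  # block pool id -> set of block ids

    def store(self, pool_id, block_id):
        self.pools.setdefault(pool_id, set()).add(block_id)

    def delete_pool(self, pool_id):
        # Drop every block of the pool; other pools are untouched.
        self.pools.pop(pool_id, None)


class FederatedCluster:
    """Toy cluster that maps each namespace to its block pool."""

    def __init__(self, datanodes):
        self.datanodes = list(datanodes)
        self.volumes = {}  # namespace id -> block pool id

    def add_namespace_volume(self, namespace_id, pool_id):
        self.volumes[namespace_id] = pool_id

    def remove_namespace_volume(self, namespace_id):
        # Removing the namespace removes its block pool from every datanode.
        pool_id = self.volumes.pop(namespace_id)
        for datanode in self.datanodes:
            datanode.delete_pool(pool_id)


dn1, dn2 = VolumeDataNode("dn1"), VolumeDataNode("dn2")
cluster = FederatedCluster([dn1, dn2])
cluster.add_namespace_volume("NS1", "BP-1")
cluster.add_namespace_volume("NS2", "BP-2")

dn1.store("BP-1", 1)
dn1.store("BP-2", 1)
dn2.store("BP-1", 2)

cluster.remove_namespace_volume("NS1")
print(dn1.pools, dn2.pools)
```

After NS1 is removed, only the blocks of NS2's pool remain, which is what makes each namespace volume a self-contained unit of management.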

6. Benefits of HDFS Federation

HDFS Federation overcomes the limitations of prior HDFS architecture. Hence it provides:

6.1. Isolation – A single namenode offers no isolation in a multi-user environment. In HDFS Federation, different categories of applications and users can be isolated to different namespaces by using multiple namenodes.

6.2. Namespace Scalability – In federation, the filesystem namespace scales horizontally by adding more namenodes.

6.3. Performance – Adding more namenodes improves the throughput of read/write operations.

7. Conclusion

In conclusion, HDFS Federation overcomes the limitations of the single-namenode HDFS architecture. The prior HDFS architecture allows only a single namespace for the entire cluster, while Federation uses many independent namenodes/namespaces to scale the name service horizontally. It separates the namespace layer from the storage layer and hence provides isolation, scalability, and a simple design.

If you like this post or have any queries, kindly let us know by leaving a comment in the section below. We will be glad to answer them.
