Describe HDFS federation?

  • Author
    Posts
    • #5496
      DataFlair TeamDataFlair Team
      Spectator

      What is Hadoop Federation?
      Why did HDFS Federation come into existence?
      What is Hadoop HDFS Federation?

    • #5499
      DataFlair TeamDataFlair Team
      Spectator

      In simple words, HDFS Federation is a way to enhance the current HDFS architecture. It provides a clear separation between the namespace and storage layer of the existing HDFS architecture. The two parts and their primary operations are as below:

      1. Namespace – This layer manages files, directories and blocks. This layer stores the metadata and supports the basic file system operations e.g. listing/creation/modification/deletion of files and folders.
      2. Block Storage – This layer is further divided into two parts:

         Block Management – This manages datanode membership in the cluster and handles block operations and replication management.
         Physical Storage – This stores the blocks and provides access for read and write operations.

      To understand this clearly, let's first analyse the existing HDFS architecture and its challenges:

      In the current architecture there is only one namespace, held by a single namenode that manages a cluster of datanodes. This works well for small clusters, but as the cluster grows the model runs into several challenges and limitations:

      1. Tightly coupled Block Storage and Namespace- This tight coupling makes it difficult for other services to interact with the block storage and use it efficiently.

      2. Namespace Scalability- The cluster scales horizontally by adding more datanodes, but it is not possible to scale the namenode horizontally. The namenode can be scaled vertically, but the huge amount of metadata in a large cluster makes even vertical scaling difficult on a single namenode machine.

      3. Performance- File system operations are limited to the throughput of a single namenode, which at present supports 60,000 concurrent tasks.

      4. Isolation- HDFS is commonly deployed in multi-tenant environments where a single cluster is shared by multiple organizations. With a single namenode, an application or organization cannot be given its own separate namespace.

      HDFS Federation:

      HDFS Federation was introduced to solve these challenges. It lets the namenode layer scale horizontally by using several namenodes, each managing its own namespace, that are independent of each other. These namenodes are federated, i.e. they do not need to coordinate with one another.

      Each datanode is registered with all the namenodes in the cluster.
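
      The registration pattern described above can be sketched as a toy model. This is only an illustration, not Hadoop code: the class names, the nameservice IDs "ns1" and "ns2", and the "BP-..." pool ID format are assumptions made for the example, and replication is left out.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    name: str
    # Blocks are grouped by block pool, one pool per namespace.
    blocks_by_pool: dict = field(default_factory=dict)

    def store(self, pool_id: str, block_id: str) -> None:
        self.blocks_by_pool.setdefault(pool_id, set()).add(block_id)


@dataclass
class NameNode:
    nameservice: str                                 # hypothetical ID, e.g. "ns1"
    namespace: dict = field(default_factory=dict)    # path -> list of block IDs
    datanodes: list = field(default_factory=list)

    @property
    def pool_id(self) -> str:
        # Hypothetical pool ID format, chosen for readability.
        return f"BP-{self.nameservice}"

    def register(self, dn: DataNode) -> None:
        self.datanodes.append(dn)

    def create_file(self, path: str, num_blocks: int) -> None:
        block_ids = [f"{path}#blk{i}" for i in range(num_blocks)]
        self.namespace[path] = block_ids
        # Spread blocks round-robin over the shared datanodes (no replication).
        for i, block_id in enumerate(block_ids):
            self.datanodes[i % len(self.datanodes)].store(self.pool_id, block_id)


datanodes = [DataNode("dn1"), DataNode("dn2")]
namenodes = [NameNode("ns1"), NameNode("ns2")]

# Each datanode registers with every namenode in the cluster.
for nn in namenodes:
    for dn in datanodes:
        nn.register(dn)

namenodes[0].create_file("/sales/2024.csv", num_blocks=2)
namenodes[1].create_file("/logs/app.log", num_blocks=1)
```

      In this sketch the two namenodes never call each other, yet dn1 ends up holding blocks for both namespaces, kept apart by pool ID.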

      Follow the link to learn more about HDFS Federation in Hadoop

    • #5500
      DataFlair TeamDataFlair Team
      Spectator

      Hadoop 1.0 HDFS architecture:
      Two layers –
      1) Namespace- It manages files/directories and blocks.
      2) Block Storage- This layer has two parts –

         Block Management- This manages the datanodes in the cluster and provides operations like creation, deletion, modification and search. It also takes care of the replication management.
         Physical Storage- This stores the blocks and provides access for read or write operations.

      Need for HDFS Federation in Hadoop
      1) Limited namespace: The single NameNode keeps all metadata in RAM, which puts pressure on its memory.
      2) Decreased metadata operation performance: A single NameNode performs all metadata operations.
      3) Lack of isolation: All metadata is available at one single point.
      4) NameNode scalability: The NameNode stores some metadata for every block, so more blocks mean more memory overhead on the NameNode.
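
      To get a feel for the memory pressure in points 1 and 4, here is a rough back-of-the-envelope calculation. The figure of about 150 bytes per namespace object is a commonly quoted rule of thumb and the 128 MB block size is a typical setting; both are assumptions here, not measurements, and the function name is made up for the example.

```python
BYTES_PER_OBJECT = 150          # assumed rule of thumb, not an exact figure
BLOCK_SIZE = 128 * 1024**2      # assumed 128 MB block size


def namenode_heap_bytes(num_files: int, avg_file_size: int) -> int:
    """Rough heap estimate: one object per file plus one per block."""
    blocks_per_file = -(-avg_file_size // BLOCK_SIZE)   # ceiling division
    num_objects = num_files * (1 + blocks_per_file)
    return num_objects * BYTES_PER_OBJECT


# 100 million files of 200 MB each -> 2 blocks per file, 300 million objects.
estimate = namenode_heap_bytes(100_000_000, 200 * 1024**2)
print(f"{estimate / 1024**3:.1f} GiB")
```

      Under these assumptions the estimate comes to 45,000,000,000 bytes, all of which must sit in the heap of one machine when there is a single NameNode.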

      What is Hadoop Federation

      1) HDFS Federation enhances the Hadoop 1.0 HDFS architecture. It provides a clear separation between namespace and storage and thus enables scalability and isolation at the cluster level.

      2) Hadoop federation separates the namespace layer and storage layer.
      3) It has multiple independent NameNodes, each managing its own namespace and the block pool that belongs to that namespace.
      4) The NameNodes in Hadoop Federation do not talk to each other.
      5) Each namespace manages only a particular slice of the data.
      6) Datanodes, on the other hand, can store blocks managed by any NameNode.
      7) Since there are multiple namespaces and NameNodes, the end user can use any of them to create their own view of HDFS.
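
      Point 7 can be pictured as a client-side mount table that maps path prefixes to namespaces, which is the idea behind Hadoop's ViewFs. The sketch below is a simplified illustration only: the mount points, the nameservice IDs and the resolve function are hypothetical and do not reproduce the ViewFs implementation.

```python
# Hypothetical mount table: path prefix -> nameservice ID.
MOUNT_TABLE = {
    "/user": "ns1",
    "/logs": "ns2",
}


def resolve(path: str) -> str:
    """Return the nameservice whose mount point contains the given path."""
    for mount, nameservice in MOUNT_TABLE.items():
        # Match the mount point itself or anything underneath it.
        if path == mount or path.startswith(mount + "/"):
            return nameservice
    raise KeyError(f"no mount point covers {path}")
```

      With this table, resolve("/user/alice/data.csv") returns "ns1" and resolve("/logs/app.log") returns "ns2", so the client sees one directory tree even though two independent NameNodes serve it.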

      Even with federation, each NameNode remains a single point of failure (SPOF) for its own namespace, which motivated the Hadoop 2.0 High Availability feature:
      1) Hadoop 2.0 overcomes the SPOF problem by adding an extra NameNode (a passive Standby NameNode) to the architecture, which can be configured for automatic failover.
      2) The Hadoop 2.0 High Availability project aims to keep big data applications available 24/7 by deploying two NameNodes: one in active configuration and the other as a standby in passive configuration.
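
      The active/standby idea can be illustrated with a very small toy model. It only shows the role swap: real Hadoop HA also relies on a shared edit log and, for automatic failover, on ZooKeeper, both of which are left out here. The class and node names are hypothetical.

```python
class HAPair:
    """Toy model of one active and one standby NameNode (names are hypothetical)."""

    def __init__(self, active: str, standby: str) -> None:
        self.active = active
        self.standby = standby
        self.healthy = {active: True, standby: True}

    def mark_failed(self, node: str) -> None:
        self.healthy[node] = False

    def failover_if_needed(self) -> str:
        """Promote the standby when the active is down; return the active node."""
        if not self.healthy[self.active]:
            if not self.healthy[self.standby]:
                raise RuntimeError("both NameNodes are down")
            self.active, self.standby = self.standby, self.active
        return self.active


pair = HAPair("nn1", "nn2")
pair.mark_failed("nn1")
current = pair.failover_if_needed()
```

      After nn1 is marked as failed, the call promotes nn2, so clients of that namespace keep a NameNode to talk to.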

