What is HDFS Federation?

    • #5878
      DataFlair Team
      Spectator

      Why did HDFS Federation come into existence?
      What is Hadoop HDFS Federation?

    • #5879
      DataFlair Team
      Spectator

      The Hadoop 1 architecture allowed only one NameNode per cluster. The NameNode holds the namespace and the block manager, through which it manages all the files, directories, and blocks stored on the DataNodes. A namespace together with the blocks it refers to on the DataNodes is called a namespace volume.
      This architecture had a few drawbacks:
      1) Since there is only one NameNode, any failure of the NameNode brings down the whole file system.
      2) The NameNode cannot be scaled horizontally, because the entire namespace has to be served by a single node.
      3) Since there is only one NameNode, there is no isolation when multiple user groups share the system. For example, if a Hadoop cluster is set up for the marketing and sales teams, both teams have to use the same NameNode to access their data.

      To solve these issues, the Hadoop 2 architecture introduced HDFS Federation, which allows a cluster to have multiple NameNodes. The NameNodes do not interact with each other, and each has its own block pool (the set of blocks that belong to that NameNode's namespace; a namespace together with its block pool is called a namespace volume). All DataNodes serve as common storage and store blocks from the block pools of all the NameNodes. If a particular NameNode goes down, the rest of the system keeps working; only the namespace and block pool of the failed NameNode become inaccessible. This separation into independent namespace volumes, which provides isolation and scalability, is called HDFS Federation.
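
      The isolation described above can be pictured with a small Python sketch. This is a toy model, not the Hadoop API: the `NameNode` class, the `read_file` function, and the routing by path prefix are all hypothetical names chosen for illustration.

```python
class NameNode:
    """Toy model of one federated NameNode: one namespace, one block pool."""

    def __init__(self, name, mount_prefix):
        self.name = name
        self.mount_prefix = mount_prefix  # hypothetical: path prefix this namespace serves
        self.files = {}                   # path -> block IDs in this NameNode's block pool
        self.is_up = True

    def add_file(self, path, block_ids):
        self.files[path] = list(block_ids)


def read_file(namenodes, path):
    """Route a path to the NameNode that owns it; fail only if that one is down."""
    for nn in namenodes:
        if path.startswith(nn.mount_prefix):
            if not nn.is_up:
                raise RuntimeError(f"namespace {nn.name} is unavailable")
            return nn.files[path]
    raise FileNotFoundError(path)


marketing = NameNode("ns-marketing", "/marketing")
sales = NameNode("ns-sales", "/sales")
marketing.add_file("/marketing/q1.csv", [1, 2])
sales.add_file("/sales/leads.csv", [1, 2, 3])

marketing.is_up = False  # simulate the failure of one NameNode
sales_blocks = read_file([marketing, sales], "/sales/leads.csv")  # still served
```

      In this sketch a read under /marketing raises an error while the marketing NameNode is down, but reads under /sales are unaffected.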

      Follow the link for more detail: HDFS Federation in Hadoop

    • #5882
      DataFlair Team
      Spectator

      HDFS has two main layers: the namespace and the storage layer.

      The HDFS namespace is responsible for managing the directories, files, and blocks.
      It supports namespace-related file operations such as creating, deleting, and modifying files and directories.

      The storage layer comprises two basic components.
      Block Management: it performs the following operations:
      1. Tracks the periodic heartbeats of the DataNodes and manages DataNode membership in the cluster.

      2. Processes block reports and maintains block locations. Supports block operations such as creation, modification, deletion, and lookup of block locations.

      3. Keeps the replication factor consistent throughout the cluster by re-replicating under-replicated blocks and deleting over-replicated ones.

      Physical Storage: it is provided by the DataNodes, which store the data and provide read/write access to the data stored in HDFS.
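
      The block management duties listed above (heartbeats, block reports, replication checks) can be sketched roughly as follows. This is a simplified model under stated assumptions: the `BlockManager` class, its method names, the 30-second timeout, and the target of 3 replicas are illustrative choices, not Hadoop's actual classes or defaults.

```python
HEARTBEAT_TIMEOUT_S = 30.0  # assumed value for this sketch only
REPLICATION_FACTOR = 3      # assumed replication target


class BlockManager:
    def __init__(self):
        self.last_heartbeat = {}   # DataNode id -> time of last heartbeat
        self.block_locations = {}  # block id -> set of DataNode ids holding a replica

    def heartbeat(self, datanode, now):
        self.last_heartbeat[datanode] = now

    def block_report(self, datanode, block_ids):
        # Forget what this DataNode held before, then record what it reports now.
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode)

    def live_datanodes(self, now):
        return {dn for dn, ts in self.last_heartbeat.items()
                if now - ts <= HEARTBEAT_TIMEOUT_S}

    def under_replicated(self, now):
        # A block counts as under-replicated if too few replicas sit on live DataNodes.
        live = self.live_datanodes(now)
        return sorted(block for block, holders in self.block_locations.items()
                      if len(holders & live) < REPLICATION_FACTOR)


bm = BlockManager()
for dn in ("dn1", "dn2", "dn3"):
    bm.heartbeat(dn, now=0.0)
    bm.block_report(dn, ["blk_1"])
healthy = bm.under_replicated(now=10.0)   # all three replicas are on live DataNodes

bm.heartbeat("dn1", now=40.0)
bm.heartbeat("dn2", now=40.0)             # dn3 has been silent since t=0
degraded = bm.under_replicated(now=40.0)  # blk_1 now has only two live replicas
```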

      So, in Hadoop 1.x the HDFS architecture allowed only a single namespace per cluster.
      In this architecture, a single NameNode is responsible for managing the namespace. This design is simple and easy to implement.

      Let’s look at the problems faced with this architecture:

      1) The namespace does not scale horizontally the way storage does. Storage grows by adding DataNodes, but the cluster can hold only as many DataNodes as a single NameNode can handle.
      2) The two layers, i.e. the namespace layer and the storage layer, are tightly coupled, which makes alternative implementations of the NameNode very difficult.
      3) The performance of the entire Hadoop system depends on the throughput of the NameNode.
      Therefore, the performance of all HDFS operations depends on how many tasks the NameNode can handle at a given time.
      4) The NameNode stores the entire namespace in RAM for fast access.
      This caps the number of namespace objects (files and blocks) that a single namespace server can handle at what fits in its memory.
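
      To get a feel for the memory limit in point 4, here is a back-of-the-envelope estimate in Python. The figure of about 150 bytes per namespace object is a commonly quoted rule of thumb and is treated here as an assumption, not a measured value; the function name and the example numbers are likewise made up for illustration, and directories are ignored.

```python
BYTES_PER_OBJECT = 150  # assumed rule of thumb per file or block object


def estimate_namenode_heap_bytes(num_files, blocks_per_file,
                                 bytes_per_object=BYTES_PER_OBJECT):
    """Rough heap needed to hold file and block objects in NameNode memory."""
    num_objects = num_files + num_files * blocks_per_file
    return num_objects * bytes_per_object


# Example: 100 million files, one block each, all in a single namespace.
single_heap = estimate_namenode_heap_bytes(100_000_000, 1)
single_heap_gib = single_heap / 2**30

# With federation, the same files split evenly across 4 namespaces (assumed split).
per_namenode_gib = single_heap_gib / 4
```

      Under these assumptions a single NameNode would need close to 28 GiB of heap for the namespace alone, while four federated NameNodes would need about 7 GiB each.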

      HDFS Federation Architecture:
      In the HDFS Federation architecture, the name service scales horizontally. There are multiple NameNodes, which are federated, i.e. independent of each other. The DataNodes sit at the bottom as the common underlying storage layer. Each DataNode registers with all the NameNodes in the cluster. The DataNodes send periodic heartbeats and block reports and handle commands from the NameNodes.

      There are multiple namespaces (NS1, NS2, …, NSn), and each of them is managed by its own NameNode.
      Each namespace has its own block pool (NS1 has Pool 1, NSk has Pool k, and so on). The blocks from Pool 1 are stored on DataNode 1, DataNode 2, and so on.
      Similarly, blocks from every block pool can reside on any of the DataNodes, so each DataNode stores blocks for all the block pools.
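
      The relationship between DataNodes and NameNodes described above can be sketched as follows. It is a toy model: the class and method names and the pool IDs "BP-1" and "BP-2" are hypothetical and do not reflect Hadoop's real classes or ID format.

```python
class NameNode:
    def __init__(self, namespace_id, block_pool_id):
        self.namespace_id = namespace_id
        self.block_pool_id = block_pool_id
        self.registered = set()    # DataNode ids known to this NameNode
        self.reported_blocks = {}  # DataNode id -> blocks it holds from this pool

    def register(self, datanode_id):
        self.registered.add(datanode_id)

    def receive_block_report(self, datanode_id, block_ids):
        self.reported_blocks[datanode_id] = set(block_ids)


class DataNode:
    def __init__(self, datanode_id):
        self.datanode_id = datanode_id
        self.storage = {}  # block pool id -> block ids stored on this DataNode

    def store(self, block_pool_id, block_id):
        self.storage.setdefault(block_pool_id, set()).add(block_id)

    def register_with_all(self, namenodes):
        # One DataNode registers with every NameNode in the cluster.
        for nn in namenodes:
            nn.register(self.datanode_id)

    def send_block_reports(self, namenodes):
        # Each NameNode hears only about the blocks from its own block pool.
        for nn in namenodes:
            blocks = self.storage.get(nn.block_pool_id, set())
            nn.receive_block_report(self.datanode_id, blocks)


nn1 = NameNode("NS1", "BP-1")
nn2 = NameNode("NS2", "BP-2")
dn = DataNode("dn1")
dn.register_with_all([nn1, nn2])
dn.store("BP-1", 101)
dn.store("BP-2", 101)
dn.store("BP-2", 102)
dn.send_block_reports([nn1, nn2])
```

      After this runs, both NameNodes know about dn1, but each one sees only the blocks of its own pool.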

      Now, let’s understand the Block Pool of the HDFS Federation Architecture in detail:

      Block Pool:

      A block pool is the set of blocks that belong to a single namespace.
      So, we have a collection of block pools, each managed independently of the others.
      This independence allows a namespace to generate block IDs for new blocks without coordinating with other namespaces.
      Each DataNode stores blocks from all the block pools in the cluster.
      In effect, a block pool provides an abstraction through which the data blocks residing on the DataNodes can be grouped by the namespace they belong to.
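
      One way to picture why block IDs need no coordination: if every block is identified by its block pool together with its block ID, two namespaces can hand out the same number without clashing. The sketch below assumes that model; the `Namespace` class, the counter-based IDs, and the pool names are hypothetical.

```python
import itertools


class Namespace:
    def __init__(self, block_pool_id):
        self.block_pool_id = block_pool_id
        self._next_id = itertools.count(1)  # local counter, no shared state

    def allocate_block(self):
        # A block is identified by (block pool id, block id).
        return (self.block_pool_id, next(self._next_id))


ns1 = Namespace("BP-1")
ns2 = Namespace("BP-2")
a1 = ns1.allocate_block()
a2 = ns1.allocate_block()
b1 = ns2.allocate_block()  # same numeric ID as a1, but a different pool

# A DataNode can hold all three blocks side by side ...
datanode_store = {a1: b"data-a1", a2: b"data-a2", b1: b"data-b1"}

# ... and group them by the namespace (block pool) they belong to.
blocks_by_pool = {}
for pool_id, block_id in datanode_store:
    blocks_by_pool.setdefault(pool_id, []).append(block_id)
```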
