Hadoop High Availability – HDFS Feature
In this Hadoop tutorial, we discuss the Hadoop High Availability feature. The tutorial introduces Hadoop High Availability, explains how high availability is achieved in Hadoop, walks through an example of High Availability in Hadoop, and lists the issues that affected legacy systems.
2. Hadoop HDFS High Availability – Introduction
HDFS is a distributed file system. It distributes data among the nodes in the cluster by creating replicas of each file's blocks and storing them on different machines in the HDFS cluster. Whenever a user wants to access their data, it can be read from any of the machines that hold a replica, normally the closest one in the cluster. Under unfavorable conditions such as the failure of a node, the user can still access the data from the other nodes, because HDFS keeps replicas of user data on other nodes in the cluster. To learn more about this storage layer, follow the HDFS introductory guide.
3. How is High Availability achieved in Hadoop HDFS?
An HDFS cluster contains a number of DataNodes, and each of them sends a heartbeat message to the NameNode at a regular interval. If the NameNode stops receiving heartbeats from a DataNode, it assumes that DataNode is dead. It then checks which data was stored on that node and instructs other DataNodes that hold the same data to copy it to additional DataNodes, so that the required number of replicas is restored. Hence the data remains available despite the failure.
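The heartbeat and re-replication logic described above can be sketched as a small simulation. This is only an illustrative model, not Hadoop code: the class `NameNodeSketch`, its method names, the 30-second timeout, and the node names are all assumptions made for the example.

```python
from dataclasses import dataclass, field

HEARTBEAT_TIMEOUT = 30.0  # seconds; illustrative value, not an HDFS default
REPLICATION_FACTOR = 3    # desired number of replicas per block


@dataclass
class NameNodeSketch:
    """Toy model of NameNode liveness tracking (hypothetical, not a Hadoop class)."""
    last_heartbeat: dict = field(default_factory=dict)   # datanode -> last timestamp
    block_locations: dict = field(default_factory=dict)  # block id -> set of datanodes

    def receive_heartbeat(self, datanode, now):
        """Record that a DataNode reported in at time `now`."""
        self.last_heartbeat[datanode] = now

    def live_nodes(self, now):
        """DataNodes whose last heartbeat is recent enough to count as alive."""
        return {dn for dn, ts in self.last_heartbeat.items()
                if now - ts <= HEARTBEAT_TIMEOUT}

    def re_replicate(self, now):
        """Forget replicas on dead nodes and schedule copies onto live nodes."""
        live = self.live_nodes(now)
        actions = []
        for block, holders in sorted(self.block_locations.items()):
            holders.intersection_update(live)  # drop replicas on dead nodes
            if not holders:
                continue  # every replica is gone; nothing to copy from
            source = min(holders)  # any surviving holder can act as the source
            for target in sorted(live - holders):
                if len(holders) >= REPLICATION_FACTOR:
                    break
                holders.add(target)
                actions.append((block, source, target))
        return actions


nn = NameNodeSketch()
nn.block_locations = {"blk_1": {"dn1", "dn2", "dn3"}}
for dn in ("dn1", "dn2", "dn3", "dn4"):
    nn.receive_heartbeat(dn, now=0.0)
# dn2 goes silent; the other nodes keep reporting.
for dn in ("dn1", "dn3", "dn4"):
    nn.receive_heartbeat(dn, now=40.0)

actions = nn.re_replicate(now=45.0)
print(actions)  # [('blk_1', 'dn1', 'dn4')]
```

In this sketch `dn2` misses the timeout, so the block it held is copied from a surviving holder to `dn4`, bringing the replica count back to three.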
Whenever a user asks to access data in HDFS, the NameNode first looks up which DataNodes hold that data and picks the ones from which it can be read most quickly. Users do not have to search all the DataNodes themselves. The NameNode makes the data easy to reach by providing the address of a DataNode from which the user can read directly. Learn more about Internals of HDFS Data Read Operation.
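As a rough illustration of how the NameNode could choose the DataNode from which data is most quickly available, the sketch below orders the replicas of a block by a simplified network distance to the client. The function names, the `(rack, host)` tuples, and the distance values are assumptions for this example, not the Hadoop API.

```python
def network_distance(client, datanode):
    """Simplified distance between two (rack, host) pairs (illustrative values)."""
    if client == datanode:
        return 0  # replica is on the client's own machine
    if client[0] == datanode[0]:
        return 2  # same rack, different machine
    return 4      # different rack


def locate_block(block_map, block_id, client):
    """Return the DataNodes holding block_id, closest to the client first."""
    replicas = block_map[block_id]
    return sorted(replicas, key=lambda dn: (network_distance(client, dn), dn))


# Hypothetical metadata: one block with three replicas on two racks.
block_map = {
    "blk_1": [("rack2", "dn5"), ("rack1", "dn2"), ("rack1", "dn1")],
}
client = ("rack1", "dn1")

ordered = locate_block(block_map, "blk_1", client)
print(ordered)
# [('rack1', 'dn1'), ('rack1', 'dn2'), ('rack2', 'dn5')]
```

The client then reads directly from the first address in the list, so it never has to search the cluster itself.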
4. Example of Hadoop HDFS High Availability
HDFS provides high availability of data. Whenever a user requests data access from the NameNode, the NameNode looks up all the nodes on which that data is available and gives the user access through the node from which the data can be read most quickly. If one of those nodes turns out to be dead, the read is redirected, without the user noticing, to another node that holds the same data. The data is made available to the user without interruption, so it remains highly available even when a node fails, and the failure of an individual node does not affect applications. Learn HDFS Read write operations.
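The failover behaviour in this example can be sketched as a loop that tries each replica in turn and moves on when a node is dead. This is a minimal model under stated assumptions: the `cluster` dictionary, the `DataNodeDown` exception, and both function names are hypothetical and only stand in for real network reads.

```python
class DataNodeDown(Exception):
    """Raised when a simulated DataNode cannot be reached."""


def read_from(datanode, block_id, cluster):
    """Read a block from one simulated DataNode, or fail if it is dead."""
    node = cluster[datanode]
    if not node["alive"]:
        raise DataNodeDown(datanode)
    return node["blocks"][block_id]


def read_block(block_id, ordered_replicas, cluster):
    """Try replicas in order and return (serving node, data) from the first live one."""
    for datanode in ordered_replicas:
        try:
            return datanode, read_from(datanode, block_id, cluster)
        except DataNodeDown:
            continue  # fall through to the next replica; the caller never sees this
    raise IOError(f"no live replica for {block_id}")


# Hypothetical cluster state: the closest replica (dn1) has failed.
cluster = {
    "dn1": {"alive": False, "blocks": {"blk_1": b"hello"}},
    "dn2": {"alive": True, "blocks": {"blk_1": b"hello"}},
}

served_by, data = read_block("blk_1", ["dn1", "dn2"], cluster)
print(served_by, data)  # dn2 b'hello'
```

The caller receives the data from `dn2` and sees no error, which mirrors how a single node failure stays invisible to applications. Only when every replica is unreachable does the read fail.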
5. What were the Issues in legacy systems?
- Data became unavailable when a machine crashed.
- Users had to wait a long time to access their data, sometimes until the website came back up.
- Because data was unavailable, the completion of major projects at organizations was delayed for long periods, which put companies in critical situations.
- Limited features and functionalities.