Features of Hadoop HDFS – An Overview for Beginners
This tutorial explains what are the features of Hadoop HDFS. Hadoop Distributed File System(HDFS) is the world’s most reliable storage system, which can store a large quantity of structured as well as unstructured data. HDFS provides reliable storage of data with its unique feature of Data Replication. HDFS is highly fault-tolerant, reliable, available, scalable, distributed file system. To work with HDFS follow this command guide.
2. HDFS Introduction
HDFS is a distributed file system which provides redundant storage space for files having huge sizes. It is based on Google’s Filesystem (GFS or GoogleFS). It is designed to run on commodity hardware. HDFS is highly fault-tolerant and it provides high-throughput access to application data. It is suitable for applications with large datasets. HDFS is accepted as world’s most reliable storage system. It is anticipated that by the end of 2017 75% of the data available on the planet will be residing in HDFS. To learn more about HDFS follow this introductory guide.
3. Features of Hadoop HDFS
3.1. Fault Tolerance
Fault tolerance in HDFS refers to the working strength of a system in unfavorable conditions and how that system can handle such situations. HDFS is highly fault-tolerant, in HDFS data is divided into blocks and multiple copies of blocks are created on different machines in the cluster (this replica creation is configurable). So whenever if any machine in the cluster goes down, then a client can easily access their data from the other machine which contains the same copy of data blocks. HDFS also maintains the replication factor by creating a replica of blocks of data on another rack. Hence if suddenly a machine fails, then a user can access data from other slaves present in another rack. To learn more about Fault Tolerance follow this Guide.
3.2. High Availability
HDFS is a highly available file system, data gets replicated among the nodes in the HDFS cluster by creating a replica of the blocks on the other slaves present in HDFS cluster. Hence whenever a user wants to access this data, they can access their data from the slaves which contains its blocks and which is available on the nearest node in the cluster. And during unfavorable situations like a failure of a node, a user can easily access their data from the other nodes. Because duplicate copies of blocks which contain user data are created on the other nodes present in the HDFS cluster. To learn more about high availability follow this Guide.
3.3. Data Reliability
HDFS is a distributed file system which provides reliable data storage. HDFS can store data in the range of 100s of petabytes. It also stores data reliably on a cluster of nodes. HDFS divides the data into blocks and these blocks are stored on nodes present in HDFS cluster. It stores data reliably by creating a replica of each and every block present on the nodes present in the cluster and hence provides fault tolerance facility. If node containing data goes down, then a user can easily access that data from the other nodes which contain a copy of same data in the HDFS cluster. HDFS by default creates 3 copies of blocks containing data present in the nodes in HDFS cluster. Hence data is quickly available to the users and hence user does not face the problem of data loss. Hence HDFS is highly reliable.
Data Replication is one of the most important and unique features of Hadoop HDFS. In HDFS replication of data is done to solve the problem of data loss in unfavorable conditions like crashing of a node, hardware failure, and so on. Since data is replicated across a number of machines in the cluster by creating blocks. The process of replication is maintained at regular intervals of time by HDFS and HDFS keeps creating replicas of user data on different machines present in the cluster. Hence whenever any machine in the cluster gets crashed, the user can access their data from other machines which contain the blocks of that data. Hence there is no possibility of losing of user data. Follow this guide to learn more about the data read operation.
As HDFS stores data on multiple nodes in the cluster, when requirements increase we can scale the cluster. There is two scalability mechanism available: Vertical scalability – add more resources (CPU, Memory, Disk) on the existing nodes of the cluster. Another way is horizontal scalability – Add more machines in the cluster. The horizontal way is preferred since we can scale the cluster from 10s of nodes to 100s of nodes on the fly without any downtime.
3.6. Distributed Storage
In HDFS all the features are achieved via distributed storage and replication. HDFS data is stored in distributed manner across the nodes in HDFS cluster. In HDFS data is divided into blocks and is stored on the nodes present in HDFS cluster. And then replicas of each and every block are created and stored on other nodes present in the cluster. So if a single machine in the cluster gets crashed we can easily access our data from the other nodes which contain its replica.