Apache Hadoop HDFS – An Introduction to HDFS
In this Hadoop tutorial, we discuss HDFS (Hadoop Distributed File System), one of the most widely used reliable storage systems. HDFS is Hadoop’s storage layer, and it provides high availability, reliability, and fault tolerance. It is anticipated that 75% of the world’s data will be stored in Hadoop HDFS by the end of 2017. This introductory guide gives an overview of what HDFS is and covers the basics of HDFS, HDFS nodes, HDFS daemons, and more.
2. What is Hadoop HDFS?
Apache Hadoop HDFS is a distributed file system that provides redundant storage for very large files, in the range of terabytes to petabytes. HDFS stores data reliably: each file is broken into blocks, and the blocks are distributed across the nodes in a cluster. Each block is then replicated, meaning that copies of it are created on different machines. If a machine goes down or crashes, the data can still be retrieved from the other machines that hold copies. By default, 3 copies of each block are kept on different machines, which makes HDFS highly fault-tolerant. Because the blocks of a file are spread across many nodes, reads and writes can proceed in parallel, and a user can access the data from any machine in the cluster. For these reasons HDFS is widely used as a platform for storing large volumes and many varieties of data.
Before working with HDFS, you must have Hadoop installed and running. To install and configure Hadoop, follow this Installation Guide.
3. HDFS Nodes
HDFS has a master/slave architecture with two types of nodes: a master and slaves. The master node maintains various data storage and processing management services in distributed Hadoop clusters. The actual data in HDFS is stored on the slave nodes, and data is also processed there.
The master is the centerpiece of HDFS. It stores the HDFS metadata: all the information about the files stored in HDFS, including the blocks that make up each file and where across the cluster those blocks are kept. This information is what allows a file to be reconstructed from its blocks. The master is the most critical part of HDFS: if all the masters crash or go down, the HDFS cluster is considered down and becomes unusable.
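To make the master's metadata concrete, here is a minimal sketch in Python. It is a toy model, not HDFS code: the two dictionaries, the block IDs, the slave names, and the `locate_file` helper are all hypothetical and only illustrate the idea of mapping a file to its blocks and each block to the machines that hold it.

```python
from typing import Dict, List, Tuple

# Hypothetical metadata: file path -> ordered list of block IDs.
file_to_blocks: Dict[str, List[str]] = {
    "/logs/app.log": ["blk_1", "blk_2"],
}

# Hypothetical metadata: block ID -> slaves holding a replica of that block.
block_locations: Dict[str, List[str]] = {
    "blk_1": ["slave-a", "slave-b", "slave-c"],
    "blk_2": ["slave-b", "slave-c", "slave-d"],
}


def locate_file(path: str) -> List[Tuple[str, List[str]]]:
    """Return (block_id, replica_holders) pairs in file order.

    A client would read the blocks in this order to rebuild the file.
    """
    return [(blk, block_locations[blk]) for blk in file_to_blocks[path]]


for block_id, holders in locate_file("/logs/app.log"):
    print(block_id, "->", holders)
```

Note that the master in this picture holds only the lookup tables; the block contents themselves live on the slaves.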
The actual files, that is, the client's data, reside on the slaves. The most important function of a slave is to manage the storage attached to the node on which it runs. As noted above, files in HDFS are broken into smaller blocks that are distributed across the nodes in the cluster, and the slaves manage these file blocks. So that the master can carry out file system operations, each slave sends it information about the blocks it holds. HDFS has more than one slave, and the replicas of blocks are created across them.
4. HDFS Daemons
In Hadoop HDFS there are three daemons: NameNode, DataNode, and SecondaryNameNode. Each daemon runs in its own JVM in the background to provide the required services.
NameNode is the master daemon of HDFS and runs on all the masters. It manages the HDFS file system namespace: it keeps a record of all the files present in HDFS and of the changes made to the file system namespace.
DataNode is the slave daemon of HDFS and runs on all the slaves. Its function is to store data in HDFS; it holds the actual data blocks. An HDFS cluster usually has more than one DataNode, and data is replicated across the DataNodes in the cluster.
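The job of the SecondaryNameNode is to perform checkpointing and record-keeping functions for the NameNode. It periodically pulls the namespace data (the file system image and the log of recent changes) from the NameNode and merges them into an up-to-date checkpoint. If the NameNode goes down, this checkpoint can be used to bring up a replacement NameNode manually, although changes made after the last checkpoint may be lost. One important point: the SecondaryNameNode is not a hot standby for the NameNode.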
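The periodic pull from the NameNode can be pictured as a checkpoint: a saved snapshot of the namespace (called the fsimage in HDFS) is combined with the list of changes recorded since that snapshot (the edit log). The sketch below illustrates only that merge idea. The data shapes, the two operation names, and the `apply_edits` function are assumptions made for illustration and do not reflect the real on-disk formats.

```python
from typing import Dict, List, Tuple

Namespace = Dict[str, List[str]]    # path -> block IDs (hypothetical shape)
Edit = Tuple[str, str, List[str]]   # (operation, path, block IDs)


def apply_edits(image: Namespace, edits: List[Edit]) -> Namespace:
    """Replay logged changes on top of a snapshot and return the new snapshot."""
    merged = {path: list(blocks) for path, blocks in image.items()}
    for op, path, blocks in edits:
        if op == "create":
            merged[path] = list(blocks)
        elif op == "delete":
            merged.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged


# Last saved snapshot plus the changes recorded since then.
fsimage: Namespace = {"/a.txt": ["blk_1"]}
edit_log: List[Edit] = [
    ("create", "/b.txt", ["blk_2"]),
    ("delete", "/a.txt", []),
]

checkpoint = apply_edits(fsimage, edit_log)
print(checkpoint)  # {'/b.txt': ['blk_2']}
```

Any change made after the edit log was pulled is missing from the checkpoint, which is one way to see why a checkpointing node is not the same as a hot standby.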
5. How Data gets Stored in HDFS
In Hadoop HDFS, data files are divided into smaller chunks called blocks. These blocks are distributed across a group of machines known as slaves. The slave machines then create replicas of the blocks on other machines in the cluster. Each slave sends reports to the master listing the blocks stored on it. When a slave receives an instruction from the master, such as add, copy, move, or delete, it performs that operation on the file system and then reports completion of the task back to the master.
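The storage flow above (split into blocks, spread over slaves, replicate, report to the master) can be sketched as a small simulation. This is a simplified illustration, not the HDFS implementation: the function names are hypothetical, the tiny block size is chosen only so the output is readable, and the round-robin placement is an assumption that stands in for the real placement policy.

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut the data into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, slaves: List[str],
                   replication: int) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct slaves (round-robin)."""
    if replication > len(slaves):
        raise ValueError("not enough slaves for the requested replication")
    return {
        block: [slaves[(block + r) % len(slaves)] for r in range(replication)]
        for block in range(num_blocks)
    }


def block_reports(placement: Dict[int, List[str]]) -> Dict[str, List[int]]:
    """Build what each slave would report to the master: the blocks it holds."""
    reports: Dict[str, List[int]] = {}
    for block, holders in placement.items():
        for slave in holders:
            reports.setdefault(slave, []).append(block)
    return reports


blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_replicas(len(blocks), ["s1", "s2", "s3", "s4"], replication=3)
reports = block_reports(placement)

print(blocks)      # [b'abcd', b'efgh', b'ij']
print(placement)   # {0: ['s1', 's2', 's3'], 1: ['s2', 's3', 's4'], 2: ['s3', 's4', 's1']}
print(reports["s1"])  # [0, 2]
```

The reports are the inverse of the placement table, which is how the master can learn block locations from what the slaves tell it.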
6. Blocks in HDFS
A block in HDFS is a segment of a file. The framework splits the data stored in HDFS into blocks, and these blocks are stored on different nodes in the cluster. The default block size in HDFS is 128 MB, and it can be increased as required. Replicas of each block are then created on other machines in the cluster. By default, three copies of each block are kept (the replication factor is configurable). So if a machine goes down, the blocks stored on that machine can still be read from the other two machines that hold replicas.
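A short worked example shows what the defaults mean for a single file. The sketch below uses the 128 MB block size and replication factor of 3 stated above; the helper names and the 500 MB file are hypothetical, chosen only for the arithmetic.

```python
import math

BLOCK_SIZE_MB = 128   # default block size from the text
REPLICATION = 3       # default number of copies from the text


def block_count(file_size_mb: int, block_size_mb: int = BLOCK_SIZE_MB) -> int:
    """Number of blocks needed to hold the file."""
    return math.ceil(file_size_mb / block_size_mb)


def last_block_size_mb(file_size_mb: int,
                       block_size_mb: int = BLOCK_SIZE_MB) -> int:
    """Size of the final block, which holds only the data left over."""
    if file_size_mb == 0:
        return 0
    remainder = file_size_mb % block_size_mb
    return remainder if remainder else block_size_mb


def raw_storage_mb(file_size_mb: int, replication: int = REPLICATION) -> int:
    """Total cluster storage consumed once every block is replicated."""
    return file_size_mb * replication


print(block_count(500))         # 4
print(last_block_size_mb(500))  # 116
print(raw_storage_mb(500))      # 1500
```

So a 500 MB file becomes three full 128 MB blocks plus one 116 MB block, and with three copies of each block it occupies 1500 MB across the cluster.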
7. Heartbeat Message
Because the environment is distributed, the master needs a mechanism for tracking the current status of every slave in the cluster. For this purpose, each slave sends a small heartbeat message to the master every 3 seconds to say “I am alive”. If the master receives no heartbeat from a particular slave for more than 10 minutes, it considers that slave to have failed and starts re-replicating the blocks that were stored on it to other slaves. The slaves can talk to each other to move and copy data so that the required replication is maintained. Once the master has marked a machine as dead, it does not allocate to it any new work submitted by clients.
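The failure detection described here can be sketched as a small monitor on the master side. This is an illustrative model only: the `HeartbeatMonitor` class and the `under_replicated` helper are hypothetical names, the 3-second and 10-minute values come from the text, and timestamps are passed in explicitly (in seconds) so that the example stays deterministic.

```python
from typing import Dict, List

HEARTBEAT_INTERVAL_S = 3   # how often a slave reports in, per the text
DEAD_AFTER_S = 10 * 60     # silence longer than this marks a slave as dead


class HeartbeatMonitor:
    """Tracks the last time each slave was heard from."""

    def __init__(self, dead_after_s: float = DEAD_AFTER_S) -> None:
        self.dead_after_s = dead_after_s
        self.last_seen: Dict[str, float] = {}

    def record(self, slave: str, now: float) -> None:
        """Register a heartbeat received from `slave` at time `now`."""
        self.last_seen[slave] = now

    def dead_slaves(self, now: float) -> List[str]:
        """Slaves whose last heartbeat is older than the timeout."""
        return sorted(
            slave for slave, seen in self.last_seen.items()
            if now - seen > self.dead_after_s
        )


def under_replicated(block_locations: Dict[str, List[str]],
                     dead: List[str], replication: int = 3) -> List[str]:
    """Blocks that have fewer live replicas than required."""
    dead_set = set(dead)
    return sorted(
        block for block, holders in block_locations.items()
        if sum(1 for h in holders if h not in dead_set) < replication
    )


monitor = HeartbeatMonitor()
monitor.record("s1", now=0)     # s1 is heard once, then goes silent
monitor.record("s2", now=0)
monitor.record("s2", now=590)   # s2 keeps reporting

dead = monitor.dead_slaves(now=601)
print(dead)  # ['s1']

locations = {"blk_1": ["s1", "s2", "s3"], "blk_2": ["s2", "s3", "s4"]}
print(under_replicated(locations, dead))  # ['blk_1']
```

The blocks returned by `under_replicated` are the ones the master would schedule for copying to other slaves so that the required number of replicas is restored.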