This tutorial explains the HDFS data read operation end to end. Because data is stored in a distributed manner, read operations can run in parallel. In this tutorial you will learn what HDFS is, how the HDFS read data flow works, and how the client interacts directly with the slaves and reads data blocks from them.
3. HDFS Introduction
HDFS is the storage layer of Hadoop and stores data reliably. HDFS splits data into blocks and distributes them over multiple nodes of the cluster. It provides features such as high availability, reliability, replication, fault tolerance, and scalability. Because each block is replicated on several nodes, HDFS is designed to stay reliable even when individual nodes fail. To learn more about HDFS, follow the HDFS introduction tutorial.
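To make block splitting and replication concrete, here is a minimal Python sketch. It is a toy model, not Hadoop code: the function names, the 4-byte block size, the node names `dn1` to `dn4`, and the round-robin placement are all assumptions chosen for illustration. Real HDFS uses much larger blocks (128 MB by default in recent versions) and a rack-aware placement policy.

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut the data into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: List[str], replication: int) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes (simple round-robin, an assumption)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


data = b"abcdefghij"  # a 10-byte "file"
blocks = split_into_blocks(data, block_size=4)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=3)

for block_id, hosts in placement.items():
    print(block_id, blocks[block_id], hosts)
```

Each block ends up on three different nodes, so losing any single node still leaves two copies of every block available for reading.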
4. HDFS Data Read Operation
Below are the end-to-end steps of the parallel HDFS data read operation: how the client interacts with the NameNode and DataNodes, how the client is authenticated, how it actually reads the data, and how the read is fault tolerant.
i. Client interacts with NameNode
The NameNode holds all the metadata about which blocks make up a specific file and which slave stores each block. Hence the client needs to interact with the NameNode to get the addresses of the slaves where the blocks are actually stored. The NameNode provides the details of the slaves that contain the required blocks.
Let's understand the client and NameNode interaction in more detail. To access the blocks stored in HDFS, the client calls the 'open()' method on the FileSystem object, which for HDFS is an instance of DistributedFileSystem. The FileSystem then contacts the NameNode using RPC (Remote Procedure Call) to get the locations of the slaves that contain the blocks of the file requested by the client.
At this stage, the NameNode checks whether the client is authorized to access the file; if so, it sends the block locations. It also gives the client a security token, which the client must present to the slaves for authentication.
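The NameNode side of this exchange can be sketched as a small in-memory model. This is only an illustration of the steps described above (look up the file, check authorization, return block locations with a token). `ToyNameNode`, `BlockLocation`, the path `/logs/app.log`, and the user names are hypothetical; the real client goes through `FileSystem.open()` and an RPC call, and the real token format is not modeled here.

```python
import secrets
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class BlockLocation:
    block_id: int
    datanodes: List[str]  # hosts that hold a replica of this block
    token: str            # opaque token the client later shows to the DataNodes


class ToyNameNode:
    """Hypothetical stand-in for NameNode metadata; not a Hadoop class."""

    def __init__(self) -> None:
        self.file_blocks: Dict[str, List[int]] = {}   # path -> ordered block ids
        self.block_hosts: Dict[int, List[str]] = {}   # block id -> replica hosts
        self.readers: Dict[str, Set[str]] = {}        # path -> users allowed to read

    def add_file(self, path: str, block_hosts: Dict[int, List[str]], readers: Set[str]) -> None:
        self.file_blocks[path] = list(block_hosts)
        self.block_hosts.update(block_hosts)
        self.readers[path] = set(readers)

    def get_block_locations(self, user: str, path: str) -> List[BlockLocation]:
        if path not in self.file_blocks:
            raise FileNotFoundError(path)
        if user not in self.readers[path]:
            raise PermissionError(f"{user} may not read {path}")
        # Only metadata is returned; no file data passes through the NameNode.
        return [
            BlockLocation(b, self.block_hosts[b], secrets.token_hex(8))
            for b in self.file_blocks[path]
        ]


namenode = ToyNameNode()
namenode.add_file("/logs/app.log", {0: ["dn1", "dn2"], 1: ["dn2", "dn3"]}, readers={"alice"})

locations = namenode.get_block_locations("alice", "/logs/app.log")
for loc in locations:
    print(loc.block_id, loc.datanodes)
```

An unauthorized user gets a `PermissionError` and never learns where the blocks are stored, which mirrors the authorization check described above.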
ii. Client interacts with DataNode
After receiving the addresses of the slaves that contain the blocks, the client interacts directly with the slaves to read the blocks. The data flows directly from the slaves to the client (it does not flow via the NameNode). The client reads the data blocks from multiple slaves in parallel.
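The idea of fetching blocks from several slaves at once can be sketched with a thread pool. This is a toy model under stated assumptions: the `datanodes` dictionary, the `fetch_block` helper, and the block contents are invented for illustration, and a dictionary lookup stands in for a network transfer. In a real cluster, parallelism may equally come from many readers (for example, one task per block) rather than from one client thread pool.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

# Hypothetical cluster state: host -> {block id -> block bytes}
datanodes: Dict[str, Dict[int, bytes]] = {
    "dn1": {0: b"Hello, "},
    "dn2": {1: b"HDFS "},
    "dn3": {2: b"reader!"},
}

# Block locations as they might be returned by the NameNode: (block id, host)
locations: List[Tuple[int, str]] = [(0, "dn1"), (1, "dn2"), (2, "dn3")]


def fetch_block(location: Tuple[int, str]) -> bytes:
    """Read one block directly from the host that stores it."""
    block_id, host = location
    return datanodes[host][block_id]


with ThreadPoolExecutor(max_workers=3) as pool:
    # pool.map returns results in input order, so the blocks reassemble correctly
    parts = list(pool.map(fetch_block, locations))

content = b"".join(parts)
print(content)
```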
Let's understand the client and DataNode interaction in more detail. The client sends a read request to the slaves using the FSDataInputStream object. All interactions between the client and the DataNodes are managed by DFSInputStream, which is part of the client API. The client presents the authentication token provided by the NameNode to the slaves. The client then starts reading the blocks using the InputStream API, and data is streamed continuously from the DataNode to the client. On reaching the end of a block, DFSInputStream closes the connection to that DataNode.
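The block-by-block read loop can be sketched as follows. Again this is a simplified model, not the DFSInputStream implementation: `ToyDataNode`, `read_file`, and the token strings `t0` and `t1` are hypothetical, and token validation is reduced to a set lookup. The fallback to another replica when a DataNode is unreachable is included to illustrate how a read can be fault tolerant.

```python
from typing import Dict, List, Set, Tuple


class ToyDataNode:
    """Hypothetical stand-in for a DataNode; not a Hadoop class."""

    def __init__(self, blocks: Dict[int, bytes], valid_tokens: Set[str], alive: bool = True) -> None:
        self.blocks = blocks
        self.valid_tokens = valid_tokens
        self.alive = alive

    def read_block(self, block_id: int, token: str) -> bytes:
        if not self.alive:
            raise ConnectionError("DataNode unreachable")
        if token not in self.valid_tokens:
            raise PermissionError("invalid block token")
        return self.blocks[block_id]


def read_file(locations: List[Tuple[int, List[str], str]],
              datanodes: Dict[str, ToyDataNode]) -> bytes:
    """Read blocks in order, trying each replica host until one answers."""
    data = bytearray()
    for block_id, hosts, token in locations:
        for host in hosts:
            try:
                data += datanodes[host].read_block(block_id, token)
                break  # end of block reached; move on to the next block
            except ConnectionError:
                continue  # this replica is down; try the next host
        else:
            raise IOError(f"no live replica for block {block_id}")
    return bytes(data)


datanodes = {
    "dn1": ToyDataNode({0: b"first-"}, {"t0"}, alive=False),  # simulated failure
    "dn2": ToyDataNode({0: b"first-", 1: b"second"}, {"t0", "t1"}),
}
locations = [(0, ["dn1", "dn2"], "t0"), (1, ["dn2"], "t1")]

content = read_file(locations, datanodes)
print(content)
```

Block 0 is read from `dn2` because `dn1` is down, while a read with a wrong token is rejected by the DataNode rather than retried.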
The read operation is highly optimized because it does not involve the NameNode in the actual data transfer; otherwise the NameNode would become a bottleneck. Thanks to this distributed, parallel read mechanism, thousands of clients can read data directly from the DataNodes very efficiently.
To work on HDFS and perform various operations (read, write, copy, move, change permission, etc.), follow this HDFS command list.