Data Read Operation in HDFS – A Quick HDFS Guide

1. Objective

This tutorial explains the end-to-end HDFS data read operation. Since data is stored in a distributed manner, a read can fetch blocks from several nodes in parallel. In this tutorial you will learn what HDFS is, the HDFS read data flow, and how the client interacts directly with the slave nodes to read data blocks from them.


2. HDFS Introduction

HDFS is the storage layer of Hadoop, and it stores data very reliably. HDFS splits data into blocks and stores them in a distributed fashion across multiple nodes of the cluster. HDFS provides features such as high availability, reliability, replication, fault tolerance, and scalability, and is widely regarded as one of the most reliable storage systems available. To learn more about HDFS, follow the HDFS introduction tutorial.


3. HDFS Data Read Operation

[Figure: Data Read Mechanism in HDFS]

Below are the end-to-end steps of the parallel HDFS data read operation: how the client interacts with the NameNode and DataNodes, how the client is authenticated, how it actually reads the data, and how the read is made fault tolerant.

i. Client interacts with the NameNode

The NameNode holds all the metadata: which blocks make up a specific file, and which slave node in HDFS stores each block. Hence the client must first contact the NameNode to obtain the addresses of the slaves where the blocks are actually stored. The NameNode responds with the details of the slaves that hold the required blocks.

Let's look at the client and NameNode interaction in more detail. To access blocks stored in HDFS, the client calls the 'open()' method on the FileSystem object, an instance of DistributedFileSystem. The FileSystem then contacts the NameNode via RPC (Remote Procedure Call) to get the locations of the slaves that hold the blocks of the requested file.


At this point, the NameNode checks whether the client is authorized to access the file. If so, it returns the block locations and also issues a security token, which the client must present to the slaves for authentication.
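The NameNode lookup and token handout described above can be modeled with a short, runnable Python sketch. Note that `NameNodeStub`, its block map, and `open_file` are hypothetical stand-ins for illustration, not the real Hadoop RPC interface:

```python
import uuid

class NameNodeStub:
    """Toy stand-in for the NameNode: keeps the file -> block mapping
    and the list of DataNodes holding each block's replicas."""
    def __init__(self):
        self.block_map = {
            "/user/data/file.txt": [
                ("blk_1", ["datanode1", "datanode2", "datanode3"]),
                ("blk_2", ["datanode2", "datanode4", "datanode5"]),
            ],
        }
        self.acl = {"/user/data/file.txt": {"alice"}}

    def open_file(self, path, user):
        # Authorization is checked before any block locations are revealed.
        if user not in self.acl.get(path, set()):
            raise PermissionError(f"{user} is not allowed to read {path}")
        token = str(uuid.uuid4())  # security token the client shows to DataNodes
        return token, self.block_map[path]

namenode = NameNodeStub()
token, locations = namenode.open_file("/user/data/file.txt", "alice")
print([block_id for block_id, _ in locations])  # ['blk_1', 'blk_2']
```

The key point the sketch captures: the NameNode hands back only metadata (block IDs, replica locations, a token); no file data ever flows through it.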

ii. Client interacts with the DataNodes

After receiving the addresses of the slaves holding the blocks, the client interacts with those slaves directly to read the blocks. The data flows directly from the slaves to the client (it does not pass through the NameNode), and the client can read data blocks from multiple slaves in parallel.

Let's look at the client and DataNode interaction in more detail. The client sends a read request to the slaves using the FSDataInputStream object. All interactions between the client and a DataNode are managed by DFSInputStream, which is part of the client API. The client presents to each slave the authentication token issued by the NameNode, then starts reading blocks through the InputStream API; data is streamed continuously from the DataNode to the client. When the end of a block is reached, DFSInputStream closes the connection to that DataNode and connects to the DataNode holding the next block.
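The block-by-block streaming just described can be sketched in runnable Python. Here `DataNodeStub` and `read_file` are hypothetical stand-ins playing the roles of the DataNodes and DFSInputStream; the real classes live in the Hadoop Java client:

```python
class DataNodeStub:
    """Toy DataNode holding a few block replicas in memory."""
    def __init__(self, blocks):
        self.blocks = blocks  # block id -> bytes

    def read_block(self, block_id, token):
        if not token:
            raise PermissionError("missing security token")
        return self.blocks[block_id]

def read_file(block_locations, datanodes, token):
    """Plays the role of DFSInputStream: for each block, contact the
    first (closest) replica, stream the block, then drop that connection
    before moving on to the DataNode holding the next block."""
    data = b""
    for block_id, replicas in block_locations:
        dn = datanodes[replicas[0]]             # pick the nearest replica
        data += dn.read_block(block_id, token)  # stream this block's bytes
        # (connection to this DataNode is closed here)
    return data

datanodes = {
    "datanode1": DataNodeStub({"blk_1": b"hello "}),
    "datanode2": DataNodeStub({"blk_1": b"hello ", "blk_2": b"world"}),
}
locations = [("blk_1", ["datanode1", "datanode2"]),
             ("blk_2", ["datanode2"])]
print(read_file(locations, datanodes, token="tok-123"))  # b'hello world'
```

Because every block has several replicas, a real DFSInputStream can also fail over to the next replica in the list if a DataNode dies mid-read, which is what makes the read fault tolerant.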

The read operation is highly optimized because the NameNode is not involved in the actual data transfer; otherwise the NameNode would become a bottleneck. Thanks to this distributed, parallel read mechanism, thousands of clients can read data directly from the DataNodes very efficiently.
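The parallel aspect can be sketched too. The snippet below (hypothetical names, standard library only) fetches the blocks of a file concurrently with a thread pool and reassembles them in file order, which is only possible because the NameNode never sits on the data path:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_block(block_id):
    """Stand-in for a network read from whichever DataNode holds the block."""
    store = {"blk_1": b"hello ", "blk_2": b"big ", "blk_3": b"data"}
    return store[block_id]

def parallel_read(block_ids):
    # Blocks are fetched concurrently, but reassembled in file order,
    # so the result is identical to a sequential read.
    with ThreadPoolExecutor(max_workers=4) as pool:
        chunks = list(pool.map(fetch_block, block_ids))
    return b"".join(chunks)

print(parallel_read(["blk_1", "blk_2", "blk_3"]))  # b'hello big data'
```

`ThreadPoolExecutor.map` preserves input order even though the fetches overlap in time, which is the property a parallel block read needs.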

To work with HDFS and perform various operations (read, write, copy, move, change permissions, etc.), follow this HDFS command list.


