Describe the read operation in HDFS

Viewing 3 reply threads
    • #6301
      DataFlair Team
      Spectator

      How is data or a file read in HDFS?

    • #6305
      DataFlair Team
      Spectator

      There are a few important components we must understand before looking at the read operation in HDFS.
      HDFS client – This is the client that interacts with the Namenode and Datanodes on behalf of the actual client, who requests either a read or a write operation on the data.
      Namenode: The Namenode stores file metadata information, such as the file name, block ids, the Datanode addresses where the blocks reside, file permissions, etc.
      Datanode: The Datanode is the actual node where the data resides in the form of blocks.
      Blocks: Blocks are the physical divisions of the actual data. The data is split into a number of blocks, and every block has the same size. By default, each block is either 128 MB or 64 MB, depending upon the Hadoop version.
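      As a quick illustration of the block split, assuming the 128 MB default: a 300 MB file occupies three blocks (128 MB + 128 MB + 44 MB). A minimal sketch (the function name is hypothetical, for illustration only):

```python
BLOCK_SIZE_MB = 128  # default block size in recent Hadoop versions

def split_into_blocks(file_size_mb):
    """Return the sizes of the HDFS blocks a file of the given size occupies."""
    full_blocks = file_size_mb // BLOCK_SIZE_MB
    sizes = [BLOCK_SIZE_MB] * full_blocks
    remainder = file_size_mb % BLOCK_SIZE_MB
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover data
    return sizes

print(split_into_blocks(300))  # a 300 MB file -> [128, 128, 44]
```

      Note that the last block is usually smaller than the block size; HDFS does not pad it out to a full 128 MB on disk.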

      Now, let's understand the high-level read operation in HDFS.
      1. First, the client interacts with the HDFS client and provides the name of the file it wants to read.
      2. The HDFS client calls the Namenode to get the block locations.
      3. On the HDFS client's request, the Namenode checks the client's permissions. If the client has permission to read the file, the Namenode shares the block details with the HDFS client; if not, the request is denied.
      4. After getting the list of blocks and the addresses of the Datanodes, the HDFS client starts interacting with the Datanodes to read the blocks.
      5. On completing the read, the client closes the connection.
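      The five steps above can be sketched as a toy simulation. All class and method names here are hypothetical stand-ins, not the real Hadoop API; the point is the division of labour between Namenode and Datanodes:

```python
class NameNode:
    """Toy Namenode: holds file metadata and read permissions."""
    def __init__(self):
        self.metadata = {"/logs/app.log": ["blk_1", "blk_2"]}    # file -> block ids
        self.block_locations = {"blk_1": "datanode-1", "blk_2": "datanode-2"}
        self.readers = {"/logs/app.log": {"alice"}}              # file -> allowed users

    def get_block_locations(self, user, path):
        # Step 3: permission check before any block details are shared.
        if user not in self.readers.get(path, set()):
            raise PermissionError(f"{user} may not read {path}")
        return [(blk, self.block_locations[blk]) for blk in self.metadata[path]]

class HDFSClient:
    """Toy HDFS client: asks the Namenode for locations, reads from Datanodes."""
    def __init__(self, namenode, datanodes):
        self.namenode, self.datanodes = namenode, datanodes

    def read_file(self, user, path):
        locations = self.namenode.get_block_locations(user, path)   # steps 2-3
        # Step 4: the actual bytes come from the Datanodes, not the Namenode.
        return b"".join(self.datanodes[node][blk] for blk, node in locations)

datanodes = {"datanode-1": {"blk_1": b"hello "}, "datanode-2": {"blk_2": b"world"}}
client = HDFSClient(NameNode(), datanodes)
print(client.read_file("alice", "/logs/app.log"))  # b'hello world'
```

      Notice that the Namenode never touches the data itself; it only hands out metadata, which is why it does not become a bottleneck during reads.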

      An API-level view of the read operation in HDFS:
      1. To read a file from HDFS, the client calls the open() method on the DistributedFileSystem object. The DistributedFileSystem object then interacts with the Namenode to get the list of blocks along with the addresses of the Datanodes where these blocks are present.
      2. The Namenode first verifies that the user is authorized. If the user is authorized to read the file, the Namenode returns the list of blocks and, for each block, the best possible Datanode location. To determine the best location, the Namenode sorts the Datanodes by their distance from the client.
      3. On getting the list of blocks with their locations, the client calls the read() method on the FSDataInputStream object, which starts reading the blocks from the Datanodes.
      4. If any failure occurs while reading a block from a Datanode, the DistributedFileSystem object calls the Namenode and asks for the next Datanode location for that block. The Namenode then provides the address of the next best Datanode holding a copy of the block.
      5. When the client finishes reading the blocks, it calls the close() method on the FSDataInputStream object.
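      The failover in step 4 can be sketched like this: each block is replicated on several Datanodes, the locations arrive sorted by proximity, and the reader falls back to the next replica when one fails. A minimal simulation (function and node names are hypothetical):

```python
def read_block(block_id, sorted_datanodes, fetch):
    """Try each replica location in proximity order; fall back on failure."""
    last_error = None
    for node in sorted_datanodes:
        try:
            return fetch(node, block_id)    # read succeeded on this Datanode
        except IOError as err:
            last_error = err                # this replica failed; try the next one
    raise IOError(f"all replicas of {block_id} failed") from last_error

# Simulated cluster: datanode-1 is down, datanode-2 holds a healthy replica.
replicas = {"datanode-2": b"block contents"}

def fetch(node, block_id):
    if node not in replicas:
        raise IOError(f"{node} unreachable")
    return replicas[node]

print(read_block("blk_1", ["datanode-1", "datanode-2"], fetch))  # b'block contents'
```

      Because every block is replicated (three times by default), a single Datanode failure is invisible to the reading application.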

      Follow the link to learn more about: Read Operation in HDFS

    • #6306
      DataFlair Team
      Spectator

      The following steps happen during a read operation in HDFS:
      Step 1: On the client node, a client JVM process uses the open method of the DistributedFileSystem class to ask the Namenode for the location of the required data.
      Step 2: The Namenode checks the user's access rights to the data and, if allowed, returns the block locations of the data.
      Step 3: The client JVM then uses the read method of the FSDataInputStream class to access the data from various Datanodes in parallel.
      Step 4: After getting the data from the Datanodes, the client uses the close method to end the read operation.

      Note: During a read operation, if any Datanode fails, the client asks the Namenode to provide an alternate node for that block of data.
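      The parallel access described in Step 3 can be sketched with a thread pool that fetches each block concurrently and reassembles the results in file order. This is a hypothetical simulation, not the real client:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_block(node_and_data):
    """Simulated Datanode read; in real HDFS this would be a network call."""
    node, data = node_and_data
    return data

def read_in_parallel(block_plan):
    """block_plan: ordered list of (datanode, block_bytes) pairs."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # map() yields results in input order, so blocks reassemble correctly
        # even when the underlying fetches complete out of order.
        return b"".join(pool.map(fetch_block, block_plan))

plan = [("datanode-1", b"part1 "), ("datanode-2", b"part2 "), ("datanode-3", b"part3")]
print(read_in_parallel(plan))  # b'part1 part2 part3'
```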


    • #6308
      DataFlair Team
      Spectator

      Steps involved during the read operation in HDFS

      1. The client submits a request to read a file to the HDFS Client.

      2. The HDFS Client invokes the open() method on the FileSystem object, which is an instance of DistributedFileSystem.

      3. The FileSystem interacts with the NameNode using a Remote Procedure Call (RPC) to get the block location details.

      4. The NameNode checks the access credentials of the HDFS Client.

      4.1. If authorised, the following details are passed back to the calling object:

      4.1.1. The list of blocks

      4.1.2. The addresses of the DataNodes (DN), sorted by proximity to the client

      4.1.3. A security token for authentication when accessing the DataNodes

      4.2. If NOT authorised, the request is denied.

      5. The HDFS Client invokes the read() method of the FSDataInputStream object to read the data DIRECTLY, in parallel, from the various DataNodes. *Refer Notes*

      6. After reaching the end of the blocks, the connection is closed by calling the close() method.

      Notes

      a) All interactions between the HDFS Client and the DataNodes are managed by DFSInputStream.

      b) The read operation is highly optimised, as the client reads data DIRECTLY from the DataNodes.

      c) The NameNode is not involved in the actual data read. This prevents the NameNode from becoming a bottleneck.

      d) Due to the distributed, parallel read mechanism, thousands of clients can read data directly from the DataNodes very efficiently.
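      The proximity sorting mentioned in 4.1.2 can be sketched by ranking replica locations by network distance from the client, in the spirit of Hadoop's rack awareness. The distance values here are illustrative, not the real implementation:

```python
# Toy network distances: 0 = same node, 2 = same rack, 4 = different rack.
def distance(client_location, node_location):
    if client_location == node_location:
        return 0                                   # local read, cheapest
    if client_location[0] == node_location[0]:     # same rack id
        return 2
    return 4                                       # off-rack read, most expensive

def sort_replicas(client_location, replica_locations):
    """Order replica Datanodes from nearest to farthest, as the NameNode does."""
    return sorted(replica_locations, key=lambda loc: distance(client_location, loc))

client = ("rack1", "node2")
replicas = [("rack2", "node7"), ("rack1", "node2"), ("rack1", "node5")]
print(sort_replicas(client, replicas))
# nearest first: [('rack1', 'node2'), ('rack1', 'node5'), ('rack2', 'node7')]
```

      Reading from the nearest replica keeps most traffic within a rack, which is why a client running on a DataNode that holds a replica reads it locally.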

