Describe the read operation in HDFS

Viewing 3 reply threads
    • #6301
      DataFlair Team
      Spectator

      How is data or a file read in HDFS?

    • #6305
      DataFlair Team
      Spectator

      There are a few important components we must understand before looking at the read operation in HDFS.
      HDFS client – This is the client that interacts with the Namenode and Datanodes on behalf of the actual client, who requests either a read or a write operation on the data.
      Namenode: The Namenode stores file metadata information, such as the file name, block ids, the Datanode addresses where the blocks reside, file permissions, etc.
      Datanode: The Datanode is the actual node where the data resides in the form of blocks.
      Blocks: Blocks are the physical divisions of the actual data. The data is split into a number of blocks, and every block has the same size. By default, each block is either 128 MB or 64 MB, depending upon the Hadoop version.
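      As a quick illustration of the block split, assuming the 128 MB default: a 300 MB file occupies three blocks (128 MB + 128 MB + 44 MB). A minimal sketch (the function name is hypothetical, for illustration only):

```python
BLOCK_SIZE_MB = 128  # default block size in recent Hadoop versions

def split_into_blocks(file_size_mb):
    """Return the sizes of the HDFS blocks a file of the given size occupies."""
    full_blocks = file_size_mb // BLOCK_SIZE_MB
    sizes = [BLOCK_SIZE_MB] * full_blocks
    remainder = file_size_mb % BLOCK_SIZE_MB
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover data
    return sizes

print(split_into_blocks(300))  # a 300 MB file -> [128, 128, 44]
```

      Note that the last block is usually smaller than the block size; HDFS does not pad it out to a full 128 MB on disk.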

      Now, let's understand the high-level read operation in HDFS.
      1. First, the client interacts with the HDFS client and provides the name of the file it wants to read.
      2. The HDFS client calls the Namenode to get the block locations.
      3. On the HDFS client's request, the Namenode checks the client's permissions. If the client has permission to read the file, the Namenode shares the block details with the HDFS client; if not, the request is denied.
      4. After getting the list of blocks and the addresses of the Datanodes, the HDFS client starts interacting with the Datanodes to read the blocks.
      5. On completing the read, the client closes the connection.
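      The five steps above can be sketched as a toy simulation. All class and method names here are hypothetical stand-ins, not the real Hadoop API; the point is the division of labour between Namenode and Datanodes:

```python
class NameNode:
    """Toy Namenode: holds file metadata and read permissions."""
    def __init__(self):
        self.metadata = {"/logs/app.log": ["blk_1", "blk_2"]}    # file -> block ids
        self.block_locations = {"blk_1": "datanode-1", "blk_2": "datanode-2"}
        self.readers = {"/logs/app.log": {"alice"}}              # file -> allowed users

    def get_block_locations(self, user, path):
        # Step 3: permission check before any block details are shared.
        if user not in self.readers.get(path, set()):
            raise PermissionError(f"{user} may not read {path}")
        return [(blk, self.block_locations[blk]) for blk in self.metadata[path]]

class HDFSClient:
    """Toy HDFS client: asks the Namenode for locations, reads from Datanodes."""
    def __init__(self, namenode, datanodes):
        self.namenode, self.datanodes = namenode, datanodes

    def read_file(self, user, path):
        locations = self.namenode.get_block_locations(user, path)   # steps 2-3
        # Step 4: the actual bytes come from the Datanodes, not the Namenode.
        return b"".join(self.datanodes[node][blk] for blk, node in locations)

datanodes = {"datanode-1": {"blk_1": b"hello "}, "datanode-2": {"blk_2": b"world"}}
client = HDFSClient(NameNode(), datanodes)
print(client.read_file("alice", "/logs/app.log"))  # b'hello world'
```

      Notice that the Namenode never touches the data itself; it only hands out metadata, which is why it does not become a bottleneck during reads.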

      An API-level view of the read operation in HDFS:
      1. To read a file from HDFS, the client calls the open() method on the DistributedFileSystem object. The DistributedFileSystem object then interacts with the Namenode to get the list of blocks along with the addresses of the Datanodes where these blocks are present.
      2. The Namenode first verifies that the user is authorized. If the user is authorized to read the file, the Namenode returns the list of blocks and, for each block, the best possible Datanode location. To determine the best location, the Namenode sorts the Datanodes by their distance from the client.
      3. On getting the list of blocks with their locations, the client calls the read() method on the FSDataInputStream object, which starts reading the blocks from the Datanodes.
      4. If any failure occurs while reading a block from a Datanode, the DistributedFileSystem object calls the Namenode and asks for the next Datanode location for that block. The Namenode then provides the address of the next best Datanode holding a copy of the block.
      5. When the client finishes reading the blocks, it calls the close() method on the FSDataInputStream object.
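      The failover in step 4 can be sketched like this: each block is replicated on several Datanodes, the locations arrive sorted by proximity, and the reader falls back to the next replica when one fails. A minimal simulation (function and node names are hypothetical):

```python
def read_block(block_id, sorted_datanodes, fetch):
    """Try each replica location in proximity order; fall back on failure."""
    last_error = None
    for node in sorted_datanodes:
        try:
            return fetch(node, block_id)    # read succeeded on this Datanode
        except IOError as err:
            last_error = err                # this replica failed; try the next one
    raise IOError(f"all replicas of {block_id} failed") from last_error

# Simulated cluster: datanode-1 is down, datanode-2 holds a healthy replica.
replicas = {"datanode-2": b"block contents"}

def fetch(node, block_id):
    if node not in replicas:
        raise IOError(f"{node} unreachable")
    return replicas[node]

print(read_block("blk_1", ["datanode-1", "datanode-2"], fetch))  # b'block contents'
```

      Because every block is replicated (three times by default), a single Datanode failure is invisible to the reading application.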

      Follow the link to learn more about: Read Operation in HDFS

    • #6306
      DataFlair Team
      Spectator

      The following steps happen during a read operation in HDFS:
      Step 1: On the client node, a client JVM process uses the open method of the DistributedFileSystem class to ask the Namenode for the location of the required data.
      Step 2: The Namenode checks the user's access rights to the data and, if allowed, returns the block locations of the data.
      Step 3: The client JVM then uses the read method of the FSDataInputStream class to access the data from various Datanodes in parallel.
      Step 4: After getting the data from the Datanodes, the client uses the close method to end the read operation.

      Note: During a read operation, if any Datanode fails, the client asks the Namenode to provide an alternate node for that block of data.
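      The parallel access described in Step 3 can be sketched with a thread pool that fetches each block concurrently and reassembles the results in file order. This is a hypothetical simulation, not the real client:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_block(node_and_data):
    """Simulated Datanode read; in real HDFS this would be a network call."""
    node, data = node_and_data
    return data

def read_in_parallel(block_plan):
    """block_plan: ordered list of (datanode, block_bytes) pairs."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # map() yields results in input order, so blocks reassemble correctly
        # even when the underlying fetches complete out of order.
        return b"".join(pool.map(fetch_block, block_plan))

plan = [("datanode-1", b"part1 "), ("datanode-2", b"part2 "), ("datanode-3", b"part3")]
print(read_in_parallel(plan))  # b'part1 part2 part3'
```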


    • #6308
      DataFlair Team
      Spectator

      Steps involved during the read operation in HDFS

      1. The client submits a request to read a file to the HDFS Client.

      2. The HDFS Client invokes the open() method on the FileSystem object, which is an instance of DistributedFileSystem.

      3. The FileSystem interacts with the NameNode using a Remote Procedure Call (RPC) to get the block location details.

      4. The NameNode checks the access credentials of the HDFS Client.

      4.1. If authorised, the following details are passed back to the calling object:

      4.1.1. The list of blocks

      4.1.2. The addresses of the DataNodes (DN), sorted by proximity to the client

      4.1.3. A security token for authentication when accessing the DataNodes

      4.2. If NOT authorised, the request is denied.

      5. The HDFS Client invokes the read() method of the FSDataInputStream object to read the data DIRECTLY, in parallel, from the various DataNodes. *Refer Notes*

      6. After reaching the end of the blocks, the connection is closed by calling the close() method.

      Notes

      a) All interactions between the HDFS Client and the DataNodes are managed by DFSInputStream.

      b) The read operation is highly optimised, as the client reads data DIRECTLY from the DataNodes.

      c) The NameNode is not involved in the actual data read. This prevents the NameNode from becoming a bottleneck.

      d) Due to the distributed, parallel read mechanism, thousands of clients can read data directly from the DataNodes very efficiently.
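      The proximity sorting mentioned in 4.1.2 can be sketched by ranking replica locations by network distance from the client, in the spirit of Hadoop's rack awareness. The distance values here are illustrative, not the real implementation:

```python
# Toy network distances: 0 = same node, 2 = same rack, 4 = different rack.
def distance(client_location, node_location):
    if client_location == node_location:
        return 0                                   # local read, cheapest
    if client_location[0] == node_location[0]:     # same rack id
        return 2
    return 4                                       # off-rack read, most expensive

def sort_replicas(client_location, replica_locations):
    """Order replica Datanodes from nearest to farthest, as the NameNode does."""
    return sorted(replica_locations, key=lambda loc: distance(client_location, loc))

client = ("rack1", "node2")
replicas = [("rack2", "node7"), ("rack1", "node2"), ("rack1", "node5")]
print(sort_replicas(client, replicas))
# nearest first: [('rack1', 'node2'), ('rack1', 'node5'), ('rack2', 'node7')]
```

      Reading from the nearest replica keeps most traffic within a rack, which is why a client running on a DataNode that holds a replica reads it locally.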

