How is data or a file read in HDFS?

  • Author
    Posts
    • #4882
      DataFlair Team
      Spectator

      Describe the read operation in HDFS.
      How is a file read in HDFS?

    • #4884
      DataFlair Team
      Spectator

      To read data from HDFS, the client first contacts the NameNode for metadata. The client asks the NameNode about the file by name and gets back its location details: the NameNode responds with the number of blocks in the file, the replication factor, and the DataNodes on which each block is stored.
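
      The metadata lookup described above can be pictured with a small, self-contained sketch. This is a toy model, not the real Hadoop code: the classes BlockInfo and ToyNameNode, the host names (dn1, dn2, ...) and the file path are all made up for illustration. It only shows the idea that the NameNode maps a file name to an ordered list of blocks and the DataNodes holding each replica, while holding no file data itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the per-block metadata a NameNode keeps (hypothetical class, not Hadoop's).
class BlockInfo {
    final long blockId;
    final long length;             // number of bytes stored in this block
    final List<String> dataNodes;  // hosts that hold a replica of this block

    BlockInfo(long blockId, long length, List<String> dataNodes) {
        this.blockId = blockId;
        this.length = length;
        this.dataNodes = dataNodes;
    }
}

// Toy NameNode: maps a file path to its ordered list of blocks. It stores no file contents.
class ToyNameNode {
    private final Map<String, List<BlockInfo>> namespace = new LinkedHashMap<>();

    void addFile(String path, List<BlockInfo> blocks) {
        namespace.put(path, blocks);
    }

    // Answers a client's "where are the blocks of this file?" request.
    List<BlockInfo> getBlockLocations(String path) {
        List<BlockInfo> blocks = namespace.get(path);
        if (blocks == null) {
            throw new IllegalArgumentException("File does not exist: " + path);
        }
        return new ArrayList<>(blocks);
    }
}

class MetadataLookupDemo {
    public static void main(String[] args) {
        ToyNameNode nameNode = new ToyNameNode();
        // A two-block file with three replicas per block (assumed values).
        nameNode.addFile("/logs/app.log", Arrays.asList(
            new BlockInfo(1L, 128L * 1024 * 1024, Arrays.asList("dn1", "dn3", "dn4")),
            new BlockInfo(2L, 40L * 1024 * 1024, Arrays.asList("dn2", "dn3", "dn5"))));

        for (BlockInfo b : nameNode.getBlockLocations("/logs/app.log")) {
            System.out.println("block " + b.blockId + " (" + b.length + " bytes) on " + b.dataNodes);
        }
    }
}
```

      The replication factor shows up here simply as the number of hosts listed for each block.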

      The client then communicates directly with the DataNodes where the blocks are actually stored, using the information received from the NameNode; the file data itself never passes through the NameNode. A single input stream reads the blocks one after another, while separate clients or tasks can read different blocks in parallel. Once the client or application has received all the blocks of the file, it combines them to form the file.
      To improve read performance, the locations of each block are ordered by their distance from the client, and HDFS selects the replica closest to the client. This reduces read latency and bandwidth consumption. The client first tries a replica on the same node, then one on another node in the same rack, and finally one on a DataNode in another rack.
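
      The replica ordering above can be sketched as a simple sort. This is an illustrative model only: the class names ClusterNode and ReplicaOrdering, the host and rack names, and the distance values 0, 2 and 4 (same node, same rack, different rack) are assumptions chosen for the example, and the real network topology logic in Hadoop is more general.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical node description: a host name plus the rack it sits in.
class ClusterNode {
    final String host;
    final String rack;

    ClusterNode(String host, String rack) {
        this.host = host;
        this.rack = rack;
    }
}

class ReplicaOrdering {
    // Simplified distance (assumed values): 0 = same node, 2 = same rack, 4 = different rack.
    static int distance(ClusterNode client, ClusterNode replica) {
        if (client.host.equals(replica.host)) return 0;
        if (client.rack.equals(replica.rack)) return 2;
        return 4;
    }

    // Returns a new list with the replica closest to the client first.
    static List<ClusterNode> sortByDistance(ClusterNode client, List<ClusterNode> replicas) {
        List<ClusterNode> sorted = new ArrayList<>(replicas);
        sorted.sort(Comparator.comparingInt((ClusterNode r) -> distance(client, r)));
        return sorted;
    }

    public static void main(String[] args) {
        ClusterNode client = new ClusterNode("dn2", "rackA");
        List<ClusterNode> replicas = Arrays.asList(
            new ClusterNode("dn9", "rackB"),   // different rack
            new ClusterNode("dn1", "rackA"),   // same rack as the client
            new ClusterNode("dn2", "rackA"));  // same node as the client

        for (ClusterNode r : sortByDistance(client, replicas)) {
            System.out.println(r.host + " distance=" + distance(client, r));
        }
        // prints dn2 (distance 0), then dn1 (distance 2), then dn9 (distance 4)
    }
}
```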

    • #4885
      DataFlair Team
      Spectator

      The HDFS NameNode holds all the file information, including the addresses of the slave nodes (DataNodes) on which the file's blocks are stored.
      When a client tries to read a file, it requests the file information from the NameNode through a DistributedFileSystem instance. The NameNode checks that the user has access and that the file exists; if everything is OK, it issues a token to the client, which the client uses to authenticate itself when fetching the file from the DataNodes. The NameNode also provides a list of all the block details of the file and, for each block, a list of the DataNodes holding it. The DataNodes are sorted according to their proximity to the client.

      DistributedFileSystem returns an FSDataInputStream to the client, which is an input stream the client can read data from. FSDataInputStream wraps a DFSInputStream, which manages the DataNode and NameNode I/O.

      The client calls read() on the stream. DFSInputStream, which holds the DataNode locations for the file's blocks, connects to the closest DataNode for the first block, and data is streamed back to the client. The client calls read() repeatedly until the end of the first block is reached; DFSInputStream then closes the connection to that DataNode and finds the best DataNode for the next block. This continues until the complete file has been read.
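
      The block-by-block behaviour of read() can be sketched with plain Java streams. This is a toy stand-in, not the real DFSInputStream: ToyBlockStream and BlockReadDemo are hypothetical names, and each byte array plays the role of one block served by its closest DataNode. The point is only that the caller sees one continuous stream while the stream itself opens and closes one "connection" per block.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Toy block-aware stream: each byte[] stands for one block on its closest DataNode.
class ToyBlockStream extends InputStream {
    private final List<byte[]> blocks;
    private int nextBlock = 0;
    private InputStream current = null; // "connection" to the DataNode of the current block

    ToyBlockStream(List<byte[]> blocks) {
        this.blocks = blocks;
    }

    @Override
    public int read() throws IOException {
        byte[] one = new byte[1];
        int n = read(one, 0, 1);
        return n == -1 ? -1 : one[0] & 0xFF;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) return 0;
        while (true) {
            if (current == null) {
                if (nextBlock == blocks.size()) return -1; // the whole file has been read
                current = new ByteArrayInputStream(blocks.get(nextBlock++)); // open the next block
            }
            int n = current.read(buf, off, len);
            if (n > 0) return n;
            current.close(); // end of this block: drop the connection and move on
            current = null;
        }
    }

    @Override
    public void close() throws IOException {
        if (current != null) {
            current.close();
            current = null;
        }
    }
}

class BlockReadDemo {
    // Calls read() repeatedly until the stream reports end of file.
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4]; // small buffer so several reads are needed per block
        int n;
        while ((n = in.read(buf, 0, buf.length)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        List<byte[]> blocks = Arrays.asList(
            "hello ".getBytes(StandardCharsets.UTF_8),
            "hdfs ".getBytes(StandardCharsets.UTF_8),
            "read".getBytes(StandardCharsets.UTF_8));
        // try-with-resources closes the stream once the file has been read
        try (InputStream in = new ToyBlockStream(blocks)) {
            System.out.println(new String(readFully(in), StandardCharsets.UTF_8));
            // prints "hello hdfs read"
        }
    }
}
```

      The sketch always succeeds on the first replica. A real client would also need to handle a failed DataNode by trying the next replica in the sorted list, which is left out here.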

      When the entire file has been read, the client calls close() on the FSDataInputStream to close the connection.

      For more details, please refer to: HDFS Read-Write operations
