How is data or a file read in HDFS?

This topic has 2 replies, 1 voice, and was last updated 5 years, 7 months ago by DataFlair Team.

Viewing 2 reply threads

Author

Posts
- September 20, 2018 at 12:37 pm #4882
  
  DataFlair Team
  Spectator
  
  Describe the read operation in HDFS.
  How is a file read in HDFS?
- September 20, 2018 at 12:37 pm #4884
  
  DataFlair Team
  Spectator
  
  To read data from HDFS, the client first contacts the NameNode for metadata. The client asks for the file by name, and the NameNode responds with the list of blocks that make up the file, the replication factor, and the DataNodes on which each block is stored.
  
  The client then communicates directly with the DataNodes where the blocks are actually stored. Using the information received from the NameNode, it reads the blocks from those DataNodes; a single input stream reads the blocks one after another, while separate clients or tasks can read different blocks in parallel. Once the client or application has received all the blocks of the file, it combines them to form the file.
  To improve read performance, the locations of each block are ordered by their distance from the client, and HDFS selects the replica closest to the client. This reduces read latency and bandwidth consumption. The client prefers a replica on its own node, then one on another node in the same rack, and finally one on a DataNode in a different rack.
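
The closest-replica ordering described in this reply can be sketched in a few lines of Python. This is only an illustrative model, not HDFS code: the path-style locations such as `/d1/r1/n1` (data center, rack, node) and the helper names `distance` and `order_replicas` are assumptions made for the example. Distance is counted as the number of hops from each node up to their closest common ancestor, so the same node scores 0, the same rack 2, and a different rack 4.

```python
from typing import List


def distance(a: str, b: str) -> int:
    """Hops from each location up to their closest common ancestor."""
    pa = a.strip("/").split("/")
    pb = b.strip("/").split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)


def order_replicas(client: str, replicas: List[str]) -> List[str]:
    """Return replica locations sorted closest-first for the given client.

    Ties keep their input order in this sketch.
    """
    return sorted(replicas, key=lambda r: distance(client, r))


if __name__ == "__main__":
    client = "/d1/r1/n1"  # hypothetical client running on node n1
    replicas = ["/d1/r2/n3", "/d1/r1/n2", "/d1/r1/n1"]
    for location in order_replicas(client, replicas):
        print(location, distance(client, location))
```

With a client on `/d1/r1/n1`, the local replica sorts first, the same-rack replica second, and the off-rack replica last, which matches the preference order given in the reply.
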
- September 20, 2018 at 12:37 pm #4885
  
  DataFlair Team
  Spectator
  
  The HDFS NameNode holds the metadata for every file, including the addresses of the slave nodes (DataNodes) where the file's blocks are stored.
  When a client tries to read a file, it requests the file information from the NameNode through a DistributedFileSystem instance. The NameNode checks that the file exists and that the user has access to it; if everything is OK, it provides a token to the client, which is used for authentication when fetching the file from the DataNodes. The NameNode also provides the list of the file's blocks and, for each block, the list of DataNodes that hold it. The DataNodes are sorted according to their proximity to the client.
  
  DistributedFileSystem returns an FSDataInputStream to the client, and the client reads data from it. The FSDataInputStream in turn wraps a DFSInputStream, which manages the DataNode and NameNode I/O.
  
  The client calls read() on the stream. DFSInputStream, which holds the DataNode addresses for the file's blocks, connects to the closest DataNode for the first block, and data is streamed back to the client. The client calls read() repeatedly until the end of the first block is reached; DFSInputStream then closes the connection to that DataNode and finds the best DataNode for the next block. This continues until the complete file has been read.
  
  When the entire file has been read, the client calls close() on the FSDataInputStream to close the connection.
  
  For more details, please refer to: HDFS Read-Write operations
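
The block-by-block read loop described in the reply above can be modelled with a small, self-contained Python sketch. It is a toy simulation under stated assumptions, not the Hadoop client: `DATANODES`, `BLOCK_LOCATIONS`, `ToyInputStream`, and `read_file` are hypothetical names, the "cluster" is an in-memory dictionary, and the block list is assumed to arrive from the NameNode already sorted closest-first.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory cluster: DataNode name -> {block id -> bytes}.
DATANODES: Dict[str, Dict[str, bytes]] = {
    "dn1": {"blk_1": b"hello ", "blk_2": b"hdfs "},
    "dn2": {"blk_1": b"hello ", "blk_3": b"read"},
    "dn3": {"blk_2": b"hdfs ", "blk_3": b"read"},
}

# Assumed NameNode answer: blocks in file order, each with the DataNodes
# that hold it, already sorted closest-first.
BLOCK_LOCATIONS: List[Tuple[str, List[str]]] = [
    ("blk_1", ["dn1", "dn2"]),
    ("blk_2", ["dn3", "dn1"]),
    ("blk_3", ["dn2", "dn3"]),
]


class ToyInputStream:
    """Reads a file one block at a time from the closest DataNode."""

    def __init__(self, block_locations: List[Tuple[str, List[str]]]) -> None:
        self._blocks = list(block_locations)
        self._index = 0    # current block
        self._offset = 0   # position inside the current block
        self._closed = False
        self.connections: List[str] = []  # DataNodes contacted, in order

    def read(self, size: int) -> bytes:
        """Return up to `size` bytes, or b"" at end of file."""
        if self._closed:
            raise ValueError("stream is closed")
        if size <= 0:
            raise ValueError("size must be positive")
        while self._index < len(self._blocks):
            block_id, nodes = self._blocks[self._index]
            closest = nodes[0]
            if self._offset == 0:
                self.connections.append(closest)  # "connect" for this block
            data = DATANODES[closest][block_id]
            chunk = data[self._offset:self._offset + size]
            if chunk:
                self._offset += len(chunk)
                return chunk
            # End of this block: move on to the next one.
            self._index += 1
            self._offset = 0
        return b""

    def close(self) -> None:
        self._closed = True


def read_file(block_locations: List[Tuple[str, List[str]]],
              chunk_size: int = 4) -> bytes:
    """Call read() repeatedly until end of file, then close the stream."""
    stream = ToyInputStream(block_locations)
    parts = []
    try:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            parts.append(chunk)
    finally:
        stream.close()
    return b"".join(parts)


if __name__ == "__main__":
    print(read_file(BLOCK_LOCATIONS))
```

The caller only ever sees one continuous stream. Switching from one DataNode to the next at each block boundary happens inside `read()`, which mirrors the role the reply assigns to DFSInputStream.
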