How is data or a file written into HDFS?

    • #5313
      DataFlair Team
      Spectator

      Describe the write operation in HDFS.
      How do we write data or files into HDFS?

    • #5314
      DataFlair Team
      Spectator

      When a client wants to write a file to HDFS, it first contacts the NameNode for metadata. The NameNode responds with the details of the DataNodes on which the client should write the data. Based on this information from the NameNode, the client then writes the data directly to the DataNodes, and the DataNodes form a data write pipeline.

      Following are the steps of the data write operation (a minimal client-side code sketch follows the list):

      1) The client first sends a create request through the DistributedFileSystem API.

      2) DistributedFileSystem then makes an RPC call to the NameNode to create a new file in the filesystem’s namespace.
      The NameNode performs various checks to make sure that the file does not already exist and that the client has permission to create it. Only when these checks pass does the NameNode make a record of the new file; otherwise, file creation fails and the client is thrown an IOException.

      3) DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to. As the client writes data, DFSOutputStream splits it into packets and writes them to an internal queue known as the data queue. The DataStreamer consumes the data queue and asks the NameNode to allocate new blocks.

      4) The list of DataNodes forms a pipeline. Assuming a replication level of 3, there will be 3 nodes in the pipeline. The DataStreamer streams the packets to the first DataNode in the pipeline, which stores each packet and passes it to the second DataNode. In the same way, the second DataNode stores the packet and passes it to the last (third) DataNode in the pipeline.

      5) DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by the DataNodes, called the ack queue. A packet is removed from the ack queue only once it has been acknowledged by the DataNodes in the pipeline. The DataNodes send the acknowledgment once the required replicas are created (3 by default). In the same way, all blocks are stored and replicated on the different DataNodes.

      6) The client calls close() on the stream when it has finished writing data.

      7) The DataNodes then send a write confirmation to the NameNode and to the client. The same process repeats for each block of the file, and data transfer happens in parallel for faster writing of blocks.
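
      Below is a minimal client-side sketch of steps 1–7, assuming a reachable cluster configured through fs.defaultFS and a hypothetical destination path; the packet, data-queue and pipeline details happen inside DFSOutputStream and are not visible to this code.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FSDataOutputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class HdfsWriteExample {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();               // reads core-site.xml / hdfs-site.xml
              FileSystem fs = FileSystem.get(conf);                   // DistributedFileSystem when fs.defaultFS is hdfs://
              Path file = new Path("/user/dataflair/sample.txt");     // hypothetical path

              // Steps 1-3: create() triggers the RPC to the NameNode and returns
              // an FSDataOutputStream backed by DFSOutputStream.
              FSDataOutputStream out = fs.create(file);

              // The stream splits written bytes into packets, queues them in the
              // data queue, and the DataStreamer sends them down the DataNode pipeline.
              out.writeUTF("Hello HDFS write pipeline");

              // Steps 6-7: close() flushes the remaining packets, waits for the acks,
              // and signals the NameNode that the file is complete.
              out.close();
              fs.close();
          }
      }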

      Follow the link to learn more about Write operation in Hadoop

    • #5316
      DataFlair Team
      Spectator

      One more point I want to append to Santosh’s answer (for the 4th point): a packet contains both the checksum data and the actual data for a particular block.

      The checksum is calculated by the client and is used to check whether a data block retrieved from a DataNode (mainly while reading data) is corrupted or not. To calculate the checksum value, the client in general computes a CRC over each chunk of data in a particular block (512 bytes per chunk by default).

      For each block the client computes checksum values. This checksum data plus the actual data is stored on the slave node. If you look at the HDFS block storage location on a slave node, you will see two files created for each block: for example, blk_1073742033_1214.meta contains the checksum values of that particular block, while blk_1073742033 stores the actual data. ASF provides a checksum class for this, org.apache.hadoop.util.DataChecksum, which implements the Checksum interface.
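
      As an illustration of the idea (not the exact HDFS implementation, which uses org.apache.hadoop.util.DataChecksum with CRC32C by default), here is a sketch that computes one CRC value per 512-byte chunk of a block, mirroring the dfs.bytes-per-checksum behaviour; conceptually, these per-chunk values are what end up in the blk_*.meta file next to the block file.

      import java.util.zip.CRC32;

      public class ChunkChecksumSketch {
          // 512 bytes is the default dfs.bytes-per-checksum chunk size.
          private static final int BYTES_PER_CHECKSUM = 512;

          // Returns one CRC value per 512-byte chunk of the block data.
          public static long[] checksums(byte[] blockData) {
              int chunks = (blockData.length + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
              long[] sums = new long[chunks];
              CRC32 crc = new CRC32();
              for (int i = 0; i < chunks; i++) {
                  int offset = i * BYTES_PER_CHECKSUM;
                  int len = Math.min(BYTES_PER_CHECKSUM, blockData.length - offset);
                  crc.reset();
                  crc.update(blockData, offset, len);
                  sums[i] = crc.getValue();   // re-verified when a reader fetches the block
              }
              return sums;
          }
      }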

      To learn in detail follow: Write operation in Hadoop

    • #5320
      DataFlair Team
      Spectator

      The client needs to interact with the NameNode for any write operation in HDFS; the NameNode provides the addresses of the slave nodes.

      1. The client node’s JVM sends a create() request to the FileSystem API, which in turn sends the create request to the NameNode.

      2. After validating the client’s access rights for the write operation, the NameNode provides the metadata about the slave nodes to which the data can be written.

      3. The client node now sends the write request directly to the OutputStream, which writes directly to the slave nodes.

      4. The data block is now copied to the first DataNode, then to the second DataNode, and then written to the third. Once the block write completes, the acknowledgment is sent back to the client in reverse order (DNn…DN3, DN2, DN1).

      5. The number of copies depends on the replication factor, and during the write operation all the DataNodes stay in contact with the master node (see the sketch after this list for setting the replication factor at create time).

      6. Also, the writes of the blocks for a single copy of the data are executed in parallel across the DataNodes, handled by the Hadoop client (HDFS setup).

      7. Finally, a close() request is issued by the HDFS client once all the actions are completed.
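
      As a sketch of how the replication factor from step 5 can be controlled by the client, the FileSystem.create() overload below accepts an explicit replication count and block size (the path and the values used here are just placeholders):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FSDataOutputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class ReplicatedWriteSketch {
          public static void main(String[] args) throws Exception {
              FileSystem fs = FileSystem.get(new Configuration());
              Path file = new Path("/user/dataflair/replicated.txt");   // placeholder path

              // create(path, overwrite, bufferSize, replication, blockSize):
              // 3 replicas and a 128 MB block size, so each block is pipelined
              // through three DataNodes as described in steps 4 and 5.
              FSDataOutputStream out = fs.create(file, true, 4096, (short) 3, 128L * 1024 * 1024);
              out.writeUTF("data written with an explicit replication factor");
              out.close();
              fs.close();
          }
      }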

      Follow the link to learn more about Write operation in Hadoop
