What is HDFS – Hadoop Distributed File System?

    • #4819
      DataFlair Team
      Spectator

      What is HDFS used for?
      What is Hadoop Distributed File System and what are its components?
      What is NameNode and DataNode in HDFS?
      Why does Hadoop use a file system for storage?

    • #4820
      DataFlair Team
      Spectator

      HDFS stands for Hadoop Distributed File System.
      It follows a distributed file system design and runs on commodity, low-cost hardware. It is fault tolerant.

      HDFS is one of the most reliable data storage systems.
      It holds very large amounts of data and provides easy access. To store such huge data, files are spread across multiple machines.

      Components of HDFS

      1. NameNode – It is also known as the Master node. The NameNode stores metadata, i.e. the number of data blocks, their locations, replicas, and other details.
      Tasks of the NameNode

      • Manages the file system namespace.
      • Regulates clients' access to files.
      • Executes file system operations such as naming, opening, and closing files and directories.

      2. DataNode – It is also known as the Slave. In Hadoop HDFS, the DataNode is responsible for storing the actual data. The DataNode performs read and write operations as per client requests in HDFS.
      Tasks of the DataNode

      • Creates, deletes, and replicates block replicas according to the instructions of the NameNode.
      • Manages the data storage of the system.
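
      To make the NameNode/DataNode interaction above concrete, here is a minimal sketch using the standard Hadoop Java FileSystem API. It is only an illustration: the hdfs://localhost:9000 URI and the /user/hdadmin/sample.txt path are placeholder assumptions, not values from this thread.

      	import java.net.URI;
      	import org.apache.hadoop.conf.Configuration;
      	import org.apache.hadoop.fs.FSDataInputStream;
      	import org.apache.hadoop.fs.FSDataOutputStream;
      	import org.apache.hadoop.fs.FileSystem;
      	import org.apache.hadoop.fs.Path;

      	public class HdfsReadWrite {
      	    public static void main(String[] args) throws Exception {
      	        Configuration conf = new Configuration();
      	        // The client asks the NameNode for metadata only; the file
      	        // bytes themselves flow between the client and the DataNodes.
      	        // hdfs://localhost:9000 is an assumed NameNode address.
      	        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

      	        Path file = new Path("/user/hdadmin/sample.txt"); // assumed path
      	        try (FSDataOutputStream out = fs.create(file)) {
      	            out.writeUTF("hello hdfs");   // write goes to a DataNode pipeline
      	        }
      	        try (FSDataInputStream in = fs.open(file)) {
      	            System.out.println(in.readUTF()); // read comes from a DataNode
      	        }
      	        fs.close();
      	    }
      	}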

      Features of HDFS:

      1. It is suitable for distributed storage and processing.

      2. HDFS provides file permissions and authentication.

      3. The HDFS client divides a file into separate blocks before storing it.

      To learn more about HDFS, please follow: HDFS Tutorial

    • #4823
      DataFlair Team
      Spectator

      1) What is HDFS used for?

      Hadoop Distributed File System (HDFS) is used for storing structured and unstructured data in a distributed manner using commodity hardware.

      2) What is Hadoop Distributed File System and what are its components?

      Hadoop HDFS is a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.

      Components of HDFS:
      HDFS comprises three important components: NameNode, DataNode, and Secondary NameNode.
      HDFS operates on a master-slave architecture model, where the NameNode acts as the master node that keeps track of the storage cluster, and the DataNodes act as slave nodes, making up the various systems within a Hadoop cluster.

      3) What is NameNode and DataNode in HDFS?
      The NameNode is the master and the DataNodes are slaves.
      The NameNode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree.
      DataNodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the NameNode), and they report back to the NameNode periodically with lists of the blocks they are storing. Without the NameNode, the filesystem cannot be used.
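
      To illustrate the metadata the NameNode keeps, the Hadoop Java API can ask which DataNodes hold each block of a file. A minimal sketch, assuming an existing file at /user/hdadmin/big.log and a NameNode at hdfs://localhost:9000 (both placeholders):

      	import java.net.URI;
      	import org.apache.hadoop.conf.Configuration;
      	import org.apache.hadoop.fs.BlockLocation;
      	import org.apache.hadoop.fs.FileStatus;
      	import org.apache.hadoop.fs.FileSystem;
      	import org.apache.hadoop.fs.Path;

      	public class ShowBlockLocations {
      	    public static void main(String[] args) throws Exception {
      	        FileSystem fs = FileSystem.get(
      	                URI.create("hdfs://localhost:9000"), new Configuration());
      	        FileStatus status = fs.getFileStatus(new Path("/user/hdadmin/big.log"));
      	        // The NameNode answers from its in-memory metadata, which is
      	        // built up from the periodic block reports sent by the DataNodes.
      	        BlockLocation[] blocks =
      	                fs.getFileBlockLocations(status, 0, status.getLen());
      	        for (BlockLocation b : blocks) {
      	            System.out.println("offset " + b.getOffset()
      	                    + " length " + b.getLength()
      	                    + " hosts " + String.join(",", b.getHosts()));
      	        }
      	    }
      	}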

      4) Why does Hadoop use a file system for storage?
      HDFS is built to support applications with large data sets, including individual files that reach into the terabytes. A distributed file system on commodity hardware is an affordable way to handle such huge amounts of data.

      Follow the link to learn more about HDFS in Hadoop

    • #4824
      DataFlair Team
      Spectator

      1) HDFS is the storage layer of Hadoop.
      2) HDFS is a distributed file system (data is stored at the application level) which can store a very large number of files across a cluster of machines.

      HDFS has two types of nodes:

      1) Master node: on this node, the namenode daemon runs in the background, supporting the master node's tasks.
      2) Slave node(s): on these nodes, the datanode daemon runs in the background, supporting the slave nodes' tasks.

      Both types of nodes run an HDFS component.

      The namenode stores metadata about all the datanodes in the master node's HDFS component.
      The datanode stores the actual data in the slave node's HDFS component. "Actual data" does not mean the file itself is stored unchanged on a datanode: the file is first divided into data blocks, and these blocks are stored across the cluster of machines. In Hadoop 2.x, the default block size is 128 MB; in Hadoop 1.x, it is 64 MB.
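
      As a quick worked example of that division (the 300 MB file size here is just an assumption for illustration), a 300 MB file with the Hadoop 2.x default block size of 128 MB is split into three blocks: 128 MB, 128 MB, and 44 MB. The last block occupies only its actual 44 MB on disk; HDFS does not pad blocks to the full block size.

      	public class BlockCount {
      	    public static void main(String[] args) {
      	        long fileSize  = 300L * 1024 * 1024;  // assumed 300 MB file
      	        long blockSize = 128L * 1024 * 1024;  // Hadoop 2.x default block size
      	        long fullBlocks = fileSize / blockSize;            // 2 full blocks
      	        long lastBlock  = fileSize % blockSize;            // 44 MB remainder
      	        long total = fullBlocks + (lastBlock > 0 ? 1 : 0); // 3 blocks in all
      	        System.out.println(total + " blocks, last block "
      	                + (lastBlock / (1024 * 1024)) + " MB");
      	    }
      	}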

      You can go to the locations where the metadata and block information are stored in the Master and Slave(s) HDFS components.

      In core-site.xml, we have added the hadoop.tmp.dir parameter. What does this indicate?
      That parameter indicates where the HDFS file system path is located on the Master/Slave nodes. We have given it the value /home/hdadmin/hdata. This path is the HDFS file system location; inside it, directories will be created for the namenode, the datanode, and the secondary namenode.

      • All metadata information is stored in the namenode directory.
      • The actual data is stored in the datanode directory.

      When you look at the output of the ls command below, there are three directories: one for the datanode (data), one for the namenode (name), and one for the secondary namenode (namesecondary).
      You can go through these directories and see the blocks, fsimage, edit logs, etc.

      • In /hdata/dfs/name/current, the metadata files are stored (on the master node).
      • In the /hdata/dfs/data/current/BP-1940002228-127.0.1.1-1495701506114/current/finalized/subdir0/subdir0 path, the actual data/block files are created (on the slave node(s)).
      core-site.xml:
      	<configuration>
      	<property>
      	<name>hadoop.tmp.dir</name>
      	<value>/home/hdadmin/hdata</value>
      	</property>
      	</configuration>
      
      	hdadmin@ubuntu:~/hdata/dfs$ ls
      	data  name  namesecondary
      
      If you want to configure the datanode and namenode directories separately, you can use the parameters below.
      
      Below is for the datanode:
      	<property>
      	<name>dfs.datanode.data.dir</name>
      	<value>/home/hdadmin/data1</value>
      	</property>

      Below is for the namenode:
      	<property>
      	<name>dfs.namenode.name.dir</name>
      	<value>/home/hdadmin/name1</value>
      	</property>

      You can add these parameters in the hdfs-site.xml file.

      3) We can deploy HDFS on commodity hardware.
      4) HDFS is designed for storing a small number of large files. We should store very large files in Hadoop to get better performance in terms of disk seeks, namenode memory usage, etc. The namenode holds all the file system metadata in memory and stores far less data than the datanodes, so it cannot simply be scaled out once Hadoop is installed. That means we have to keep the amount of metadata stored on the master node (namenode) as small as possible, as the estimate below illustrates.
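
      To see why many small files hurt the namenode, a commonly cited rule of thumb is that each file, directory, or block consumes roughly 150 bytes of namenode heap (a rough estimate that varies by Hadoop version, not a figure from this thread). A sketch of the arithmetic:

      	public class NamenodeHeapEstimate {
      	    public static void main(String[] args) {
      	        long files = 10_000_000L;        // assumed: 10 million small files
      	        long objects = files * 2;        // each file has at least one block
      	        long bytesPerObject = 150;       // rough rule-of-thumb estimate
      	        long heapBytes = objects * bytesPerObject;
      	        System.out.println(heapBytes / (1024 * 1024) + " MB of namenode heap");
      	        // ~2.8 GB of heap just for metadata; fewer, larger files keep this small.
      	    }
      	}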

      5) HDFS provides fault tolerance, distributed storage, high availability, data reliability, high throughput, etc.
      6) It can store multiple copies of data on different machines. By default, the replication factor is 3, which means three copies of the same block are kept on different slaves. As said above, HDFS is fault tolerant; replication is how that fault tolerance is achieved.
      7) We can perform read and write operations on HDFS. The read-write operations are performed directly by the client against the slave node(s), based on the information provided by the master (namenode). The client writes only one copy of a particular data block to a datanode. Once the block is completely written, that datanode starts copying the block to another datanode, and this process continues until the desired number of replicas of the block exists on different datanodes. Duplicate copies are never created on the same datanode; each datanode creates the replica on a different node. Which datanode a particular block is replicated to is decided by the master node alone; the master and slave nodes communicate through block reports. In the same way, the client reads data directly from the slave node(s), based on the information provided by the master (namenode).
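
      For illustration, the replication factor can also be changed per file through the Java API; the client still writes one copy and the datanodes pipeline the rest. A minimal sketch, assuming fs.defaultFS in the loaded configuration points at the cluster and that /user/hdadmin/sample.txt exists (both assumptions):

      	import org.apache.hadoop.conf.Configuration;
      	import org.apache.hadoop.fs.FileSystem;
      	import org.apache.hadoop.fs.Path;

      	public class SetReplication {
      	    public static void main(String[] args) throws Exception {
      	        FileSystem fs = FileSystem.get(new Configuration());
      	        Path file = new Path("/user/hdadmin/sample.txt"); // assumed path
      	        // Ask the namenode to keep 2 replicas of every block of this file;
      	        // the namenode alone decides which datanodes hold the replicas.
      	        fs.setReplication(file, (short) 2);
      	        System.out.println("replication now "
      	                + fs.getFileStatus(file).getReplication());
      	    }
      	}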

      Normal file system: data is stored at the kernel level, and the block size may be 1 KB or 4 KB.
      Hadoop Distributed File System: data is stored at the application level, in a distributed fashion across the cluster of nodes.

      Both store data as blocks. In normal file systems, the block size may be 1 KB or 4 KB. In HDFS, the block size can be 64 MB, 128 MB, 256 MB, etc.

      When you read 1 TB of data in a normal file system, the read happens internally block by block. At any moment you are accessing a single block of 4 KB, so reading the complete data needs many disk seeks, which reduces the performance of the system.

      When you read 1 TB of data in the Hadoop Distributed File System, blocks are read in parallel. At any moment you are accessing a block of 64 MB or 128 MB, so reading the complete data needs far fewer disk seeks, which gives much better performance, as the quick comparison below shows.
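
      A quick back-of-the-envelope comparison of the two cases above (1 TB read, 4 KB local blocks versus 128 MB HDFS blocks):

      	public class SeekComparison {
      	    public static void main(String[] args) {
      	        long oneTb = 1024L * 1024 * 1024 * 1024;
      	        long localBlock = 4L * 1024;          // typical local FS block, 4 KB
      	        long hdfsBlock  = 128L * 1024 * 1024; // HDFS block, 128 MB
      	        System.out.println("local FS blocks: " + oneTb / localBlock); // 268,435,456
      	        System.out.println("HDFS blocks:     " + oneTb / hdfsBlock);  // 8,192
      	        // Far fewer blocks means far fewer disk seeks, and HDFS can
      	        // read blocks from different datanodes in parallel.
      	    }
      	}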

      For more detail, follow: HDFS in Hadoop
