September 20, 2018 at 12:21 pm #4812
What is HDFS – Hadoop Distributed File System?
Is it similar to the normal filesystem?September 20, 2018 at 12:21 pm #4814
HDFS is the primary storage used by Hadoop applications, which is more reliable, Fault Tolerant. HDFS is a file system
of Hadoop which is designed for storing large files. It stores data reliably even in the case of hardware failure. It stores a huge number of smaller Data Blocks of the file (actually file is divided into blocks) which would help for processing data distributedly.
HDFS is a distributed file system that provides high-performance access to data across Hadoop cluster. HDFS is deployed on low-cost commodity hardware. The difference between HDFS and the normal file system is:
- Block size of HDFS is 128 MB whereas normal file system it’s only 4 KB (depends on operating system)
- The data is stored in multiple machines rather than single system.
To learn more about HDFS please follow: HDFS TutorialSeptember 20, 2018 at 12:21 pm #4816
1) HDFS is the storage of layer of Hadoop
2) HDFS is a distributed file system (Data is stored at Application level) which can store very large number of files that are running on cluster of machines.
HDFS has two nodes
1) Master node: In this node, namenode daemon is running in the background to support master node/non-daemon
2) Slave node(s): In this node, datanode daemon is running in the background to support slave node/non-daemon tasks.
In both nodes, we are having HDFS component.
Namenode stores metadata information of all the datanodes in Master node HDFS component.
Datanode stores the actual data in Slave(s) node HDFS component. Actual data means — it doesn’t stored the actual file into datanode. Firstly the file is divided into Data Blocks and these blocks and there blocks are stored across the cluster of machine(s). In Hadoop 2.x , the default block size is 128 MB. In hadoop 1.x, the default block size is 64MB.
You can go to the location where metadata and blocks information are stored on Master and Slave(s) HDFS component.
In core-site.xml, we have added hadoop.tmp.dir parameter. What does this indicates?.
That parameter indicates that where the HDFS file system path is available on Master/Slave nodes. In this parameter, we have added a value, that is /home/hdamin/hadata. This path indicates that it is a HDFS file system location. In this path, two directories will created, one is for namenode, another is for datanode.
- All metadata information is stored in namenode directory.
- The actual data is stored is in the datanode directory.
When you look at the below output of ls command, we have three directories, one is for datanode (data), another is for namenode(name), and one more is for secondary name node.
You can go through the directories and see the blocks, fsimage, editlogs…etc.
- In /hdata/dfs/name/current, metadata information files are stored(On master node).
- In /hdata/dfs/data/current/BP-1940002228-127.0.1.1-1495701506114/current/finalized/subdir0/subdir0 path, actual data/blocks files got created(On slave node(s)).
core-site.xml: <property> <name>hadoop.tmp.dir</name> <value>/home/hdadmin/hdata</value> </property> </configuration> hdadmin@ubuntu:~/hdata/dfs$ ls data name namesecondary If you want to configure both datanode and namenode in different paths, then you can use below parameter names. Below is for datanode: dfs.datanode.data.dir /home/hdadmin/data1 Below is for namenode: dfs.namenode.name.dir /home/hdadmin/name1
You can added these parameter names in <stronghdfs-site.xml file</strong.
3) We can deploy HDFS on commodity hardware.
4) HDFS is designed for storing less number of large files. We have to store very large size files in hadoop in order to get better performance in terms disk seeks, namenode disk usage, …etc. As we know that namenode should not be scalable once we have installed the hadoop because namenode contains less data when compared to datanode. Namenode works at memory level data. That means we have to store data(metadata) as much as less in Master node(namenode).
5) HDFS provides Fault Tolerant, distributed storage, high availability,data reliability, high throughput…etc
6) It can store multiple copies of data on different machines. By default the replication factor is 3. That means three copies of same block will be available on different slaves. As i said above HDFS provided fault tolerant. With the help of replication, we achieve the fault tolerant.
7) We can perform read and write operation on HDFS. In Hadoop, the Read-Write operations can be performed directly by the client in order to perform read/write data from/to HDFS. Client can directly write the data on slave node(s) based on information provided by Master (namenode). Client can write only one copy of a particular data/block on datanode. Once the block is completely written, then that datanode start copying this block to another datanode. This process is continued until we achieved desired replication block on different data nodes. Duplicate copies won’t be created on the same datanode, That particular datanode will create a replica block on different nodes. On which datanode this particular block need to be replicated will be decided by Master node only. There will communication available between Master and Slave nodes via “Block report form”. In the same way, client can directly read the data from slave node(s) based on information provided by Master(Namenode).
Normal file system: Data is stored at the kernel level. In this file system, it may be 1KB or 4KB.
In Hadoop Distributed File system: Data is stored at application level in a distributed fashion across the cluster of nodes
Both will store the data as blocks. In normal file systems, the block size may be 1KB or 4KB size. But in HDFS, the blocks size can be 64MB,128MB, 256MB..etc.
When you are reading 1TB of data in the normal file system, internally block by block reading is happened. In this case, at a time it is accessing a block which is of 4KB, so it needs more disk seek time to read complete data which reduce the performance of the system.
When you are reading 1TB of data in hadoop distributed file system, here reading a block(s) is happened in parallel. In this case at at time we are accessing the block which is of 64MB/128MB, So it needs less disk seek time to read the complete data which shows a better performance.
For more detail follow: HDFS in Hadoop
You must be logged in to reply to this topic.