These 50+ Hadoop HDFS Interview Questions and Answers are from different components of HDFS. If you want to become a Hadoop Admin or Hadoop developer, then DataFlair is an appropriate place.
We were fully alert while framing these questions. Do comment your thoughts in comment section below.
Frequently Asked HDFS Interview Questions And Answers
1) What is Hadoop HDFS – Hadoop Distributed File System?
Hadoop distributed file system-HDFS is the primary storage system of Hadoop. HDFS stores very large files running on a cluster of commodity hardware. It works on the principle of storage of less number of large files rather than the huge number of small files. HDFS stores data reliably even in the case of hardware failure. It also provides high throughput access to the application by accessing in parallel.
Components of HDFS:
- NameNode – It works as Master in Hadoop cluster. Namenode stores meta-data i.e. number of blocks, replicas and other details. Meta-data is present in memory in the master to provide faster retrieval of data. NameNode maintains and also manages the slave nodes, and assigns tasks to them. It should deploy on reliable hardware as it is the centerpiece of HDFS.
- DataNode – It works as Slave in Hadoop cluster. In Hadoop HDFS, DataNode is responsible for storing actual data in HDFS. It also performs read and writes operation as per request for the clients. DataNodes can deploy on commodity hardware.
2) What are the key features of HDFS?
The various Features of HDFS are:
- Fault Tolerance – In Apache Hadoop HDFS, Fault-tolerance is working strength of a system in unfavorable conditions. Hadoop HDFS is highly fault-tolerant, in HDFS data is divided into blocks and multiple copies of blocks are created on different machines in the cluster. If any machine in the cluster goes down due to unfavorable conditions, then a client can easily access their data from other machines which contain the same copy of data blocks.
- High Availability – HDFS is highly available file system; data gets replicated among the nodes in the HDFS cluster by creating a replica of the blocks on the other slaves present in the HDFS cluster. Hence, when a client wants to access his data, they can access their data from the slaves which contains its blocks and which is available on the nearest node in the cluster. At the time of failure of a node, a client can easily access their data from other nodes.
- Data Reliability – HDFS is a distributed file system which provides reliable data storage. HDFS can store data in the range of 100s petabytes. It stores data reliably by creating a replica of each and every block present on the nodes and hence, provides fault tolerance facility.
- Replication – Data replication is one of the most important and unique features of HDFS. In HDFS, replication data is done to solve the problem of data loss in unfavorable conditions like crashing of the node, hardware failure and so on.
- Scalability – HDFS stores data on multiple nodes in the cluster, when requirement increases we can scale the cluster. There are two scalability mechanisms available: vertical and horizontal.
- Distributed Storage – In HDFS all the features are achieved via distributed storage and replication. In HDFS data is stored in distributed manner across the nodes in the HDFS cluster.
3) What is the difference between NAS and HDFS?
- Hadoop distributed file system (HDFS) is the primary storage system of Hadoop. HDFS designs to store very large files running on a cluster of commodity hardware. While Network-attached storage (NAS) is a file-level computer data storage server. NAS provides data access to a heterogeneous group of clients.
- HDFS distribute data blocks across all the machines in a cluster. Whereas NAS, data stores on a dedicated hardware.
- Hadoop HDFS is designed to work with MapReduce Framework. In MapReduce Framework computation move to the data instead of Data to computation. NAS is not suitable for MapReduce, as it stores data separately from the computations.
- Hadoop HDFS runs on the cluster commodity hardware which is cost effective. While a NAS is a high-end storage device which includes high cost.
4) List the various HDFS daemons in HDFS cluster?
The daemon runs in HDFS cluster are as follows:
- NameNode – It is the master node. It is responsible for storing the metadata of all the files and directories. It also has information about blocks, their location, replicas and other detail.
- Datanode – It is the slave node that contains the actual data. DataNode also performs read and write operation as per request for the clients.
- Secondary NameNode – Secondary NameNode download the FsImage and EditLogs from the NameNode. Then it merges EditLogs with the FsImage periodically. It keeps edits log size within a limit. It stores the modified FsImage into persistent storage. which we can use FsImage in the case of NameNode failure.
5) What is NameNode and DataNode in HDFS?
NameNode – It works as Master in Hadoop cluster. Below listed are the main function performed by NameNode:
- Stores metadata of actual data. E.g. Filename, Path, No. of blocks, Block IDs, Block Location, No. of Replicas, and also Slave related configuration.
- It also manages Filesystem namespace.
- Regulates client access request for actual file data file.
- It also assigns work to Slaves (DataNode).
- Executes file system namespace operation like opening/closing files, renaming files/directories.
- As Name node keep metadata in memory for fast retrieval. So it requires the huge amount of memory for its operation. It should also host on reliable hardware.
DataNode – It works as Slave in Hadoop cluster. Below listed are the main function performed by DataNode:
- Actually, stores Business data.
- It is actual worker node, so it handles Read/Write/Data processing.
- Upon instruction from Master, it performs creation/replication/deletion of data blocks.
- As DataNode store all the Business data, so it requires the huge amount of storage for its operation. It should also host on Commodity hardware.
6) What do you mean by metadata in HDFS?
In Apache Hadoop HDFS, metadata shows the structure of HDFS directories and files. It provides the various information about directories and files like permissions, replication factor. NameNode stores metadata Files which are as follows:
- FsImage – FsImage is an “Image file”. It contains the entire filesystem namespace and stored as a file in the namenode’s local file system. It also contains a serialized form of all the directories and file inodes in the filesystem. Each inode is an internal representation of file or directory’s metadata.
- EditLogs – EditLogs contains all the recent modifications made to the file system of most recent FsImage. When namenode receives a create/update/delete request from the client. Then this request is first recorded to edits file.
7) What is Block in HDFS?
Block is a continuous location on the hard drive where data is stored. In general, FileSystem store data as a collection of blocks. In a similar way, HDFS stores each file as blocks, and distributes it across the Hadoop cluster. HDFS client does not have any control on the block like block location. NameNode decides all such things.
The default size of the HDFS block is 128 MB, which we can configure as per the requirement. All blocks of the file are of the same size except the last block, which can be the same size or smaller.
If the data size is less than the block size, then block size will be equal to the data size. For example, if the file size is 129 MB, then 2 blocks will be created for it. One block will be of default size 128 MB and other will be 1 MB only and not 128 MB as it will waste the space (here block size is equal to data size). Hadoop is intelligent enough not to waste rest of 127 MB. So it is allocating 1Mb block only for 1MB data.
The major advantages of storing data in such block size are that it saves disk seek time.
8) Why is Data Block size set to 128 MB in Hadoop?
Because of the following reasons Block size is 128 MB:
- To reduce the disk seeks (IO). Larger the block size, lesser the file blocks. Thus, less number of disk seeks. And block can transfer within respectable limits and that to parallelly.
- HDFS have huge data sets, i.e. terabytes and petabytes of data. If we take 4 KB block size for HDFS, just like Linux file system, which has 4 KB block size. Then we would be having too many blocks and therefore too much of metadata. Managing this huge number of blocks and metadata will create huge overhead. Which is something which we don’t want? So, the block size is set to 128 MB.
On the other hand, block size can’t be so large. Because the system will wait for a very long time for the last unit of data processing to finish its work.
9) What is the difference between a MapReduce InputSplit and HDFS block?
- Block- Block in Hadoop is the continuous location on the hard drive where HDFS store data. In general, FileSystem store data as a collection of blocks. In a similar way, HDFS stores each file as blocks, and distributes it across the Hadoop cluster.
- InputSplit- InputSplit represents the data which individual Mapper will process. Further split divides into records. Each record (which is a key-value pair) will be processed by the map.
- Block- It is the physical representation of data.
- InputSplit- It is the logical representation of data. Thus, during data processing in MapReduce program or other processing techniques use InputSplit. In MapReduce, important thing is that InputSplit does not contain the input data. Hence, it is just a reference to the data.
- Block- The default size of the HDFS block is 128 MB which is configured as per our requirement. All blocks of the file are of the same size except the last block. The last Block can be of same size or smaller. In Hadoop, the files split into 128 MB blocks and then stored into Hadoop Filesystem.
- InputSplit- Split size is approximately equal to block size, by default.
Consider an example, where we need to store the file in HDFS. HDFS stores files as blocks. Block is the smallest unit of data that can store or retrieved from the disk. The default size of the block is 128MB. HDFS break files into blocks and stores these blocks on different nodes in the cluster. We have a file of 130 MB, so HDFS will break this file into 2 blocks.
Now, if one wants to perform MapReduce operation on the blocks, it will not process, as the 2nd block is incomplete. InputSplit solves this problem. InputSplit will form a logical grouping of blocks as a single block. As the InputSplit include a location for the next block. It also includes the byte offset of the data needed to complete the block.
From this, we can conclude that InputSplit is only a logical chunk of data. i.e. it has just the information about blocks address or location. Thus, during MapReduce execution, Hadoop scans through the blocks and create InputSplits.
10) How can one copy a file into HDFS with a different block size to that of existing block size configuration?
By using bellow commands one can copy a file into HDFS with a different block size:
–Ddfs.blocksize=block_size, where block_size is in bytes.
So, consider an example to explain it in detail:
Suppose, you want to copy a file called test.txt of size, say of 128 MB, into the hdfs. And for this file, you want the block size to be 32MB (33554432 Bytes) in place of the default (128 MB). So, can issue the following command:
Hadoop fs –Ddfs.blocksize=33554432-copyFromlocal/home/dataflair/test.txt/sample_hdfs.
Now, you can check the HDFS block size associated with this file by:
hadoop fs –stat %o/sample_hdfs/test.txt
You can also check it by using the NameNode web UI for seeing the HDFS directory.
11) Which one is the master node in HDFS? Can it be commodity hardware?
Name node is the master node in HDFS. The NameNode stores metadata and works as High Availability machine in HDFS. It requires high memory (RAM) space, so NameNode needs to be a high-end machine with good memory space. It cannot be a commodity as the entire HDFS works on it.
12) In HDFS, how Name node determines which data node to write on?
Namenode contains Metadata i.e. number of blocks, replicas, their location, and other details. This meta-data is available in memory in the master for faster retrieval of data. NameNode maintains and manages the Datanodes, and assigns tasks to them.
13) What is a Heartbeat in Hadoop?
Heartbeat is the signals that NameNode receives from the DataNodes to show that it is functioning (alive).
NameNode and DataNode do communicate using Heartbeat. If after a certain time of heartbeat DataNode does not send any response to NameNode, then that Node is dead. So, NameNode in HDFS will create new replicas of those blocks on other DataNodes.
Heartbeats carry information about total storage capacity. It also, carry the fraction of storage in use, and the number of data transfers currently in progress.
The default heartbeat interval is 3 seconds. One can change it by using dfs.heartbeat.interval in hdfs-site.xml.
14) Can multiple clients write into an Hadoop HDFS file concurrently?
Multiple clients cannot write into an Hadoop HDFS file at same time. Apache Hadoop follows single writer multiple reader models. When HDFS client opens a file for writing, then NameNode grant a lease. Now suppose, some other client wants to write into that file. It asks NameNode for a write operation in Hadoop. NameNode first checks whether it has granted the lease for writing into that file to someone else or not. If someone else acquires the lease, then it will reject the write request of the other client.
15) How data or file is read in Hadoop HDFS?
To read from HDFS, the first client communicates to namenode for metadata. The Namenode responds with details of No. of blocks, Block IDs, Block Location, No. of Replicas. Then, the client communicates with Datanode where the blocks are present. Clients start reading data parallel from the Datanode. It read on the basis of information received from the namenodes.
Once an application or HDFS client receives all the blocks of the file, it will combine these blocks to form a file. To improve read performance, the location of each block ordered by their distance from the client. HDFS selects the replica which is closest to the client. This reduces the read latency and bandwidth consumption. It first read the block in the same node. Then another node in the same rack, and then finally another Datanode in another rack.
16) Does HDFS allow a client to read a file which is already opened for writing?
Yes, the client can read the file which is already opened for writing. But, the problem in reading a file which is currently open for writing, lies in the consistency of data. HDFS does not provide the surety that the data which it has written into the file will be visible to a new reader. For this, one can call the hflush operation. It will push all the data in the buffer into write pipeline. Then the hflush operation will wait for acknowledgments from the datanodes. Hence, by doing this, the data that client has written into the file before the hflush operation visible to the reader for sure.
17) Why is Reading done in parallel and writing is not in HDFS?
Client read data parallelly because by doing so the client can access the data fast. Reading in parallel makes the system fault tolerant. But the client does not perform the write operation in Parallel. Because writing in parallel might result in data inconsistency.
Suppose, you have a file and two nodes are trying to write data into a file in parallel. Then the first node does not know what the second node has written and vice-versa. So, we can not identify which data to store and access.
Client in Hadoop writes data in pipeline anatomy. There are various benefits of a pipeline write:
- More efficient bandwidth consumption for the client – The client only has to transfer one replica to the first datanode in the pipeline write. So, each node only gets and send one replica over the network (except the last datanode only receives data). This results in balanced bandwidth consumption. As compared to the client writing three replicas into three different datanodes.
- Smaller sent/ack window to maintain – The client maintain a much smaller sliding window. Sliding window record which blocks in the replica is sending to the DataNodes. It also records which blocks are waiting for acks to confirm the write has been done. In a pipeline write, the client appears to write data to only one datanode.
18) What is the problem with small files in Apache Hadoop?
Hadoop is not suitable for small data. Hadoop HDFS lacks the ability to support the random reading of small files. Small file in HDFS is smaller than the HDFS block size (default 128 MB). If we are storing these huge numbers of small files, HDFS can’t handle these lots of files. HDFS works with the small number of large files for storing large datasets. It is not suitable for a large number of small files. A large number of many small files overload NameNode since it stores the namespace of HDFS.
- HAR (Hadoop Archive) Files – HAR files deal with small file issue. HAR has introduced a layer on top of HDFS, which provides interface for file accessing. Using Hadoop archive command we can create HAR files. These file runs a MapReduce job to pack the archived files into a smaller number of HDFS files. Reading through files in as HAR is not more efficient than reading through files in HDFS.
- Sequence Files – Sequence Files also deal with small file problem. In this, we use the filename as key and the file contents as the value. Suppose we have 10,000 files, each of 100 KB, we can write a program to put them into a single sequence file. Then one can process them in a streaming fashion.
19) What is throughput in HDFS?
The amount of work done in a unit time is known as Throughput. Below are the reasons due to HDFS provides good throughput:
- Hadoop works on Data Locality principle. This principle state that moves computation to data instead of data to computation. This reduces network congestion and therefore, enhances the overall system throughput.
- The HDFS is Write Once and Read Many Model. It simplifies the data coherency issues as the data written once, one can not modify it. Thus, provides high throughput data access.
20) Comparison between Secondary NameNode and Checkpoint Node in Hadoop?
Secondary NameNode download the FsImage and EditLogs from the NameNode. Then it merges EditLogs with the FsImage periodically. Secondary NameNode stores the modified FsImage into persistent storage. So, we can use FsImage in the case of NameNode failure. But it does not upload the merged FsImage with EditLogs to active namenode. While Checkpoint node is a node which periodically creates checkpoints of the namespace.
Checkpoint Node in Hadoop first downloads FsImage and edits from the active NameNode. Then it merges them (FsImage and edits) locally, and at last, it uploads the new image back to the active namenode.
21) What is a Backup node in Hadoop?
Backup node provides the same checkpointing functionality as the Checkpoint node (Checkpoint node is a node which periodically creates checkpoints of the namespace. Checkpoint Node downloads FsImage and edits from the active NameNode, merges them locally, and uploads the new image back to the active NameNode). In Hadoop, Backup node keeps an in-memory, up-to-date copy of the file system namespace, which is always synchronized with the active NameNode state.
The Backup node does not need to download FsImage and edits files from the active NameNode in order to create a checkpoint, as would be required with a Checkpoint node or Secondary Namenode since it already has an up-to-date state of the namespace state in memory. The Backup node checkpoint process is more efficient as it only needs to save the namespace into the local FsImage file and reset edits. One Backup node is supported by the NameNode at a time. No checkpoint nodes may be registered if a Backup node is in use.
22) How does HDFS ensure Data Integrity of data blocks stored in HDFS?
Data Integrity ensures the correctness of the data. But, it is possible that the data will get corrupted during I/O operation on the disk. Corruption can occur due to various reasons network faults, or buggy software. Hadoop HDFS client software implements checksum checking on the contents of HDFS files.
In Hadoop, when a client creates an HDFS file, it computes a checksum of each block of the file. Then stores these checksums in a separate hidden file in the same HDFS namespace. When a client retrieves file contents it first checks. Then it verifies that the data it received from each Datanode matches the checksum. Checksum stored in the associated checksum file. And if not, then the client can opt to retrieve that block from another DataNode that has a replica of that block.
23) What do you mean by the NameNode High Availability in hadoop?
In Hadoop 1.x, NameNode is a single point of Failure (SPOF). If namenode fails, all clients would be unable to read, write file or list files. In such event, the whole Hadoop system would be out of service until new namenode is up.
Hadoop 2.0 overcomes SPOF. Hadoop 2.x provide support for multiple NameNode. High availability feature gives an extra NameNode (active standby NameNode) to Hadoop architecture. This extra NameNode configured for automatic failover. If active NameNode fails, then Standby Namenode takes all its responsibility. And cluster work continuously.
The initial implementation of namenode high availability provided for single active/standby namenode. However, some deployment requires high degree fault-tolerance. Hadoop 3.x enable this feature by allowing the user to run multiple standby namenode. The cluster tolerates the failure of 2 nodes rather than 1 by configuring 3 namenode & 5 journal nodes.
24) What is Fault Tolerance in Hadoop HDFS?
Fault-tolerance in HDFS is working strength of a system in unfavorable conditions. Unfavorable conditions are the crashing of the node, hardware failure and so on. HDFS control faults by the process of replica creation. When client stores a file in HDFS, Hadoop framework divide the file into blocks. Then client distributes data blocks across different machines present in HDFS cluster. And, then create the replica of each block is on other machines present in the cluster.
HDFS, by default, creates 3 copies of a block on other machines present in the cluster. If any machine in the cluster goes down or fails due to unfavorable conditions. Then also, the user can easily access that data from other machines in which replica of the block is present.
25) Describe HDFS Federation.
In Hadoop 1.0, HDFS architecture for entire cluster allows only single namespace.
- Namespace layer and storage layer are tightly coupled. This makes alternate implementation of namenode difficult. It also restricts other services to use block storage directly.
- A namespace is not scalable like datanode. Scaling in HDFS cluster is horizontally by adding datanodes. But we can’t add more namespace to an existing cluster.
- There is no separation of the namespace. So, there is no isolation among tenant organization that is using the cluster.
In Hadoop 2.0, HDFS Federation overcomes this limitation. It supports too many NameNode/ Namespaces to scale the namespace horizontally. In HDFS federation isolate different categories of application and users to different namespaces. This improves Read/ write operation throughput adding more namenodes.
26) What is the default replication factor in Hadoop and how will you change it?
The default replication factor is 3. One can change replication factor in following three ways:
- By adding this property to hdfs-site.xml:
<property></pre> <pre><name>dfs.replication</name></pre> <pre><value>5</value></pre> <pre><description>Block Replication</description></pre> <pre></property>
- One can also change the replication factor on a per-file basis using the command:
hadoop fs –setrep –w 3 / file_location
- One can also change replication factor for all the files in a directory by using:
hadoop fs –setrep –w 3 –R / directoey_location
27) Why Hadoop performs replication, although it results in data redundancy?
In HDFS, Replication provides the fault tolerance. Replication is one of the unique features of HDFS. Data Replication solves the issue of data loss in unfavorable conditions. Unfavorable conditions are the hardware failure, crashing of the node and so on.
HDFS by default creates 3 replicas of each block across the cluster in Hadoop. And we can change it as per the need. So if any node goes down, we can recover data on that node from the other node. In HDFS, Replication will lead to the consumption of a lot of space. But the user can always add more nodes to the cluster if required. It is very rare to have free space issues in practical cluster. As the very first reason to deploy HDFS was to store huge data sets. Also, one can change the replication factor to save HDFS space. Or one can also use different codec provided by the Hadoop to compress the data.
28) What is Rack Awareness in Apache Hadoop?
In Hadoop, Rack Awareness improves the network traffic while reading/writing file. In Rack Awareness NameNode chooses the DataNode which is closer to the same rack or nearby rack. NameNode achieves Rack information by maintaining the rack ids of each DataNode. Thus, this concept chooses Datanodes based on the Rack information.
HDFS NameNode makes sure that all the replicas are not stored on the single rack or same rack. It follows Rack Awareness Algorithm to reduce latency as well as fault tolerance.
Default replication factor is 3. Therefore according to Rack Awareness Algorithm:
- The first replica of the block will store on a local rack.
- The next replica will store on another datanode within the same rack.
- And the third replica stored on the different rack.
In Hadoop, we need Rack Awareness for below reason: It improves-
- Data high availability and reliability.
- The performance of the cluster.
- Network bandwidth.
29) Explain the Single point of Failure in Hadoop?
In Hadoop 1.0, NameNode is a single point of Failure (SPOF). If namenode fails, all clients would unable to read/write files. In such event, whole Hadoop system would be out of service until new namenode is up.
Hadoop 2.0 overcomes this SPOF by providing support for multiple NameNode. High availability feature provides an extra NameNode to hadoop architecture. This feature provides automatic failover. If active NameNode fails, then Standby-Namenode takes all the responsibility of active node. And cluster continues to work.
The initial implementation of Namenode high availability provided for single active/standby namenode. However, some deployment requires high degree fault-tolerance. So new version 3.0 enable this feature by allowing the user to run multiple standby namenode. The cluster tolerates the failure of 2 nodes rather than 1 by configuring 3 namenode & 5 journalnodes.
30) Explain Erasure Coding in Apache Hadoop?
For several purposes, HDFS, by default, replicates each block three times. Replication also provides the very simple form of redundancy to protect against the failure of datanode. But replication is very expensive. 3 x replication scheme results in 200% overhead in storage space and other resources.
Hadoop 2.x introduced a new feature called “Erasure Coding” to use in the place of Replication. It also provides the same level of fault tolerance with less space store and 50% storage overhead.
Erasure Coding uses Redundant Array of Inexpensive Disk (RAID). RAID implements Erasure Coding through striping. In which it divides logical sequential data (such as a file) into the smaller unit (such as bit, byte or block). After that, it stores data on different disk.
Encoding – In this, RAID calculates and sort Parity cells for each strip of data cells. Then, recover error through the parity. Erasure coding extends a message with redundant data for fault tolerance. Its codec operates on uniformly sized data cells. In Erasure Coding, codec takes a number of data cells as input and produces parity cells as the output.
There are two algorithms available for Erasure Coding:
- XOR Algorithm
- Reed-Solomon Algorithm
31) What is Balancer in Hadoop?
Data may not always distribute uniformly across the datanodes in HDFS due to following reasons:
- A lot of deletes and writes
- Disk replacement
Data Blocks allocation strategy tries hard to spread new blocks uniformly among all the datanodes. In a large cluster, each node has different capacity. While quite often you need to delete some old nodes, also add new nodes for more capacity.
The addition of new datanode becomes a bottleneck due to below reason:
- When Hadoop framework allocates all the new blocks and read from new datanode. This will overload the new datanode.
HDFS provides a tool called Balancer that analyzes block placement and balances across the datanodes.
32) What is Disk Balancer in Apache Hadoop?
Disk Balancer is a command line tool, which distributes data evenly on all disks of a datanode. This tool operates against a given datanode and moves blocks from one disk to another.
Disk balancer works by creating and executing a plan (set of statements) on the datanode. Thus, the plan describes how much data should move between two disks. A plan composes multiple steps. Move step has source disk, destination disk and the number of bytes to move. And the plan will execute against an operational datanode.
By default, disk balancer is not enabled. Hence, to enable diskbalnecr dfs.disk.balancer.enabled must be set true in hdfs-site.xml.
When we write new block in hdfs, then, datanode uses volume choosing the policy to choose the disk for the block. Each directory is the volume in hdfs terminology. Thus, two such policies are: Round-robin and Available space
- Round-robin distributes the new blocks evenly across the available disks.
- Available space writes data to the disk that has maximum free space (by percentage).
33) What is active and passive NameNode in Hadoop?
In Hadoop 1.0, NameNode is a single point of Failure (SPOF). If namenode fails, then all clients would be unable to read, write file or list files. In such event, whole Hadoop system would be out of service until new namenode is up.
Hadoop 2.0 overcomes this SPOF. Hadoop 2.0 provides support for multiple NameNode. High availability feature provides an extra NameNode to Hadoop architecture for automatic failover.
- Active NameNode – It is the NameNode which works and runs in the cluster. It is also responsible for all client operations in the cluster.
- Passive NameNode – It is a standby namenode, which has similar data as active NameNode. It simply acts as a slave, maintains enough state to provide a fast failover, if necessary.
If Active NameNode fails, then Passive NameNode takes all the responsibility of active node. The cluster works continuously.
34) How is indexing done in Hadoop HDFS?
Apache Hadoop has a unique way of indexing. Once Hadoop framework store the data as per the Data Bock size. HDFS will keep on storing the last part of the data which will say where the next part of the data will be. In fact, this is the base of HDFS.
35) What is a Block Scanner in HDFS?
Block scanner verify whether the data blocks stored on each DataNodes are correct or not. When Block scanner detects corrupted data block, then following steps occur:
- First of all, DataNode report NameNode about the corrupted block.
- After that, NameNode will start the process of creating a new replica. It creates new replica using the correct replica of the corrupted block present in other DataNodes.
- When the replication count of the correct replicas matches the replication factor 3, then delete corrupted block
36) How to perform the inter-cluster data copying work in HDFS?
HDFS use distributed copy command to perform the inter-cluster data copying. That is as below:
hadoop distcp hdfs://<source NameNode> hdfs://<target NameNode>
DistCp (distributed copy) is a tool also used for large inter/intra-cluster copying. It uses MapReduce to affect its distribution, error handling and recovery and reporting. This distributed copy tool enlarges a list of files and directories into the input to map tasks.
37) What are the main properties of hdfs-site.xml file?
hdfs-site.xml – It specifies configuration setting for HDFS daemons in Hadoop. It also provides default block replication and permission checking on HDFS.
The three main hdfs-site.xml properties are:
- dfs.name.dir gives you the location where NameNode stores the metadata (FsImage and edit logs). It also specifies where DFS should locate, on the disk or onto the remote directory.
- dfs.data.dir gives the location of DataNodes where it stores the data.
- fs.checkpoint.dir is the directory on the file system. Hence, on this directory secondary NameNode stores the temporary images of edit logs.
38) How can one check whether NameNode is working or not?
One can check the status of the HDFS NameNode in several ways. Most usually, one uses the jps command to check the status of all daemons running in the HDFS.
39) How would you restart NameNode?
NameNode is also known as Master node. It stores meta-data i.e. number of blocks, replicas, and other details. NameNode maintains and manages the slave nodes, and assigns tasks to them.
By following two methods, you can restart NameNode:
- First stop the NameNode individually using ./sbin/hadoop-daemons.sh stop namenode command. Then, start the NameNode using ./sbin/hadoop-daemons.sh start namenode command.
- Use ./sbin/stop-all.sh and then use ./sbin/start-all.sh command which will stop all the demons first. Then start all the daemons.
40) How NameNode tackle Datanode failures in Hadoop?
HDFS has a master-slave architecture in which master is namenode and slave are datanode. HDFS cluster has single namenode that manages file system namespace (metadata) and multiple datanodes that are responsible for storing actual data in HDFs and performing the read-write operation as per request for the clients.
NameNode receives Heartbeat and block report from Datanode. Heartbeat receipt implies that the datanode is alive and functioning properly and block report contains a list of all blocks on a datanode. When NameNode observes that DataNode has not sent heartbeat message after a certain amount of time, the datanode is marked as dead. The namenode replicates the blocks of the dead node to another datanode using the replica created earlier. Hence, NameNode can easily handle Datanode failure.
41) Is Namenode machine same as DataNode machine as in terms of hardware in Hadoop?
NameNode is highly available server, unlike DataNode. NameNode manages the File System Namespace. It also maintains the metadata information. Metadata information is the number of blocks, their location, replicas and other details. It also executes file system execution such as naming, closing, opening files/directories.
Because of the above reasons, NameNode requires higher RAM for storing the metadata for millions of files. Whereas, DataNode is responsible for storing actual data in HDFS. It performs read and write operation as per request of the clients. Therefore, Datanode needs to have a higher disk capacity for storing huge data sets.
42) If DataNode increases, then do we need to upgrade NameNode in Hadoop?
Namenode stores meta-data i.e. number of blocks, their location, replicas. In Hadoop, meta-data is present in memory in the master for faster retrieval of data. NameNode manages and maintains the slave nodes, and assigns tasks to them. It regulates client’s access to files.
It also executes file system execution such as naming, closing, opening files/directories. During Hadoop installation, framework determines NameNode based on the size of the cluster. Mostly we don’t need to upgrade the NameNode because it does not store the actual data. But it stores the metadata, so such requirement rarely arise.
43) Explain what happens if, during the PUT operation, HDFS block is assigned a replication factor 1 instead of the default value 3?
Replication factor can be set for the entire cluster to adjust the number of replicated block. It ensures high data availability.
The cluster will have n-1 duplicated blocks, for every block that are present in HDFS. So, if the replication factor during PUT operation is set to 1 in place of the default value 3. Then it will have a single copy of data. If one set replication factor 1. And if DataNode crashes under any circumstances, then an only single copy of the data would lose.
44) What are file permissions in HDFS? how does HDFS check permissions for files/directory?
Hadoop distributed file system (HDFS) implements a permissions model for files/directories.
For each files/directory, one can manage permissions for a set of 3 distinct user classes: owner, group, others
There are also 3 different permissions for each user class: Read (r), write (w), execute(x)
- For files, the w permission is to write to the file and the r permission is to read the file.
- For directories, w permission is to create or delet the directory. The r permission is to list the contents of the directory.
- The X permission is to access a child of the directory.
HDFS check permissions for files or directory:
- We can also check the owner’s permissions, if the user name matches the owner of directory.
- If the group matches the directory’s group, then Hadoop tests the user’s group permissions.
- Hadoop tests the “other” permission, when owner and the group names doesn’t match.
- If none of the permissions checks succeed, then client’s request is denied.
45) How one can format Hadoop HDFS?
One can format HDFS by using bin/hadoop namenode –format command.
$bin/hadoop namenode –format$ command formats the HDFS via NameNode.
Formatting implies initializing the directory specified by the dfs.name.dir variable. When you run this command on existing files system, then, you will lose all your data stored on your NameNode.
Hadoop NameNode directory contains the FsImage and edit files. This hold the basic information’s about Hadoop file system. So, this basic information includes like which user created files etc.
Hence, when we format the NameNode, then it will delete above information from directory. This information is present in the hdfs-site.xml as dfs.namenode.name.dir. So, formatting a NameNode will not format the DataNode.
NOTE: Never format, up and running Hadoop Filesystem. You will lose data stored in the HDFS.
46) What is the process to change the files at arbitrary locations in HDFS?
HDFS doesn’t support modifications at arbitrary offsets in the file or multiple writers. But a single writer writes files in append-only format. Writes to a file in HDFS are always made at the end of the file.
47) Differentiate HDFS & HBase.
Data write process
- HDFS- Append method
- HBase- Bulk incremental, random write
Data read process
- HDFS- Table scan
- HBase- Table scan/random read/small range scan
Hive SQL querying
- HDFS- Excellent
- HBase- Average
48) What is meant by streaming access?
HDFS works on the principle of “write once, read many”. Its focus is on fast and accurate data retrieval. Steaming access means reading the complete data instead of retrieving a single record from the database.
49) How to transfer data from Hive to HDFS?
One can transfer data from Hive by writing the query:
hive> insert overwrite directory ‘/’ select * from emp;
Hence, the output you receive will be stored in part files in the specified HDFS path.
50) How to add/delete a Node to the existing cluster?
To add a Node to the existing cluster follow:
Add the host name/Ip address in dfs.hosts/slaves file. Then, refresh the cluster with
$hadoop dfsamin -refreshNodes
To delete a Node to the existing cluster follow:
Add the hostname/Ip address to dfs.hosts.exclude/remove the entry from slaves file. Then, refresh the cluster with $hadoop dfsamin -refreshNodes
$hadoop dfsamin -refreshNodes
51) How to format the HDFS? How frequently it will be done?
$hadoop namnode -format.
Note: Format the HDFS only once that to during initial cluster setup.
52) What is the importance of dfs.namenode.name.dir in HDFS?
dfs.namenode.name.dir contains the fsimage file for namenode.
We should configure it to write to atleast two filesystems on physical hosts/namenode/secondary namenode. Because if we lose FsImage file we will lose entire HDFS file system. Also there is no other recovery mechanism if there is no FsImage file available.