What is Secondary NameNode in Hadoop?

Viewing 4 reply threads
  • Author
    Posts
    • #5760
      DataFlair Team
      Spectator

      How does the Secondary NameNode solve the NameNode's problem?
      Why is the Secondary NameNode used in Hadoop?
      Is the Secondary NameNode a hot standby of the NameNode?

    • #5762
      DataFlair Team
      Spectator

      From its name, we might assume that the Secondary NameNode is a backup node, but it is not. First, a brief overview of the NameNode.

      The NameNode holds the HDFS metadata, such as block information and file sizes. This information is kept in main memory and also on disk for persistence.

      On disk, the information is stored in two different files:

      • Edit log: keeps track of every change made to HDFS.
      • Fsimage: stores a snapshot of the file system.

      Every change made to HDFS is recorded in the edit log, so that file keeps growing, whereas the fsimage stays the same. This has no impact until we restart the server. On restart, the edit log is applied to the fsimage and the result is loaded into main memory, which takes some time. If the cluster is restarted after a long time, the edit log will have grown very large and the downtime will be long. The Secondary NameNode exists to solve this problem.

      The Secondary NameNode periodically fetches the edit log from the NameNode and merges it into the fsimage. The new fsimage is then copied back to the NameNode, which uses it at the next restart, reducing the startup time. It is a helper node for the NameNode. To be precise, the Secondary NameNode's whole purpose is to create checkpoints in HDFS, which helps the NameNode function efficiently. For this reason it is better thought of as a checkpoint node.
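
      The restart cost described in this post can be illustrated with a toy model. This is only a sketch: the dictionary-based fsimage, the tuple-based edit log, and the function names are assumptions made for illustration and do not reflect the binary file formats or the code HDFS actually uses.

```python
# Toy model only: real HDFS stores binary fsimage/edits files, not Python objects.

def apply_edit(namespace, edit):
    """Apply one logged change to the in-memory namespace (path -> size)."""
    op, path, size = edit
    if op == "create":
        namespace[path] = size
    elif op == "delete":
        namespace.pop(path, None)


def load_at_startup(fsimage, edit_log):
    """Rebuild the namespace as a restart does: load the snapshot, replay every edit."""
    namespace = dict(fsimage)
    for edit in edit_log:
        apply_edit(namespace, edit)
    return namespace, len(edit_log)


fsimage = {"/data/a.txt": 128}
edit_log = [
    ("create", "/data/b.txt", 64),
    ("delete", "/data/a.txt", 0),
    ("create", "/data/c.txt", 256),
]

namespace, replayed = load_at_startup(fsimage, edit_log)
print(namespace)  # {'/data/b.txt': 64, '/data/c.txt': 256}
print(replayed)   # 3 edits had to be replayed before the namespace was ready
```

      The longer the edit log, the more edits the loop has to replay. A checkpoint amounts to saving the rebuilt namespace as the new fsimage so that the next startup begins with an empty, or at least much shorter, edit log.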

    • #5763
      DataFlair Team
      Spectator

      In Hadoop 1.0 the NameNode was a single point of failure. But in Hadoop 2

      There are two important files in the NameNode's current directory:

      1. FsImage file: a snapshot of the HDFS metadata at a certain point in time.
      2. Edits Log file: a record of the changes that have been made to the HDFS namespace.

      The main function of the Secondary namenode is to store the latest copy of the FsImage and the Edits Log files.

      How does it help?

      When the NameNode is restarted, the latest Edits Log files are applied to the FsImage file to bring the HDFS metadata up to date.
      So it is very important to keep a copy of these two files, which the Secondary NameNode does.

      To keep the latest versions of these two files, the Secondary NameNode takes a checkpoint every hour, which is the default interval.

      Checkpoint:-

      A checkpoint is simply the process of updating the latest FsImage file by applying the latest Edits Log files to it. If the interval between checkpoints is large, too many Edits Log files will accumulate, and applying them all at once to the latest FsImage file becomes cumbersome and time consuming. This may lead to a long startup time for the primary NameNode after a reboot.

      However, the Secondary NameNode is just a helper to the primary NameNode in an HDFS cluster, as it cannot perform all the functions of the primary NameNode.
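
      The decision of when to take a checkpoint can be sketched as follows. The hourly gap comes from the post above; the transaction-count trigger is an addition based on the `dfs.namenode.checkpoint.period` and `dfs.namenode.checkpoint.txns` settings as I understand them from the Hadoop 2.x `hdfs-default.xml`, so treat both defaults as assumptions to verify against your own release. The function name is hypothetical.

```python
# Assumed defaults (Hadoop 2.x hdfs-default.xml; verify for your version):
#   dfs.namenode.checkpoint.period = 3600 seconds
#   dfs.namenode.checkpoint.txns   = 1,000,000 transactions
CHECKPOINT_PERIOD_SECONDS = 3600
CHECKPOINT_TXNS = 1_000_000


def should_checkpoint(seconds_since_last, uncheckpointed_txns,
                      period=CHECKPOINT_PERIOD_SECONDS, txns=CHECKPOINT_TXNS):
    """Return True when either the time gap or the transaction count is reached."""
    return seconds_since_last >= period or uncheckpointed_txns >= txns


print(should_checkpoint(600, 2_000))      # False: neither threshold reached
print(should_checkpoint(3600, 2_000))     # True: the hourly gap has elapsed
print(should_checkpoint(600, 1_500_000))  # True: too many unmerged edits
```

      A shorter period or a lower transaction threshold keeps the Edits Log small at the cost of more frequent merges.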

      Note:-

      There are two options that can be used with the secondary namenode command:

      1. -geteditsize: reports the current size of the edits_inprogress file in the NameNode's current directory. Here, edits_inprogress is the Edits Log file that is currently being written.

      2. -checkpoint [force]: with force, this option checkpoints the Secondary NameNode to the latest state of the primary NameNode, whatever the size of the Edits Log file may be. Without force, the checkpoint runs only if the size of the Edits Log file is greater than or equal to the configured checkpoint size.
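
      As a small illustration of the two options, the helper below only builds the argument list and does not run anything. It assumes the `hdfs secondarynamenode` launcher found in Hadoop 2 and later (older releases used `hadoop secondarynamenode`); the function name is hypothetical.

```python
def secondary_namenode_command(option, force=False):
    """Build (but do not run) an argv list for the secondary namenode command.

    Assumes the `hdfs` launcher is the entry point; adjust for your install.
    """
    base = ["hdfs", "secondarynamenode"]
    if option == "geteditsize":
        if force:
            raise ValueError("force only applies to the checkpoint option")
        return base + ["-geteditsize"]
    if option == "checkpoint":
        argv = base + ["-checkpoint"]
        if force:
            argv.append("force")
        return argv
    raise ValueError(f"unsupported option: {option}")


print(" ".join(secondary_namenode_command("geteditsize")))
print(" ".join(secondary_namenode_command("checkpoint", force=True)))
```

      On a host where Hadoop is installed and configured, the resulting list could be passed to `subprocess.run(argv, check=True)`.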

    • #5764
      DataFlair Team
      Spectator

      The NameNode holds the metadata for HDFS, such as namespace information and block information. When in use, all of this information is held in main memory, but it is also stored on disk for persistence.

      fsimage – the snapshot of the file system taken when the NameNode started
      Edit logs – the sequence of changes made to the file system after the NameNode started

      Edit logs are applied to the fsimage only when the NameNode restarts, to obtain the latest snapshot of the file system. But NameNode restarts are rare in production clusters, which means the edit logs can grow very large on clusters where the NameNode runs for a long time. In this situation we encounter the following issues:

      1. The edit log becomes very large and is difficult to manage.
      2. A NameNode restart takes a long time because a lot of changes have to be merged.
      3. In case of a crash, we lose a huge amount of metadata because the fsimage is very old.

      The Secondary NameNode overcomes these issues by taking over from the NameNode the responsibility of merging the edit logs with the fsimage:
      1. It gets the edit logs from the NameNode at regular intervals and applies them to the fsimage.
      2. Once it has a new fsimage, it copies it back to the NameNode.
      3. The NameNode uses this fsimage at the next restart, which reduces the startup time.

      Things have changed over the years, especially with Hadoop 2.x. The NameNode can now be highly available, with a failover feature. The Secondary NameNode is now optional, and a Standby NameNode is used for the failover process. The Standby NameNode stays up to date with all the file system changes the Active NameNode makes.
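
      The three steps above can be sketched as a round trip between two toy objects. The class, method, and path names are hypothetical and the data structures are simplified stand-ins; the real daemons exchange files over the network rather than Python objects.

```python
class ToyNameNode:
    def __init__(self):
        self.fsimage = {}  # last checkpointed snapshot: path -> size
        self.edits = []    # changes logged since that snapshot

    def create(self, path, size):
        self.edits.append(("create", path, size))

    def delete(self, path):
        self.edits.append(("delete", path, 0))

    def hand_over(self):
        """Step 1: give the secondary the snapshot and the edits logged so far."""
        return dict(self.fsimage), list(self.edits)

    def install(self, new_fsimage, merged_count):
        """Steps 2-3: accept the merged snapshot and drop the edits it covers."""
        self.fsimage = new_fsimage
        self.edits = self.edits[merged_count:]


def merge(fsimage, edits):
    """The secondary's job: apply the edits to its copy of the snapshot."""
    merged = dict(fsimage)
    for op, path, size in edits:
        if op == "create":
            merged[path] = size
        else:
            merged.pop(path, None)
    return merged


nn = ToyNameNode()
nn.create("/logs/1", 10)
nn.create("/logs/2", 20)
nn.delete("/logs/1")

snapshot, edits = nn.hand_over()
nn.create("/logs/3", 30)  # the NameNode keeps logging while the merge runs
nn.install(merge(snapshot, edits), len(edits))

print(nn.fsimage)     # {'/logs/2': 20}
print(len(nn.edits))  # 1: only the edit made during the merge remains
```

      Only the edits that were part of the merge are discarded, so a change logged while the checkpoint was in progress is still replayed at the next restart.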

      Follow the link to learn more about HDFS Architecture

    • #5766
      DataFlair Team
      Spectator

      The Secondary NameNode is a helper node in Hadoop. To understand what it does, let's first look at how the NameNode works.

      The NameNode stores metadata such as file system namespace information and block information in memory. It also stores a persistent copy of the same on disk. The NameNode stores this information in two files.

      fsimage: a snapshot of the file system; it stores information such as modification time, access time, permissions, and replication.
      Edit logs: store the details of all the activities/transactions performed on HDFS.

      While the NameNode is active, the edit logs grow continuously, because the NameNode itself applies them to the fsimage only at restart, to obtain the latest state of HDFS. If the edit logs have grown significantly by the time the NameNode restarts and tries to apply them to the fsimage, the process can take a very long time. This is where the Secondary NameNode comes into play.

      The Secondary NameNode performs checkpoints for the NameNode. At a fixed interval, it reads the edit logs from the NameNode and applies them to its own copy of the fsimage. In this way the fsimage file holds the most recent state of HDFS.
      The Secondary NameNode then copies the new fsimage to the primary, so the primary's fsimage is updated.

      Since the fsimage is up to date, there is little overhead from replaying edit logs when the cluster restarts.
      The Secondary NameNode is a helper node and cannot replace the NameNode.
