What is Hadoop archive

Viewing 2 reply threads
  • Author
    Posts
    • #6210
      DataFlair Team
      Spectator

      What is an archive in Hadoop? Explain briefly.
      How do we create an archive in Hadoop? What is the need for it?

    • #6212
      DataFlair Team
      Spectator

      Hadoop archive (HAR) is a facility that packs small files into larger, more compact files stored in HDFS, to avoid wasting NameNode memory. The NameNode stores the metadata of all HDFS data, so if 1 GB of data is broken into 1,000 small pieces, the NameNode has to store metadata about each of those 1,000 small files. In that manner, NameNode memory is wasted on storing and managing a large amount of metadata.
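      As a rough back-of-the-envelope sketch of why this matters, assuming the commonly cited rule of thumb of about 150 bytes of NameNode heap per HDFS object (one entry per file plus one per block; an approximation, not an exact figure):

      ```shell
      # Rule-of-thumb estimate only: ~150 bytes of NameNode heap per
      # HDFS object (one entry per file, one per block) -- an assumed
      # approximation, not an exact figure.
      BYTES_PER_OBJECT=150
      SMALL_FILES=1000

      # 1000 small files => 1000 file entries + 1000 block entries:
      echo "$(( SMALL_FILES * 2 * BYTES_PER_OBJECT )) bytes for 1000 small files"

      # The same data archived into a HAR leaves only a handful of
      # objects (the part file, the two index files, the archive itself):
      echo "$(( 5 * 2 * BYTES_PER_OBJECT )) bytes after archiving"
      ```

      The absolute numbers are illustrative; the point is that NameNode memory scales with the number of objects, not with the volume of data.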

      A HAR is created from a collection of files, and the archiving tool runs a MapReduce job. Its map tasks process the input files in parallel to create the archive file.

      HAR command

      hadoop archive -archiveName myhar.har /input/location /output/location
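      For reference, recent Hadoop releases require a -p (parent path) flag on this command; a sketch of the equivalent modern invocation, using the same illustrative paths as above:

      ```shell
      # Equivalent form with the -p (parent path) flag used by recent
      # Hadoop releases; /input/location and /output/location are the
      # illustrative paths from the command above.
      hadoop archive -archiveName myhar.har -p /input/location /output/location

      # The command launches a MapReduce job; afterwards the archive
      # shows up in HDFS as a directory-like .har file:
      hadoop fs -ls /output/location/myhar.har
      ```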

    • #6213
      DataFlair Team
      Spectator

      Hadoop is designed to deal with large files, so a large number of small files is problematic and has to be handled efficiently.

      Since a large input file is split into a number of blocks stored across the data nodes, a huge number of small files means a huge number of metadata records to be stored in the NameNode, which makes the NameNode inefficient. To handle this problem, Hadoop Archive (HAR) was created: it packs HDFS files into archives, and we can use these archives directly as input to MapReduce jobs. An archive always comes with the *.har extension.

      HAR Syntax:
      hadoop archive -archiveName NAME -p <parent path> <src>* <dest>

      Example:
      hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo

      If you have a Hadoop archive stored in HDFS at /user/zoo/foo.har, then to use this archive as MapReduce input, all you need to do is specify the input directory as har:///user/zoo/foo.har.
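      For instance, assuming foo.har was created as in the example above (all paths here are illustrative):

      ```shell
      # Browse the logical contents of the archive through the har:// scheme:
      hadoop fs -ls har:///user/zoo/foo.har/dir1

      # Feed the archive straight into a MapReduce job (wordcount from
      # the bundled examples jar is used purely as an illustration):
      hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
          wordcount har:///user/zoo/foo.har/dir1 /user/zoo/wc-out
      ```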

      If we list the archive file:

      $ hadoop fs -ls /data/myArch.har

       /data/myArch.har/_index
       /data/myArch.har/_masterindex
       /data/myArch.har/part-0

      The part files contain the original small files concatenated together into large files, and the index files are used to look up each small file within the big part files.

      Limitations of HAR Files:

      1) Creating a HAR makes a copy of the original files, so we need as much additional disk space as the size of the files being archived. We can delete the original files after the archive is created to release that disk space.
      2) Archives are immutable: to add or remove files, we must re-create the archive.
      3) Processing a HAR with MapReduce is still inefficient for many small files, since each original small file is processed by its own map task; archiving does not reduce the number of map tasks.
