Explain Hadoop Archives?

    • #5744
      DataFlair Team

      What are Hadoop Archives used for?
      How do Hadoop Archives (HAR) deal with the small files issue?
      What are Hadoop Archive (HAR) files?
      What are Archives in Hadoop? Explain.

    • #5746
      DataFlair Team

      Hadoop is best suited for large files. All file metadata (creation date, block information, owner, etc.) is stored in the memory of the NameNode. If there are a large number of small files, the NameNode has to hold metadata for every one of them, which is inefficient: it occupies more NameNode memory, and more seeks happen while fetching information about those files. To avoid this, Hadoop Archives (HAR) combine/concatenate all the small files under a single archive file. A MapReduce job is used to process the input files and create the archive.

      Command:
      hadoop archive -archiveName archname.har -p /parentdir sourcedir /destdir

      The archive name must end in *.har.
      The source directories/files are given relative to the parent directory (-p), and the destination directory is where the *.har archive is stored.
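
      For example, to archive two (hypothetical) input directories /user/dataflair/input1 and /user/dataflair/input2 into files.har under /user/dataflair/archives, one could run:

      hadoop archive -archiveName files.har -p /user/dataflair input1 input2 /user/dataflair/archives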

      To list the physical contents of the *.har archive,

      hadoop fs -ls -R /destdir/archname.har

      which lists files such as:

      /destdir/archname.har/_index
      /destdir/archname.har/_masterindex
      /destdir/archname.har/part-0

      Here the index files (_index and _masterindex) contain per-file information such as the offset and length of each archived file, and the part-0 file contains the concatenated data of all the individual files.
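
      To peek at the index (its exact format is internal and may differ across Hadoop versions):

      hadoop fs -cat /destdir/archname.har/_index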

      Advantage

      A *.har file can be used as input to a MapReduce job.
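
      Archived files can also be read through the har:// URI scheme without expanding the archive. For instance, continuing the hypothetical example above (the example-jar path below is illustrative and depends on the installation):

      # List the logical files inside the archive
      hadoop fs -ls -R har:///user/dataflair/archives/files.har

      # Run the bundled wordcount example over files inside the archive
      hadoop jar hadoop-mapreduce-examples.jar wordcount har:///user/dataflair/archives/files.har/input1 /user/dataflair/wc-out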

      To delete a *.har file, we should use the recursive option:

      hadoop fs -rm -r /destdir/archname.har

      Limitations are:
      1) Creating a HAR makes a copy of the original files, so the archive takes the same space as the originals (the data is not compressed). Hence, more disk space is utilized.
      We can delete the original files once the HAR is created.

      Once a HAR file is created, it is immutable. No addition/deletion of files can be done.

      2) HAR files can be used as input to MapReduce jobs, but MapReduce still takes the archived files as individual inputs. Hence, processing a large number of small files, even inside a HAR, requires a lot of resources and remains inefficient.

    • #5748
      DataFlair Team

      What is a small file and how does it get generated?

      Files which are much smaller than the HDFS block size are called small files.
      There are mainly two reasons why small files are produced. One reason is that some files are pieces of a larger logical file (e.g. log files). The other reason is that some files cannot be combined into one larger file and are essentially small, e.g. a large corpus of images where each image is a distinct file.

      Why HDFS struggles with the small file problem:

      HDFS gives the programmer virtually unlimited storage, but storing lots of small files is a big problem. HDFS is capable of handling a small number of large files, but not a large number of small files, for the following reasons (see the rough estimate after this list):

      1. A block holds data from only a single file, so many small files result in a lot of blocks smaller than the configured block size, and reading that many blocks is very time-consuming.

      2. The NameNode keeps a record of every file and block and holds this metadata in memory, so a large number of files requires more memory.
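
      As a rough estimate (a commonly cited rule of thumb, not an exact figure), each file and each block object takes about 150 bytes of NameNode heap. Ten million small files, each occupying one block, would therefore need roughly 10,000,000 × 2 × 150 bytes ≈ 3 GB of NameNode memory just for metadata.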

      Techniques to resolve the small file problem:

      HAR (Hadoop Archive):

      Hadoop Archive, as the name indicates, is based on an archiving technique which packs a number of small files into HDFS blocks more efficiently. Files in a HAR can be accessed directly, without expanding the archive; the archive's index is used to locate each file.
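
      For example, an individual file inside an archive can be read in place through the har:// scheme (the paths here are hypothetical):

      hadoop fs -cat har:///user/dataflair/archives/files.har/input1/sample.log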

      Creating a HAR reduces the storage overhead of metadata on the NameNode, and the reduced number of map operations in a MapReduce program increases performance.

      Limitations of the HAR file layout:

      Even though HAR solves the problem of storing small files, it comes with some limitations:

      1. Accessing a file requires two index-file read operations as well as one data-file read operation.

      2. Reading files in a HAR is less efficient and slower than reading files directly from HDFS.
      3. Upgrading the HAR layout requires changes to the HDFS architecture, which may be difficult.
