When and how to create a Hadoop archive

    • #4834
      DataFlair Team
      Spectator

       

      How do we create a Hadoop Archive (.har)? When should we create a Hadoop archive?

       

    • #4837
      DataFlair Team
      Spectator

      Hadoop archives are created to deal with the small files problem.

      Small files problem:
      Hadoop is designed to deal with large files, and the namenode keeps the metadata of every file and block in its memory.
      If we instead store a very large number of small files, the namenode has to hold a record for each one of them, which makes it inefficient.
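      To get a feel for the scale, here is a rough back-of-envelope calculation. The figure of roughly 150 bytes of namenode heap per file object and per block object is a commonly cited rule of thumb, not an exact number:

      # Assumed: ~150 bytes of namenode heap per file object plus ~150 per block object.
      # 10 million small files, one block each:
      $ echo $(( 10000000 * (150 + 150) ))
      3000000000

      That is about 3 GB of namenode heap consumed by metadata alone, before storing a single byte of actual data.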
      So, to tackle this problem, we create Hadoop archives. A Hadoop archive packs HDFS files into a single archive, and we can use these archive files directly as input to MapReduce jobs.

      The command for it is (here everything under /data is archived, and the archive itself is also written to /data):

      $ hadoop archive -archiveName myArch.har -p /data /data
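      The general form of the command, as documented for the archive tool, is (the -r replication option is available in recent Hadoop versions):

      $ hadoop archive -archiveName <name>.har -p <parent> [-r <replication>] <src>* <dest>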
      
      If we list the archive file:

      $ hadoop fs -ls /data/myArch.har

      /data/myArch.har/_index
      /data/myArch.har/_masterindex
      /data/myArch.har/part-0

      where the part files contain the contents of the original small files concatenated into larger files, and the index files are used to look up a small file's location within the part files.
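      Unlike the raw listing above, listing the archive through the har:// filesystem scheme shows the original files it contains:

      $ hadoop fs -ls har:///data/myArch.har

      The _index file itself is plain text, so it can also be inspected directly (its exact format varies between Hadoop versions):

      $ hadoop fs -cat /data/myArch.har/_index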

      But even Hadoop archive files have a few limitations:

      • These files are immutable, i.e., once you create an archive, it cannot be modified; to add or remove files you have to re-create it (see the sketch after this list).
      • They take up as much space as the original files; the archive tool does not compress its contents.
      • When they are given as input to MapReduce jobs, the small files inside the archive are still processed individually by mappers, which is inefficient.
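      Because archives are immutable, the usual way to change one is to unpack it and re-create it. A minimal sketch of unarchiving in parallel with distcp, assuming the archive created above (the destination path /data/unarchived is just an example):

      $ hadoop distcp har:///data/myArch.har /data/unarchived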