This topic contains 1 reply, has 1 voice, and was last updated by  dfbdteam3 1 year ago.

Viewing 2 posts - 1 through 2 (of 2 total)
  • Author
    Posts
  • #4834

    dfbdteam3
    Moderator

     

    How do we create a Hadoop archive (.har)? When should we create a Hadoop archive?

     

    #4837

    dfbdteam3
    Moderator

    Hadoop archives are created to deal with the small files problem.

    The small files problem:
    Hadoop is designed to handle large files, and the NameNode keeps the metadata of every file in its memory.
    If we instead store a large number of small files, the NameNode has to keep too many records in memory, which makes it inefficient.
    To tackle this problem, we create Hadoop archives. A Hadoop archive (HAR) packs many HDFS files into a single archive, and we can use these archives directly as input to MapReduce jobs.

    The command for it is (here /user/input is the directory whose files we are archiving, and /data is where the archive is written):

    $ hadoop archive -archiveName myArch.har -p /user/input /data
    
    If we list the archive directory:
    $ hadoop fs -ls /data/myArch.har
    
     /data/myArch.har/_index
     /data/myArch.har/_masterindex
     /data/myArch.har/part-0

    where the part files are the original small files concatenated into larger files, and the index files are used to look up each small file's location within the part files.
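    The archived files can be read back through the har:// filesystem scheme without extracting anything. A small sketch, assuming the archive sits at /data/myArch.har (somefile.txt is a hypothetical file name):

    ```shell
    # List the original files stored inside the archive
    # (not the _index / _masterindex / part-0 internals)
    hadoop fs -ls har:///data/myArch.har

    # Read one of the archived files directly through the HAR filesystem
    # somefile.txt is only an example name, not part of the original post
    hadoop fs -cat har:///data/myArch.har/somefile.txt
    ```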

    However, Hadoop archive files have a few limitations:

    • Archives are immutable, i.e., once created they cannot be modified; to add or remove files you must re-create the archive.
    • They take as much space as the original files, since HAR does not compress the data.
    • When a HAR is given as input to a MapReduce job, the small files inside it are still processed individually by mappers, which remains inefficient.
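    To illustrate the last point, a HAR can be passed to a job like any other HDFS path via the har:// scheme. A minimal sketch using the stock wordcount example (the jar location and archive path are assumptions, not from the original post):

    ```shell
    # Run the bundled wordcount example over the archive;
    # the job reads the archived files as input, but each small
    # file still becomes its own input split for a mapper
    hadoop jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
      wordcount har:///data/myArch.har /output/wordcount
    ```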
