Hadoop archives (HAR files) were created to deal with the small files problem.
Small files problem:
Hadoop is designed to deal with large files, and the namenode keeps the metadata of every file in its memory.
But if we give it small files, and a large number of them, the namenode has to keep a record for each file and each block, which bloats its memory and makes it inefficient.
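To get a feel for the scale of the problem, here is a rough back-of-envelope calculation; the figure of ~150 bytes of namenode heap per object (file or block) is a commonly quoted rule of thumb, not a measured value:

```shell
# Rough namenode memory estimate for many small files.
# ~150 bytes of heap per namenode object (file or block) is an
# assumed rule of thumb, not an exact figure.
BYTES_PER_OBJECT=150
NUM_FILES=10000000              # ten million small files
OBJECTS=$((NUM_FILES * 2))      # one file object + one block object each
echo $((OBJECTS * BYTES_PER_OBJECT))   # bytes of namenode heap, about 3 GB
```

Ten million small files already cost around 3 GB of namenode heap just for metadata, which is why packing them into an archive helps.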
So, to tackle this problem, we create Hadoop archives. A Hadoop archive packs HDFS files into a single archive, and we can use these archives directly as input to MapReduce jobs.
The command to create an archive is:
$ hadoop archive -archiveName myArch.har -p /data /data
Here -p /data names the parent directory whose files are archived, and the last argument is the destination directory; the command launches a MapReduce job to build the archive.
If we list the archive file:
$ hadoop fs -ls /data/myArch.har
we see part files and index files: the part files are the original small files concatenated together into big files, and the index files (_index and _masterindex) are used to look up each small file within a big part file.
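The files inside an archive can also be addressed through the har:// filesystem scheme, which exposes the archived directory tree as if it were ordinary HDFS. A short sketch; the file name file1.txt is a hypothetical example:

```shell
# List the original files through the har:// filesystem;
# paths inside the archive mirror the archived directory tree.
hadoop fs -ls har:///data/myArch.har

# Read one file from inside the archive
# (file1.txt is a hypothetical file name, assumed for illustration).
hadoop fs -cat har:///data/myArch.har/file1.txt
```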
But even Hadoop archive files have a few limitations:
These files are immutable, i.e., once you archive files, the archive cannot be modified; to add or remove files you have to recreate it.
They take up as much space as the original files, since archives are not compressed.
When they are given as input to MapReduce jobs, the small files inside are still processed individually, each by its own mapper, which is inefficient.
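Despite these limitations, an archive can be fed straight to a job by its har:// path. A hedged sketch using the stock wordcount example; the jar path varies by distribution and is an assumption here, as is the output directory:

```shell
# Run the bundled wordcount example over the archive contents.
# hadoop-mapreduce-examples.jar is an assumed path (it differs
# between distributions); /wordcount-out is a hypothetical output dir.
hadoop jar hadoop-mapreduce-examples.jar wordcount \
    har:///data/myArch.har /wordcount-out
```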