Hadoop archives (HAR files) were created to deal with the small files problem.
Small files problem:
Hadoop is designed to deal with large files, and the namenode keeps the metadata of every file in its memory.
But if we give it small files, and a large number of them, the namenode has to keep a record for each file and each block, which bloats its memory and makes it inefficient.
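To get a feel for the scale of the problem, here is a rough back-of-envelope calculation; the figure of ~150 bytes of namenode heap per object (file or block) is a commonly quoted rule of thumb, not a measured value:

```shell
# Rough namenode memory estimate for many small files.
# ~150 bytes of heap per namenode object (file or block) is an
# assumed rule of thumb, not an exact figure.
BYTES_PER_OBJECT=150
NUM_FILES=10000000              # ten million small files
OBJECTS=$((NUM_FILES * 2))      # one file object + one block object each
echo $((OBJECTS * BYTES_PER_OBJECT))   # bytes of namenode heap, about 3 GB
```

Ten million small files already cost around 3 GB of namenode heap just for metadata, which is why packing them into an archive helps.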
So, to tackle this problem, we create Hadoop archives. A Hadoop archive packs HDFS files into a single archive, and we can use these archives directly as input to MapReduce jobs.
The command to create an archive is:
$ hadoop archive -archiveName myArch.har -p /data /data
Here -p /data names the parent directory whose files are archived, and the last argument is the destination directory; the command launches a MapReduce job to build the archive.
If we list the archive file:
$ hadoop fs -ls /data/myArch.har
we see part files and index files: the part files are the original small files concatenated together into big files, and the index files (_index and _masterindex) are used to look up each small file within a big part file.
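The files inside an archive can also be addressed through the har:// filesystem scheme, which exposes the archived directory tree as if it were ordinary HDFS. A short sketch; the file name file1.txt is a hypothetical example:

```shell
# List the original files through the har:// filesystem;
# paths inside the archive mirror the archived directory tree.
hadoop fs -ls har:///data/myArch.har

# Read one file from inside the archive
# (file1.txt is a hypothetical file name, assumed for illustration).
hadoop fs -cat har:///data/myArch.har/file1.txt
```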
But even Hadoop archive files have a few limitations:
These files are immutable, i.e., once you archive files, the archive cannot be modified; to add or remove files you have to recreate it.
They take up as much space as the original files, since archives are not compressed.
When they are given as input to MapReduce jobs, the small files inside are still processed individually, each by its own mapper, which is inefficient.
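Despite these limitations, an archive can be fed straight to a job by its har:// path. A hedged sketch using the stock wordcount example; the jar path varies by distribution and is an assumption here, as is the output directory:

```shell
# Run the bundled wordcount example over the archive contents.
# hadoop-mapreduce-examples.jar is an assumed path (it differs
# between distributions); /wordcount-out is a hypothetical output dir.
hadoop jar hadoop-mapreduce-examples.jar wordcount \
    har:///data/myArch.har /wordcount-out
```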