Forums › Apache Hadoop › When and how to create hadoop archive
September 20, 2018 at 12:25 pm #4834 by DataFlair Team (Spectator)
How do we create a Hadoop archive (.har)? When should we create one?
September 20, 2018 at 12:25 pm #4837 by DataFlair Team (Spectator)
Hadoop archives are created to deal with the small files problem.
Small files problem:
Hadoop is designed for large files, and the namenode keeps the metadata of every file and block in its memory.
But if we give it small files, and a large number of them, the namenode has to keep one record per file and per block in memory, which wastes heap and makes it inefficient.
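To see why this matters, here is a rough back-of-the-envelope sketch. It assumes the commonly cited rule of thumb of about 150 bytes of namenode heap per namespace object (file or block); the exact figure varies by Hadoop version, so treat the numbers as illustrative only:

```python
BYTES_PER_OBJECT = 150  # assumed rule of thumb, not an exact figure

def namenode_memory_bytes(num_files, num_blocks):
    # The namenode keeps one in-memory object per file plus one per block.
    return (num_files + num_blocks) * BYTES_PER_OBJECT

# The same 10 TiB of data stored two ways (128 MiB HDFS block size):
# (a) 10 million 1 MiB files -- each file still occupies its own block
small = namenode_memory_bytes(10_000_000, 10_000_000)
# (b) 100 large files -- 10 TiB / 128 MiB = 81,920 blocks in total
large = namenode_memory_bytes(100, 81_920)

print(f"small files: {small / 1e9:.1f} GB of namenode heap")  # 3.0 GB
print(f"large files: {large / 1e6:.1f} MB of namenode heap")  # 12.3 MB
```

The data is identical in both cases; only the file count changes, and the metadata footprint differs by a factor of hundreds.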
So, to tackle this problem, we create Hadoop archives. A Hadoop archive (HAR) packs many HDFS files into a single archive, and we can use the archive directly as input to MapReduce jobs. The command syntax is hadoop archive -archiveName <name> -p <parent dir> <src>* <dest>, for example:

$ hadoop archive -archiveName myArch.har -p /data/input /data

If we list the resulting archive directory:

$ hadoop fs -ls /data/myArch.har
/data/myArch.har/_index
/data/myArch.har/_masterindex
/data/myArch.har/part-0
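The files inside an archive stay addressable through the har:// URI scheme, so jobs and shell commands can read them without unpacking. A sketch, assuming an archive at /data/myArch.har containing a hypothetical file1.txt:

```shell
# List the original files inside the archive (not the _index/part files)
hadoop fs -ls har:///data/myArch.har

# Read one file back out of the archive
hadoop fs -cat har:///data/myArch.har/file1.txt

# Copy a file out of the archive; this is also how you "unarchive"
hadoop fs -cp har:///data/myArch.har/file1.txt /data/restored/
```

There is no dedicated unarchive command; copying out of a har:// path (or distcp for large archives) is the standard way to extract files.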
Here the part files are the original small files concatenated together into big files, and the index files are used to look up where each small file lives inside the part files.
But even Hadoop archive files have a few limitations:
- They are immutable: once you archive files, the archive cannot be modified; to add or remove files you have to re-create it.
- They take as much space as the original files, since archiving does not compress the data.
- When an archive is given as input to a MapReduce job, each small file inside it is still processed by its own mapper, which is inefficient.
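A quick way to see the first two limitations in practice (paths as in the hypothetical example above):

```shell
# The archive occupies roughly the same space as the originals,
# since HAR adds no compression:
hadoop fs -du -s /data/input        # total size of the source files
hadoop fs -du -s /data/myArch.har   # about the same size

# Archives are immutable, so "updating" one means rebuilding it:
hadoop fs -rm -r /data/myArch.har
hadoop archive -archiveName myArch.har -p /data/input /data
```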