Free Online Certification Courses – Learn Today. Lead Tomorrow. › Forums › Apache Hadoop › What is hadoop archive
- This topic has 2 replies, 1 voice, and was last updated 5 years, 7 months ago by DataFlair Team.
September 20, 2018 at 5:22 pm #6210 DataFlair Team (Spectator)
What is an archive in Hadoop? Explain briefly.
How do you create an archive in Hadoop, and what is the need for it?
September 20, 2018 at 5:22 pm #6212 DataFlair Team (Spectator)
A Hadoop archive (HAR) is a facility that packs many small files into larger HDFS units to avoid wasting NameNode memory. The NameNode keeps the metadata for every file and block in HDFS in memory, so if 1 GB of data is stored as 1,000 small files, the NameNode must hold metadata entries for all 1,000 of them. In that way, NameNode memory is wasted on storing and managing a large amount of metadata.
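A back-of-envelope sketch of the memory cost described above, assuming the commonly cited rule of thumb of roughly 150 bytes of NameNode heap per metadata object (this figure is an approximation, not an exact number):

```shell
# Rough NameNode heap estimate. ~150 bytes per metadata object is a
# commonly cited rule of thumb, not an exact figure.
BYTES_PER_OBJECT=150
SMALL_FILES=1000
# each small file costs roughly one file object plus one block object
SMALL_TOTAL=$(( SMALL_FILES * 2 * BYTES_PER_OBJECT ))
# the same data in a HAR with a single part file costs only a handful of
# objects (archive dir, _index, _masterindex, part-0 and its block)
HAR_TOTAL=$(( 6 * BYTES_PER_OBJECT ))
echo "$SMALL_TOTAL vs $HAR_TOTAL bytes"   # prints: 300000 vs 900 bytes
```

The point of the arithmetic is only the order of magnitude: thousands of small files cost the NameNode hundreds of kilobytes of heap, while the archived layout costs well under a kilobyte.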
A HAR is created from a collection of files, and the archiving tool runs a MapReduce job; its map tasks process the input files in parallel to build the archive.
HAR command:
hadoop archive -archiveName myhar.har -p /input/location /output/location
September 20, 2018 at 5:22 pm #6213 DataFlair Team (Spectator)
Hadoop is designed to deal with large files, so large numbers of small files are problematic and need to be handled efficiently.
Because each input file is split into blocks and stored across the DataNodes, the metadata for this huge number of files and blocks must all be held in the NameNode's memory, which makes the NameNode inefficient when there are many small files. To handle this problem, Hadoop Archives (HAR) were created: they pack HDFS files into archives, and these archives can be used directly as input to MapReduce jobs. An archive always has the *.har extension.
HAR Syntax:
hadoop archive -archiveName NAME -p <parent path> <src>* <dest>
Example:
hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo
If you have a Hadoop archive stored in HDFS at /user/zoo/foo.har, then to use this archive as MapReduce input, all you need to do is specify the input directory as har:///user/zoo/foo.har.
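A quick way to check what the archive exposes, assuming a running HDFS cluster and the foo.har example above (the file name inside dir1 is hypothetical):

```shell
# list the logical contents of the archive through the har:// scheme
hadoop fs -ls har:///user/zoo/foo.har
hadoop fs -ls har:///user/zoo/foo.har/dir1
# read an archived file directly (somefile.txt is a hypothetical name)
hadoop fs -cat har:///user/zoo/foo.har/dir1/somefile.txt
```

With the har:// scheme the archive looks like an ordinary directory tree, which is why it can be handed straight to a MapReduce job as an input path.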
If we list the archive file:
$ hadoop fs -ls /data/myArch.har
/data/myArch.har/_index
/data/myArch.har/_masterindex
/data/myArch.har/part-0
The part files are the original files concatenated together into big files, and the index files are used to look up the small files inside the big part files.
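A toy model of that part/index idea using plain local files (this illustrates the lookup scheme only, not the real HAR on-disk format): concatenate two small files into one part file, record offsets in an index, then recover one file using the index alone.

```shell
# toy part/index layout with local files, not actual HAR internals
tmp=$(mktemp -d); cd "$tmp"
printf 'hello'  > a.txt                    # 5 bytes at offset 0
printf 'world!' > b.txt                    # 6 bytes at offset 5
cat a.txt b.txt > part-0                   # "big" concatenated part file
printf 'a.txt 0 5\nb.txt 5 6\n' > _index   # columns: name offset length
# look up b.txt in the index and cut it back out of the part file
read -r name off len <<< "$(grep '^b.txt ' _index)"
dd if=part-0 bs=1 skip="$off" count="$len" 2>/dev/null   # prints: world!
```

This is why a HAR adds almost no NameNode load: only the part and index files exist as HDFS objects, while the small files survive as byte ranges found via the index.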
Limitations of HAR Files:
1) Creating a HAR makes a copy of the original files, so we need as much additional disk space as the total size of the files being archived. The originals can be deleted after the archive is created to release that disk space.
2) Once an archive is created, it cannot be modified; to add or remove files, we must re-create the archive.
3) A MapReduce job over a HAR still processes each original small file separately, so it can require a large number of map tasks, which is inefficient.
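Because of limitation 2, the usual workaround for changing an archive is to copy its contents back out, adjust them, and archive again. A sketch with distcp, assuming the foo.har example above and a running cluster (the restored-directory name is hypothetical):

```shell
# expand the archive contents back into ordinary HDFS files in parallel
hadoop distcp har:///user/zoo/foo.har/dir1 /user/hadoop/dir1-restored
# ...add or remove files under dir1-restored, then re-create the archive
hadoop archive -archiveName foo2.har -p /user/hadoop dir1-restored /user/zoo
```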