How do Hadoop Archives (HAR) deal with the small-files issue?
September 20, 2018 at 5:11 pm — DataFlair Team
What are Hadoop Archives used for? Explain Hadoop Archives.
September 20, 2018 at 5:11 pm — DataFlair Team
Hadoop Archives are used to solve the small-files problem, in which a large number of input files are much smaller than the HDFS block size. A Hadoop Archive combines many small input files into a single archive file, which always has the .har extension.
The contents of a .har file are:
_masterindex
_index
part files
Part files are the original files concatenated together into large files. The _index and _masterindex files are lookup files used to locate each individual small file inside the part files.
Instead of a large number of small files, this single archive file is fed to the MapReduce job, improving data-processing performance.
A HAR file can be created using the command below:
hadoop archive -archiveName NAME -p <parent path> [-r <replication factor>] <src>* <dest>
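As a concrete sketch (the paths here are hypothetical and assume a running HDFS cluster), archiving everything under an input directory might look like this. Note that `hadoop archive` launches a MapReduce job to build the archive:

```shell
# Hypothetical example: archive the "input" directory, whose parent is
# /user/hadoop, and write the archive into /user/hadoop/output.
# -p sets the parent path that the <src> arguments are relative to,
# so this creates /user/hadoop/output/myhar.har containing input/*.
hadoop archive -archiveName myhar.har -p /user/hadoop input /user/hadoop/output
```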
Once the .har file is created, you can do a listing on it and see that it contains the index files and part files:
hadoop fs -ls /output/location/myhar.har
/output/location/myhar.har/_index
/output/location/myhar.har/_masterindex
/output/location/myhar.har/part-0
Running the cat command on a part file shows the concatenated contents of all the input files it holds.
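For example, assuming the archive from the listing above exists at /output/location/myhar.har (an illustrative path, and a cluster is required), you can inspect the raw part file, or use the har:// URI scheme to address the archived files under their original names:

```shell
# View the raw concatenated contents of a part file.
hadoop fs -cat /output/location/myhar.har/part-0

# List the archived files individually via the har:// filesystem scheme
# (triple slash means "on the default filesystem").
hadoop fs -ls har:///output/location/myhar.har
```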