How do Hadoop Archives (HAR) deal with the small files issue?


    • #6134
      DataFlair Team

      What are Hadoop Archives used for?
      Explain Hadoop Archives.

    • #6137
      DataFlair Team

      Hadoop Archives are used to solve the small files problem, where a large number of input files are each much smaller than the HDFS block size.

      Hadoop Archives combine a large number of small input files into one archive file. The archive always has a .har extension.

      The contents of a .har file are:
      _masterindex
      _index
      part files

      Part files are the original files concatenated together into big files. Index files are lookup files used to locate the individual small files inside the big part files.

      Instead of a large number of small files, this single archive file is fed to the MapReduce job, improving performance in data processing; it also reduces the number of files the NameNode must track, which is the root of the small files problem.
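
      As a sketch, the archive created below (myhar.har) could be supplied as job input through the har:// filesystem scheme; the examples jar name and output path here are illustrative:

      hadoop jar hadoop-mapreduce-examples.jar wordcount har:///output/location/myhar.har /user/dataflair/wordcount-out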

      A .har file can be created using the command below:

      hadoop archive -archiveName NAME -p <parent path> [-r <replication factor>] <src>* <dest>
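
      For example, to archive the directories dir1 and dir2 under a parent path into myhar.har with a replication factor of 3 (all paths here are illustrative):

      hadoop archive -archiveName myhar.har -p /user/dataflair/input -r 3 dir1 dir2 /output/location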

      Once a .har file is created, you can do a listing on it, and you will see that it contains the index files and part files.

      hadoop fs -ls /output/location/myhar.har

      /output/location/myhar.har/_index
      /output/location/myhar.har/_masterindex
      /output/location/myhar.har/part-0
      You can run the cat command on a part file to see the concatenated contents of all the input files.
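
      For instance, continuing with the illustrative paths above, the part file can be read directly, and the individual archived files can be accessed through the har:// filesystem scheme (dir1/file1.txt is a hypothetical archived file):

      # dump the raw concatenated contents of the part file
      hadoop fs -cat /output/location/myhar.har/part-0
      # list the original files as the archive exposes them
      hadoop fs -ls har:///output/location/myhar.har
      # read one archived file by its original name (hypothetical)
      hadoop fs -cat har:///output/location/myhar.har/dir1/file1.txt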
