How do Hadoop Archives (HAR) deal with the small files issue?


    • #6134
      DataFlair Team

      What are Hadoop Archives used for?
      Explain Hadoop Archives.

    • #6137
      DataFlair Team

      Hadoop Archives are used to solve the small files problem, where a large number of input files are each much smaller than the HDFS block size.

      Hadoop Archives combine a large number of small input files into one archive file. The archive always has a .har extension.

      The contents of a .har file are:
      _masterindex
      _index
      part files

      Part files are the original files concatenated together into big files. Index files are lookup files used to locate the individual small files inside the big part files.

      Instead of a large number of small files, this single archive file is fed to the MapReduce job, improving performance in data processing; it also reduces the number of files the NameNode must track, which is the root of the small files problem.
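
      As a sketch, the archive created below (myhar.har) could be supplied as job input through the har:// filesystem scheme; the examples jar name and output path here are illustrative:

      hadoop jar hadoop-mapreduce-examples.jar wordcount har:///output/location/myhar.har /user/dataflair/wordcount-out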

      A .har file can be created using the command below:

      hadoop archive -archiveName NAME -p <parent path> [-r <replication factor>] <src>* <dest>
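
      For example, to archive the directories dir1 and dir2 under a parent path into myhar.har with a replication factor of 3 (all paths here are illustrative):

      hadoop archive -archiveName myhar.har -p /user/dataflair/input -r 3 dir1 dir2 /output/location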

      Once a .har file is created, you can do a listing on it, and you will see that it contains the index files and part files.

      hadoop fs -ls /output/location/myhar.har

      /output/location/myhar.har/_index
      /output/location/myhar.har/_masterindex
      /output/location/myhar.har/part-0
      You can run the cat command on a part file to see the concatenated contents of all the input files.
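
      For instance, continuing with the illustrative paths above, the part file can be read directly, and the individual archived files can be accessed through the har:// filesystem scheme (dir1/file1.txt is a hypothetical archived file):

      # dump the raw concatenated contents of the part file
      hadoop fs -cat /output/location/myhar.har/part-0
      # list the original files as the archive exposes them
      hadoop fs -ls har:///output/location/myhar.har
      # read one archived file by its original name (hypothetical)
      hadoop fs -cat har:///output/location/myhar.har/dir1/file1.txt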
