How to resolve the small file problem in Hadoop HDFS?


  • Author
    Posts
    • #5574
      DataFlair Team
      Spectator

      The small file problem is a well-known issue in Hadoop. Hadoop is not optimized to handle/store small files, but if we want to store millions of small files, how do we handle them in HDFS?
      How can we resolve the small file problem in Hadoop?

    • #5575
      DataFlair Team
      Spectator

      Problems with HDFS:

      With lots of small files, you need to maintain lots of metadata in your Namenode, whose memory is limited in size.
      The more small files you have, the longer it takes to restart the cluster.
      Every file, directory, and block occupies about 150 bytes, as a rule of thumb. Hadoop is designed for streaming access to large files; with small files there is a lot of seeking and hopping between data nodes.
      Problems with MapReduce:

      A map task processes one block of input at a time. Lots of small files lead to just as many map tasks, which slows the cluster down.

      Solution:
      We group the small files into larger files. For that, we can use HDFS's sync(), write our own program, or use one of the following methods:

      1) HAR files: HAR builds a layered filesystem on top of HDFS. The hadoop archive command creates a HAR file by running a MapReduce job that packs the files being archived into a small number of HDFS files.

      2) HBase: It is a storage layer that packs small pieces of data into larger files, so it is a good option when the access pattern of the small files involves frequent random reads and writes.

      3) Sequence File: Here we use the file name as the key and the file contents as the value. We write a program to pack lots of small files into a single SequenceFile (a minimal sketch of such a program follows this list). SequenceFiles are splittable, so MapReduce can operate on each chunk independently and in parallel.

      4) Filecrush tool: It turns many small files into fewer larger files. It can also convert files from text format to SequenceFile format.
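      A minimal sketch of the packing program mentioned in option 3, assuming a Hadoop 2.x client on the classpath; the class name SmallFilePacker and the two path arguments are illustrative, not part of any standard tool:

      // Packs every small file under an input directory into one SequenceFile:
      // the file name becomes the key, the raw bytes become the value.
      import java.io.IOException;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FSDataInputStream;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.BytesWritable;
      import org.apache.hadoop.io.IOUtils;
      import org.apache.hadoop.io.SequenceFile;
      import org.apache.hadoop.io.Text;

      public class SmallFilePacker {
        public static void main(String[] args) throws IOException {
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);
          Path inputDir = new Path(args[0]);   // directory full of small files
          Path output   = new Path(args[1]);   // single SequenceFile to write

          SequenceFile.Writer writer = SequenceFile.createWriter(conf,
              SequenceFile.Writer.file(output),
              SequenceFile.Writer.keyClass(Text.class),
              SequenceFile.Writer.valueClass(BytesWritable.class),
              SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK));
          try {
            for (FileStatus status : fs.listStatus(inputDir)) {
              if (!status.isFile()) {
                continue;                      // skip sub-directories
              }
              byte[] contents = new byte[(int) status.getLen()];
              FSDataInputStream in = fs.open(status.getPath());
              try {
                in.readFully(0, contents);     // small file, so one read is enough
              } finally {
                in.close();
              }
              writer.append(new Text(status.getPath().getName()),
                            new BytesWritable(contents));
            }
          } finally {
            IOUtils.closeStream(writer);       // flushes the last compressed block
          }
        }
      }

      It could be run with something like hadoop jar packer.jar SmallFilePacker /user/data/small-files /user/data/packed.seq (paths illustrative). BLOCK compression is chosen because it compresses groups of records together, which suits many small values.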

    • #5576
      DataFlair Team
      Spectator

      Small files are a big problem in Hadoop. A file which is much smaller than the HDFS block size (64 MB/128 MB) is termed a small file.
      The Namenode stores the metadata of every file in memory, so if you are storing lots of small files, the Namenode has to maintain all of that metadata. Each file, directory, and block object occupies about 150 bytes, so on the order of ten million files would cost around 3 GB of memory.
      Though the Namenode keeps a persistent copy of the metadata on disk, it still needs to hold it in memory for fast retrieval.
      Small files also hamper MapReduce computation, because each small file ends up as its own input split and map task.
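      As a rough worked example, assuming each small file fits in a single block: 10,000,000 files give about 20,000,000 namespace objects (one file object plus one block object each), and 20,000,000 × 150 bytes is roughly 3 GB of Namenode heap.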

      Hadoop offers a few options to resolve small files problem:

      Hadoop Archive Files (HAR):
      A HAR file is created using the hadoop archive command, which runs a MapReduce job to pack the files being archived into a small number of HDFS files.
      Using HAR is a good idea, but reading through HAR files is slower than reading files directly in HDFS, because each access also has to go through the archive's index files.
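      As a rough illustration (the archive path below is hypothetical), an existing HAR can be listed through the har:// filesystem scheme like any other FileSystem:

      // Lists the contents of an (assumed) existing HAR archive via the
      // har:// scheme; HarFileSystem consults the archive's index files
      // before reaching the packed data, which is part of the read overhead.
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class ListHar {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          // Hypothetical archive created earlier with something like:
          //   hadoop archive -archiveName files.har -p /user/data /user/archive
          Path har = new Path("har:///user/archive/files.har");
          FileSystem harFs = har.getFileSystem(conf);  // resolves to HarFileSystem
          for (FileStatus status : harFs.listStatus(har)) {
            System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
          }
        }
      }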

      Sequence Files
      Another option is to use SequenceFiles, where you use the file name as the key and the file contents as the value.
      You can write a MapReduce program to convert lots of small files into a single SequenceFile.
      SequenceFiles are splittable, so MapReduce can break them into chunks and operate on each chunk independently.
      They also support block compression, which is usually the best option.
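      For completeness, a minimal sketch of reading such a packed SequenceFile back, assuming it was written with Text keys (file names) and BytesWritable values (file contents) as described above; the class name and argument layout are illustrative:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.BytesWritable;
      import org.apache.hadoop.io.SequenceFile;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.util.ReflectionUtils;

      public class SmallFileLister {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          Path input = new Path(args[0]);  // SequenceFile produced by the packing step
          SequenceFile.Reader reader =
              new SequenceFile.Reader(conf, SequenceFile.Reader.file(input));
          try {
            // The key and value classes are recorded in the file header.
            Text key = (Text) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            BytesWritable value =
                (BytesWritable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            while (reader.next(key, value)) {
              // key = original file name, value = original file contents
              System.out.println(key + "\t" + value.getLength() + " bytes");
            }
          } finally {
            reader.close();
          }
        }
      }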
