What is the Small File Problem in HDFS?

Viewing 2 reply threads
  • Author
    Posts
    • #4667
      DataFlair Team
      Spectator

      What is the Small File Problem in HDFS/Hadoop?
      How do we handle the small file problem?
      Instead of merging, is there any better alternative?

    • #4668
      DataFlair Team
      Spectator

      What is a Small File in Hadoop?

      A file which is significantly smaller than the HDFS block size (default 128 MB) is a small file. HDFS is not designed to handle a large number of such small files efficiently.

      Problems with Small Files

      1. HDFS Level:
      In HDFS, every file, directory, and block is represented as an object in the NameNode’s memory, and each object occupies roughly 150 bytes as a rule of thumb. So 10 million files, each using one block, amount to about 20 million objects (a file object plus a block object per file), which is around 3 gigabytes of NameNode memory. Scaling much beyond this level is a problem on current hardware.

      2. MapReduce Level:
      The number of mapper tasks depends on the number of InputSplits, and by default there is one map task per block. So if we have a large number of small files occupying, say, 8,000 blocks, the job launches 8,000 mappers, one per block. Running that many tiny map tasks, each processing very little data, kills the processing capability of the system.

      Possible Solutions for Small Files

      1. HAR (Hadoop Archive) Files –
      HAR was introduced to deal with the small file issue. It is built around the hadoop archive command and adds a layer on top of HDFS that provides an interface for file access. A HAR file is created with the hadoop archive command, which runs a MapReduce job to pack the files being archived into a small number of HDFS files. Reading through a HAR is somewhat slower than reading a file directly from HDFS, because each access goes through the archive’s index: HAR keeps a master index plus a part index that map the many small file names to offsets within the packed files (a rough access sketch follows after this list).

      2. Sequence Files –
      We can also use a SequenceFile. The idea is to use the file name as the key and the file contents as the value, which works very well in practice. If we have 10,000 files of 100 KB each, we can write a program to pack them into a single SequenceFile and then process that SequenceFile in a streaming fashion (a writer sketch follows below).
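
      For illustration, here is a rough sketch of reading a file back through a HAR. The archive name and paths are hypothetical; we assume an archive created with something like hadoop archive -archiveName files.har -p /user/hadoop small-files /user/hadoop/archives. Files inside the archive are then reachable through the ordinary FileSystem API via the har:// scheme:

      import java.io.InputStream;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IOUtils;

      public class HarReadSketch {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();

          // Hypothetical path: a small file packed inside files.har by the hadoop archive command.
          Path inside = new Path("har:///user/hadoop/archives/files.har/small-files/record-0001.txt");

          // The har:// scheme is handled by HarFileSystem, so normal FileSystem calls work.
          FileSystem fs = FileSystem.get(inside.toUri(), conf);
          try (InputStream in = fs.open(inside)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
          }
        }
      }

      And here is a minimal sketch of packing local small files into one SequenceFile, with the file name as key and the raw bytes as value. The output path is a placeholder, and the input files are taken from the command line:

      import java.io.File;
      import java.nio.file.Files;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.BytesWritable;
      import org.apache.hadoop.io.SequenceFile;
      import org.apache.hadoop.io.Text;

      public class SmallFilesToSequenceFile {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();

          // Hypothetical output: one SequenceFile that holds all the small files.
          Path out = new Path("/user/hadoop/smallfiles.seq");

          try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
              SequenceFile.Writer.file(out),
              SequenceFile.Writer.keyClass(Text.class),
              SequenceFile.Writer.valueClass(BytesWritable.class))) {
            // Each argument is a local small file; its name becomes the key, its bytes the value.
            for (String name : args) {
              byte[] contents = Files.readAllBytes(new File(name).toPath());
              writer.append(new Text(name), new BytesWritable(contents));
            }
          }
        }
      }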

    • #4669
      DataFlair Team
      Spectator

      Because of the reasons below, HDFS does not work well with lots of small files (files much smaller than the HDFS block size):

      • Each block holds a single file, so we end up with a lot of blocks that are smaller than the configured block size. Reading all these blocks one by one means a lot of time is spent on disk seeks.
      • The NameNode keeps track of each file and each block (about 150 bytes per object) and holds this data in memory, so a large number of files occupies a correspondingly large amount of memory.
      • With the default FileInputFormat, each map task usually processes one block of input at a time. If the files are very small and there are a lot of them, then each map task processes very little input, and there are many more map tasks, each of which imposes extra bookkeeping overhead.

      Possible Solutions –

      1. Run an offline aggregation process to aggregate our small files, then re-upload the aggregated files ready for processing.
      2. Add an additional Hadoop step to the start of our job flow which aggregates the small files.
      3. HAR (Hadoop Archive) Files –
      HAR was introduced to deal with the small file issue. It is built around the hadoop archive command and adds a layer on top of HDFS that provides an interface for file access. A HAR file is created with the hadoop archive command, which runs a MapReduce job to pack the files being archived into a small number of HDFS files, though reading through a HAR is somewhat slower than reading the file directly from HDFS.

      4. Sequence Files –
      We can also use a SequenceFile, with the file name as the key and the file contents as the value, which works very well in practice. If we have 10,000 files of 100 KB each, we can write a program to pack them into a single SequenceFile and then process it in a streaming fashion (a reading sketch follows after this list).

      5. HBase –
      HBase stores data in MapFiles (indexed SequenceFiles) and is a good choice if we need to do MapReduce-style streaming analyses with occasional random lookups.
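
      As a small illustration of the streaming access mentioned in point 4, here is a rough sketch of reading such a SequenceFile back record by record. The input path is a placeholder for wherever the packed file was written:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.BytesWritable;
      import org.apache.hadoop.io.SequenceFile;
      import org.apache.hadoop.io.Text;

      public class SequenceFileStreamingRead {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();

          // Hypothetical input: the SequenceFile the small files were packed into.
          Path in = new Path("/user/hadoop/smallfiles.seq");

          try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
              SequenceFile.Reader.file(in))) {
            Text key = new Text();                     // original file name
            BytesWritable value = new BytesWritable(); // original file contents
            while (reader.next(key, value)) {
              System.out.println(key + " -> " + value.getLength() + " bytes");
            }
          }
        }
      }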
