Small File Problem in Hadoop

    • #5534
      DataFlair Team
      Spectator

      Can Hadoop handle small files efficiently? What happens when we store small files in Hadoop? What is the small file problem, and how can it be resolved?

    • #5536
      DataFlair Team
      Spectator

      Hadoop is not suited to small files. HDFS cannot efficiently support random reads of many small files because of its high-capacity design. A small file is one that is significantly smaller than the HDFS block size (128 MB by default). If we store a huge number of such files, HDFS cannot handle them well, because it was designed for a small number of large files holding large datasets, not for a large number of small files.

      The main issues with a large number of small files in Hadoop are:

      1) Every file in HDFS is represented as an object in the NameNode's memory (each object takes roughly 150 bytes). A large number of small files therefore consumes a lot of the master's memory, and scaling up in this fashion is not feasible.
      2) When there is a large number of files, reading them causes many disk seeks and frequent hopping from DataNode to DataNode, which increases file read/write time.
      Solutions

      HAR (Hadoop Archive) files were introduced to deal with the small file issue. HAR adds a layer on top of HDFS that provides an interface for accessing the archived files. A HAR file is created with the hadoop archive command, which runs a MapReduce job to pack the files being archived into a smaller number of HDFS files. Reading files inside a HAR is, however, no more efficient than reading them directly from HDFS; it is actually slower, because each file access requires reading two index files in addition to the data file itself.
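
      As a rough illustration (the directory paths and archive name below are made up), a HAR can be created from a directory of small files and then listed through the har:// URI scheme:

          hadoop archive -archiveName small-files.har -p /user/data/input /user/data/archive
          hdfs dfs -ls har:///user/data/archive/small-files.har

      Inside the resulting .har directory the data is stored as part files together with _index and _masterindex files, which is why every read involves the extra index lookups mentioned above.
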
      Sequence files also address the small file problem: the file name is used as the key and the file contents as the value. If we have 10,000 files of 100 KB each, we can write a program to put them into a single sequence file and then process them in a streaming fashion, as sketched below.
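
      A minimal sketch of such a program (the class name and paths are placeholders; it assumes the standard Hadoop SequenceFile API), packing a local directory of small files into one sequence file with the file name as the key and the raw file bytes as the value:

          import java.io.File;
          import java.nio.file.Files;

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.io.BytesWritable;
          import org.apache.hadoop.io.IOUtils;
          import org.apache.hadoop.io.SequenceFile;
          import org.apache.hadoop.io.Text;

          public class SmallFilesToSequenceFile {
              public static void main(String[] args) throws Exception {
                  File inputDir = new File(args[0]);   // local directory holding the small files
                  Path output = new Path(args[1]);     // destination sequence file on HDFS

                  Configuration conf = new Configuration();
                  SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                          SequenceFile.Writer.file(output),
                          SequenceFile.Writer.keyClass(Text.class),
                          SequenceFile.Writer.valueClass(BytesWritable.class));
                  try {
                      for (File f : inputDir.listFiles()) {
                          byte[] contents = Files.readAllBytes(f.toPath());
                          // key = file name, value = file contents
                          writer.append(new Text(f.getName()), new BytesWritable(contents));
                      }
                  } finally {
                      IOUtils.closeStream(writer);
                  }
              }
          }

      The resulting sequence file can then be fed to a MapReduce job and processed block by block, instead of launching one task per tiny file.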

    • #5538
      DataFlair Team
      Spectator

      HDFS, Hadoop's distributed file system, is designed mainly for large volumes of data, with a default block size of 128 MB. When reading many small files, the system keeps jumping from one DataNode to another to retrieve each file, which is slow.

      In a MapReduce program, each map task processes one block of input at a time. If the files are very small, the input to each task is tiny and there are many files, so a large number of map tasks is launched. For example, 10,000 files of 100 KB (about 1 GB in total) produce 10,000 map tasks, whereas the same data in a single file would need only about eight map tasks with a 128 MB block size.

      There are multiple solutions available for this problem:

      1) Consolidator
      2) Using HBase storage
      3) Sequence Files
      4) HAR
