What is the problem with small files in Hadoop?


  • Author
    Posts
    • #5955
      DataFlair Team
      Spectator

      What is the Small File Problem in HDFS?
      How can we resolve the small file problem in Hadoop?
      How do we deal with small files in Hadoop?

    • #5957
      DataFlair Team
      Spectator

      What is the Small File Problem in HDFS?

      A file which is significantly smaller than the HDFS block size is called a small file.
      Now, how do small files cause problems?
      The small file problem is two-fold:
      the Small File Problem in HDFS and the Small File Problem in MapReduce.

      Small File Problem in HDFS

      Every file and directory is represented as an object in the NameNode's memory, and each object occupies roughly 150 bytes. Storing a very large number of small files therefore consumes far more NameNode memory than is feasible.
      Reading many small files also increases the number of file seeks and the amount of hopping from one DataNode to another, which is slow.
      Solutions:
      1) HAR (Hadoop Archive) files were introduced to deal with the small file issue. HAR adds a layer on top of HDFS that provides an interface for file access. HAR files are created with the hadoop archive command (for example, hadoop archive -archiveName files.har -p /input/dir /output/dir), which runs a MapReduce job to pack the files being archived into a smaller number of HDFS files. Reading through files in a HAR is no more efficient than reading files in HDFS, and can even be slower, since each HAR file access requires reading two index files as well as the data file.

      2) Sequence Files also address the small file problem: we use the filename as the key and the file contents as the value. If we have 10,000 files of 100 KB each, we can write a program to put them into a single SequenceFile and then process them in a streaming fashion, as sketched below.
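
      Below is a minimal sketch of approach 2), written against the standard SequenceFile.Writer API. It walks a directory of small files and appends each one to a single SequenceFile, using the file name as the key and the raw file bytes as the value. The class name and the command-line input/output paths are illustrative assumptions, not part of any standard Hadoop tool.

      import java.io.IOException;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FSDataInputStream;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.BytesWritable;
      import org.apache.hadoop.io.SequenceFile;
      import org.apache.hadoop.io.Text;

      public class SmallFilesToSequenceFile {
          public static void main(String[] args) throws IOException {
              Configuration conf = new Configuration();
              FileSystem fs = FileSystem.get(conf);

              Path inputDir = new Path(args[0]);   // directory full of small files
              Path outputFile = new Path(args[1]); // single packed SequenceFile

              try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                      SequenceFile.Writer.file(outputFile),
                      SequenceFile.Writer.keyClass(Text.class),
                      SequenceFile.Writer.valueClass(BytesWritable.class))) {

                  for (FileStatus status : fs.listStatus(inputDir)) {
                      if (status.isDirectory()) {
                          continue;   // skip sub-directories in this simple sketch
                      }
                      // Read the whole small file into memory.
                      byte[] contents = new byte[(int) status.getLen()];
                      try (FSDataInputStream in = fs.open(status.getPath())) {
                          in.readFully(0, contents);
                      }
                      // Key = file name, Value = file contents.
                      writer.append(new Text(status.getPath().getName()),
                                    new BytesWritable(contents));
                  }
              }
          }
      }

      A MapReduce job can then read the packed file with SequenceFileInputFormat and process the original small files as key/value records, instead of opening thousands of separate HDFS files.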

      Small File Problem in MapReduce:
      In MapReduce, the number of input files drives the number of mappers. A map task normally processes one block of input at a time, but each small file becomes its own input split, so the number of map tasks grows with the number of files. Launching and tearing down all of those tasks makes the job very slow.
      For example, a job that processes 80,000 small files needs 80,000 mappers.
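
      To see where those 80,000 mappers come from, here is a minimal, map-only job sketch using the standard org.apache.hadoop.mapreduce API (the class names are illustrative). With the default TextInputFormat, FileInputFormat creates at least one input split per file, so pointing the job at a directory of 80,000 small files launches at least 80,000 map tasks.

      import java.io.IOException;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class SmallFileJobDemo {
          // Pass-through mapper: writes every input record back out unchanged.
          public static class PassThroughMapper
                  extends Mapper<LongWritable, Text, LongWritable, Text> {
              @Override
              protected void map(LongWritable key, Text value, Context context)
                      throws IOException, InterruptedException {
                  context.write(key, value);
              }
          }

          public static void main(String[] args) throws Exception {
              Job job = Job.getInstance(new Configuration(), "small-file-demo");
              job.setJarByClass(SmallFileJobDemo.class);
              job.setMapperClass(PassThroughMapper.class);
              job.setNumReduceTasks(0);   // map-only job
              // The default TextInputFormat never groups files together:
              // every small file becomes (at least) one split, hence one map task.
              FileInputFormat.addInputPath(job, new Path(args[0]));
              FileOutputFormat.setOutputPath(job, new Path(args[1]));
              System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
      }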

      Follow the link to learn more about Small File Problem in Hadoop

    • #5958
      DataFlair Team
      Spectator

      A file which is significantly smaller than the block size (64 MB/128 MB) is referred to as a small file.

      Firstly, storing every file in HDFS requires roughly 150 bytes of metadata in the NameNode's memory. Spending that much memory on a huge number of small files is not recommended on the commodity hardware used for the NameNode, and it does not lead to the best utilization of that hardware.
      Secondly, whenever there are many requests to read files, the disk seek time rapidly increases, because the data is spread over many blocks and hopping from one DataNode to another consumes time.
      Thirdly, MapReduce creates a task for each file. When working with a large number of small files there will be many tasks in the queue, and a lot of scheduling and startup overhead is created.
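
      A rough back-of-the-envelope sketch of the first point, using the ~150 bytes per namespace object figure mentioned above. The assumption that each small file contributes two objects (one file entry plus one block) is only for illustration.

      public class NameNodeHeapEstimate {
          public static void main(String[] args) {
              long files = 10_000_000L;       // 10 million small files
              long bytesPerObject = 150L;     // rough per-object figure quoted above
              long objectsPerFile = 2L;       // assumption: one file entry + one block each
              long heapBytes = files * bytesPerObject * objectsPerFile;
              // Prints roughly 3.0 GB of NameNode heap spent on metadata alone.
              System.out.printf("~%.1f GB of NameNode heap for metadata%n",
                      heapBytes / 1_000_000_000.0);
          }
      }

      Packing the same data into a small number of large files (for example with HAR or SequenceFiles, as described in the previous reply) reduces this to a handful of namespace objects.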

    • #5960
      DataFlair Team
      Spectator

      HDFS is not an ideal use case for very small files.
      The three main issues are the following:
      1) The client needs to interact with the master (NameNode) a huge number of times.
      2) Each node will hold too many small file chunks, putting too much pressure on the network.
      3) The metadata will become too large, since the master has to save the information for all of the files.
