What is a small file problem?

A small file can be defined as any file that is significantly smaller than the Hadoop block size.
Hadoop is designed to store a small number of large files rather than a large number of small files.
Let’s assume we have 100 files of the following size:
F1 – 1 MB
F2 – 1 MB
F3 – 1 MB
.
.
.
F100 – 1 MB
In HDFS, as soon as we try to put this data into the cluster (using the -put, -copyFromLocal, or -moveFromLocal command, or any other client write), the client interacts with the Name Node and requests the creation of each file, and an entry is made in the metadata. That information lives in Name Node RAM.
Let’s assume our block size is 128 MB and the replication factor is 3. Now, as soon as we store file F1, HDFS creates 1 block for it (say B1) and replicates it 3 times.

Suppose B1 is stored on Machine 1, Machine 2 and Machine 3.
So, we have a file-to-block mapping and a block-to-Data-Node mapping. This information is stored in Name Node RAM in the form of objects, and each object occupies roughly 150 bytes of memory. So, even if a file is a small file with only 1 block, it still occupies about 150 bytes of Name Node memory. To store F1 we need 150 bytes, to store F2 we need another 150 bytes, and so on. To store 100 such files, i.e. only 100 MB of data, we need 150 x 100 = 15,000 bytes of Name Node RAM.
Consider another file, “IdealFile”, of size 100 MB. We need only one block here, i.e. B1, stored on Machine 1, Machine 2 and Machine 3. This occupies only about 150 bytes of Name Node RAM.
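To make the arithmetic concrete, here is a tiny back-of-the-envelope sketch in Java. The 150-bytes-per-object figure is the rough estimate used above, not an exact Name Node internal, and the class and method names are purely illustrative.

public class NameNodeMemoryEstimate {

    // Rough, illustrative figure from the explanation above: each metadata
    // object held by the Name Node costs on the order of 150 bytes of heap.
    static final long BYTES_PER_OBJECT = 150;

    // Each single-block file needs one such metadata entry.
    static long metadataBytes(long singleBlockFileCount) {
        return singleBlockFileCount * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        // 100 files of 1 MB each -> 100 entries -> ~15,000 bytes of Name Node RAM
        System.out.println("100 x 1 MB files: " + metadataBytes(100) + " bytes");
        // 1 file of 100 MB (one block) -> 1 entry -> ~150 bytes
        System.out.println("1 x 100 MB file : " + metadataBytes(1) + " bytes");
    }
}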
So, for each and every file the Name Node has to create metadata, and that metadata is stored in Name Node RAM. If we have a large number of small files, this obviously becomes a burden on the Name Node. This is the small file problem, and it really consists of two problems:
Problem 1: Problem with Storage: as explained above, the metadata for every small file consumes Name Node memory.
Problem 2: Problem with Processing: each block (or one Input Split) gets processed by one Map Task. So, for the 100 blocks listed above we need 100 Mappers, but for IdealFile, which is 100 MB, we need only one Mapper.

Solutions to the above two small file problems:
Solution for Storage:
• Compress or archive the data.
• Create a Sequence File. A Sequence File stores data in (key, value) pair format, where the key can be the file name and the value the content of the file (a small sketch follows below).
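As a rough illustration of the Sequence File approach, the sketch below packs every file from a local directory into one SequenceFile on HDFS, using the file name as the key and the raw bytes as the value. The paths /data/small-files and /user/dataflair/packed.seq are hypothetical, error handling is minimal, and the option-based SequenceFile.createWriter API assumed here is the Hadoop 2.x style.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

import java.io.File;
import java.io.FileNotFoundException;
import java.nio.file.Files;

public class SmallFilesToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path output = new Path("/user/dataflair/packed.seq"); // hypothetical output path on HDFS

        File[] smallFiles = new File("/data/small-files").listFiles(); // hypothetical local directory
        if (smallFiles == null) {
            throw new FileNotFoundException("/data/small-files");
        }

        SequenceFile.Writer writer = null;
        try {
            writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(output),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class));

            // One record per small file: key = file name, value = file content.
            for (File f : smallFiles) {
                byte[] content = Files.readAllBytes(f.toPath());
                writer.append(new Text(f.getName()), new BytesWritable(content));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}

Once the small files are packed this way, the Name Node holds metadata for a single large file instead of one entry per tiny file, and downstream jobs read one splittable file.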
Solution for Processing:
• With the Sequence File input format, the whole packed file can be processed with a single Mapper.
• CombineFileInputFormat in MapReduce: keep all the small files in a directory and give that directory as input to the job; the framework combines many small files into each Input Split, and we can configure the maximum split size (a minimal driver sketch follows below).
(Input Split: a logical representation of a chunk of data that is processed by a single Mapper.) We go with CombineFileInputFormat when all the files are of the same type.
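Below is a minimal driver sketch for the CombineFileInputFormat approach, using Hadoop's built-in CombineTextInputFormat so that many small text files are packed into each split. The input/output paths, the 128 MB split cap, and the pass-through mapper are illustrative assumptions rather than part of the original answer.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFilesDriver {

    // Pass-through mapper: simply re-emits each line; the interesting part is
    // how the input format groups many small files into a single split.
    public static class PassThroughMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            context.write(key, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine small files");
        job.setJarByClass(CombineSmallFilesDriver.class);
        job.setMapperClass(PassThroughMapper.class);
        job.setNumReduceTasks(0); // map-only job

        // Pack many small files into few splits; cap each combined split at 128 MB.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path("/data/small-files"));        // hypothetical input dir
        FileOutputFormat.setOutputPath(job, new Path("/data/combined-output"));  // hypothetical output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With this setup the number of Map Tasks is driven by the combined split size rather than by the number of files, so the 100 one-megabyte files from the example can be handled by a single Mapper instead of 100.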
