What is the small file problem in Hadoop?

    • #6307
      DataFlair Team
      Spectator

      What is the small file problem in Hadoop?
      What file size is considered small?
      Can Hadoop handle small files efficiently?
      What is preferable in Hadoop: tons of small files or a smaller number of large files?

    • #6309
      DataFlair Team
      Spectator

      HDFS
      The Hadoop Distributed File System (HDFS) is a distributed file system. Hadoop is designed mainly for batch processing of large volumes of data. The default HDFS block size is 128 MB. When files are significantly smaller than the block size, efficiency degrades.

      HDFS cannot efficiently handle a large number of files that are much smaller than the block size. In the NameNode's memory, every file, directory, and block in HDFS is represented as an object, and each object takes roughly 150 bytes. If we consider 10 million small files, each occupying its own block, that adds up to about 3 GB of NameNode memory. With current hardware limitations, scaling much beyond this level is a problem: with a very large number of files, the memory required to store the metadata becomes too high and cannot scale past a certain limit.
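
      As a rough back-of-the-envelope sketch of that arithmetic (plain Java; the 150-byte figure is the estimate quoted above, not an exact number):

          public class NameNodeMemoryEstimate {
              public static void main(String[] args) {
                  long bytesPerObject = 150L;           // rough size of one NameNode object
                  long smallFiles = 10_000_000L;        // 10 million small files
                  long objects = smallFiles * 2;        // one file object + one block object each
                  double gb = objects * bytesPerObject / 1e9;
                  System.out.printf("Approx. NameNode heap for metadata: %.1f GB%n", gb);
                  // Prints roughly 3.0 GB, matching the figure above.
              }
          }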

      MapReduce
      In MapReduce, a map task processes one block of data at a time. Many small files mean many blocks, which means many map tasks and a lot of bookkeeping by the ApplicationMaster. This slows overall cluster performance compared to processing the same data as large files.
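
      As a small illustration of why the task count explodes (a minimal sketch using the stock FileSystem API; the path /data/events is hypothetical):

          import java.io.IOException;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileStatus;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          public class MapTaskEstimate {
              public static void main(String[] args) throws IOException {
                  FileSystem fs = FileSystem.get(new Configuration());
                  FileStatus[] entries = fs.listStatus(new Path("/data/events"));
                  long tasks = 0;
                  for (FileStatus f : entries) {
                      if (!f.isFile()) continue;
                      // With the default FileInputFormat a split never spans files,
                      // so every small file costs at least one map task.
                      long blocks = (long) Math.ceil((double) f.getLen() / f.getBlockSize());
                      tasks += Math.max(1, blocks);
                  }
                  System.out.println("Approx. map tasks for this input: " + tasks);
              }
          }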

    • #6310
      DataFlair Team
      Spectator

      Small File- A file is considered small when its size is much smaller than the configured HDFS block size (64/128/256 MB).
      Problems with small files-
      1) NameNode memory is heavily consumed- The metadata stored for each file is roughly 150 bytes. With some millions of small files, GBs of NameNode memory are used.
      2) MapReduce performance is degraded- Reading a few large files sequentially is always faster than reading many small files scattered randomly across the disks.
      3) Huge number of JVMs- Each block is processed by one map task, and each map task runs in its own JVM. Many small files mean many blocks, which means spinning up many JVMs (see the configuration sketch after this list).
      4) Queuing- Hadoop can run only a limited number of MapReduce tasks concurrently. When the number of map tasks outgrows the number of concurrent slots, the remaining tasks are queued until slots open up, causing latency.
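
      Related to points 3) and 4), a hedged aside: for genuinely tiny jobs, YARN MapReduce can run all tasks inside the ApplicationMaster's JVM ("uber mode"), which avoids spinning up one JVM per map. This is only a sketch of the stock configuration knobs, not a fix for the small file problem itself:

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.mapreduce.Job;

          public class UberModeSketch {
              public static void main(String[] args) throws Exception {
                  Configuration conf = new Configuration();
                  // Let small jobs run all tasks in the ApplicationMaster JVM
                  // instead of one container/JVM per map task.
                  conf.setBoolean("mapreduce.job.ubertask.enable", true);
                  conf.setInt("mapreduce.job.ubertask.maxmaps", 9);      // stock default threshold
                  conf.setInt("mapreduce.job.ubertask.maxreduces", 1);   // stock default threshold
                  Job job = Job.getInstance(conf, "tiny-job-uber-sketch");
                  // ... configure mapper/reducer, input and output paths as usual ...
              }
          }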

    • #6311
      DataFlair Team
      Spectator

      Definition of Small Files:

      A file is defined as small if its size is significantly less than the configured Hadoop block size, i.e. 128 MB by default.
      If a file does not fill roughly 70-75% of a block, it can be considered a small file.
      There is also a related issue when a file is slightly larger than the block size. For example, a 130 MB file ends up as one full 128 MB block plus one 2 MB block. Solving this is fairly simple, as we just need to increase the block size (see the sketch below).
      Solving the small file problem requires more complex solutions.
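
      To the point about increasing the block size: the block size is a per-file, client-side setting, so a larger value can simply be chosen when the file is written. A minimal sketch (the path and the 256 MB value are illustrative only):

          import java.io.IOException;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FSDataOutputStream;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          public class LargerBlockSizeSketch {
              public static void main(String[] args) throws IOException {
                  Configuration conf = new Configuration();
                  // Ask for 256 MB blocks for files written by this client.
                  conf.setLong("dfs.blocksize", 256L * 1024 * 1024);
                  FileSystem fs = FileSystem.get(conf);
                  try (FSDataOutputStream out = fs.create(new Path("/data/big-file.bin"))) {
                      out.writeBytes("payload goes here"); // hypothetical content
                  }
              }
          }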

      Sources of Small Files:

      Hadoop ingests data from sensors (real-time data), which may arrive as one small file per second.
      During MapReduce, if a large amount of data goes to one reducer and only small amounts go to the other reducers, those reducers will produce small output files.

      Small File Problem in Hadoop:

      Two major issues are:
      1) High NameNode memory utilization.
      2) Degraded MapReduce performance.

      NameNode Memory Issue:

      The objects in the NameNode are directories, blocks, and files, and all of these objects are held in memory. Each object requires about 150 bytes. Let's assume we have 10 million files, each needing a block; the NameNode will then need roughly 3-4 GB.
      As we scale up, this grows further, say to 100-200 GB. Every time the NameNode starts, it reads the metadata of each file from its on-disk image, so it would have to read 100-200 GB of data from local disk, which causes a long delay in Hadoop startup.

      DataNodes regularly report to the NameNode so that it can keep track of all DataNodes and their data blocks. With a huge number of blocks being reported, these block reports eat up network bandwidth.
      So small files cause NameNode memory pressure, network overhead, and delayed startup.
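
      A small sketch of how one might gauge the scale of this on an existing directory, using the stock FileSystem/ContentSummary API (the path is hypothetical; 150 bytes per object is the rough estimate quoted above):

          import java.io.IOException;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.ContentSummary;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          public class NameNodeLoadCheck {
              public static void main(String[] args) throws IOException {
                  FileSystem fs = FileSystem.get(new Configuration());
                  ContentSummary cs = fs.getContentSummary(new Path("/data"));
                  long files = cs.getFileCount();
                  long dirs = cs.getDirectoryCount();
                  // Assume at least one block per file and ~150 bytes per NameNode object.
                  long objects = files * 2 + dirs;
                  System.out.printf("files=%d, dirs=%d -> at least %.1f MB of NameNode heap%n",
                          files, dirs, objects * 150 / 1e6);
              }
          }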

      Degraded MapReduce Performance:

      With a large number of small files, a large number of disk I/O operations is required during MapReduce. This hurts MapReduce performance compared to a large sequential read, which is fast.

      As we know, one map task processes one block. Now let's compute: if we have 5,000 files of 10 MB each, 5,000 map tasks will be scheduled (1 file of 10 MB = 1 block), and moreover each map task runs in its own JVM.
      Such a large number of map tasks requires a large number of nodes. If the cluster is small, the tasks are queued, causing delay.

      Had we stored the same data (roughly 50 GB) as 128 MB files, we would have only around 390 files and therefore need about 390 map tasks instead of 5,000. This takes fewer resources, less network traffic, less disk I/O, and gives much better MapReduce performance.
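
      As a hedged aside on how this gap between "one map task per small file" and "one map task per ~128 MB" is usually closed at the input-format level: Hadoop ships CombineTextInputFormat, which packs many small files into splits of up to a configured size. A minimal job-setup sketch (mapper/reducer wiring omitted):

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.mapreduce.Job;
          import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
          import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
          import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

          public class SmallFilesJobSketch {
              public static void main(String[] args) throws Exception {
                  Job job = Job.getInstance(new Configuration(), "combine-small-files");
                  job.setJarByClass(SmallFilesJobSketch.class);

                  // Pack many small files into ~128 MB splits so one map task
                  // reads many files instead of one map task per file.
                  job.setInputFormatClass(CombineTextInputFormat.class);
                  CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

                  FileInputFormat.addInputPath(job, new Path(args[0]));
                  FileOutputFormat.setOutputPath(job, new Path(args[1]));
                  // job.setMapperClass(...); job.setReducerClass(...); etc., as usual.
                  System.exit(job.waitForCompletion(true) ? 0 : 1);
              }
          }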
