What is the small file problem in Hadoop?

    • #5273
      DataFlair Team
      Spectator

      What is the small file problem in Hadoop?

    • #5274
      DataFlair Team
      Spectator

      Hadoop is not suited to small data.
      Small file problem in HDFS:
      A small file in HDFS is one that is significantly smaller than the HDFS block size (128 MB by default), and HDFS does not support efficient random reads of such files. HDFS is designed to store large datasets as a relatively small number of large files; it cannot handle huge numbers of small files well. A large number of small files also overloads the NameNode, because the NameNode holds the entire HDFS namespace in memory.
      Solutions:
      HAR (Hadoop Archive) Files- HAR files deal with the small file issue by introducing a layer on top of HDFS that provides an interface for file access. We can create HAR files with the hadoop archive command, which runs a MapReduce job to pack the archived files into a smaller number of HDFS files. Reading through files in a HAR is no more efficient than reading through files in HDFS, and can in fact be slower, because each file access requires reading two index files as well as the data file; a sketch of reading through a HAR follows below.
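      As a minimal, illustrative sketch (all paths and the archive name are made up), the Java code below lists and reads files through an existing archive using the har:// scheme. It assumes the archive was already created with the hadoop archive command, for example: hadoop archive -archiveName files.har -p /user/data input /user/archives

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IOUtils;

        public class HarRead {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // har:// path to the archive on the default filesystem (hypothetical location)
            Path har = new Path("har:///user/archives/files.har");
            FileSystem harFs = har.getFileSystem(conf);

            // Listing the archive shows the original directory layout,
            // even though HDFS now stores only a few physical files.
            for (FileStatus status : harFs.listStatus(har)) {
              System.out.println(status.getPath());
            }

            // Each read goes through the HAR index files before reaching the data,
            // which is why HAR access is somewhat slower than plain HDFS.
            try (FSDataInputStream in = harFs.open(new Path(har, "input/file1.txt"))) {
              IOUtils.copyBytes(in, System.out, 4096, false);
            }
          }
        }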
      Sequence Files- SequenceFiles also deal with the small file problem. Here we use the file name as the key and the file contents as the value. For example, if we have 10,000 files of 100 KB each, we can write a program (see the sketch below) to put them into a single SequenceFile, and then process that SequenceFile in a streaming fashion.
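      A minimal sketch of such a program (the local input directory and the HDFS output path are hypothetical examples), writing one SequenceFile record per small file:

        import java.io.File;
        import java.nio.file.Files;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.BytesWritable;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;

        public class PackSmallFiles {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path output = new Path("/user/data/smallfiles.seq");  // hypothetical HDFS output
            File inputDir = new File("/local/smallfiles");        // hypothetical local input dir
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(output),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
              // One record per small file: key = file name, value = raw file contents.
              // (Assumes the input directory exists and contains only regular files.)
              for (File f : inputDir.listFiles()) {
                byte[] contents = Files.readAllBytes(f.toPath());
                writer.append(new Text(f.getName()), new BytesWritable(contents));
              }
            }
          }
        }

      A MapReduce job can then read this single file (for example with SequenceFileInputFormat), receiving one (filename, contents) record per original small file and processing the records as a stream.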
      Small file problem in MapReduce:
      The number of mappers grows with the number of files. A mapper processes one block of input at a time, so many small files mean many map tasks, each doing very little work, and this slows down processing. For example, if a client processes 80,000 small files, the job will need 80,000 mappers.

    • #5278
      DataFlair Team
      Spectator

      Hadoop is designed to process large volumes of data in batch jobs, and the default HDFS block size is 128 MB. When a large volume of data arrives as many small files, each significantly smaller than the default block size, performance degrades for two main reasons.

      1. NameNode Memory Problem:

      In the NameNode's memory, each file and each block is represented as an object of roughly 150 bytes. If we store a large volume of small files, say 10 million of them, each file occupies its own block, so the memory required to hold the metadata grows accordingly, and the number of seeks needed to retrieve the data from the DataNodes also increases.
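      As a rough, illustrative calculation (assuming about 150 bytes per namespace object and one block per file): 10,000,000 files need roughly 10,000,000 file objects plus 10,000,000 block objects, i.e. about 20,000,000 × 150 bytes ≈ 3 GB of NameNode heap for metadata alone.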

      2. MapReduce Performance Problem:

      In MapReduce, a map function processes one block of data at a time. If we schedule 10,000 small files of 10 MB each, MapReduce will launch 10,000 mapper tasks, one per file, and the job can be tens or hundreds of times slower than an equivalent job reading a single large input file.
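      For a rough comparison with the same data in one file: 10,000 × 10 MB ≈ 100 GB, which with a 128 MB block size splits into about 100 GB / 128 MB ≈ 800 blocks and therefore roughly 800 map tasks instead of 10,000, each doing far more useful work relative to its startup cost.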
