What is the small file problem in Hadoop?

    • #5273
      DataFlair Team
      Spectator

      What is the small file problem in Hadoop?

    • #5274
      DataFlair Team
      Spectator

      Hadoop is not suited to small data.
      Small file problem in HDFS:
      A small file in HDFS is one that is significantly smaller than the HDFS block size (128 MB by default), and HDFS does not support efficient random reads of such files. HDFS is designed to store large datasets as a relatively small number of large files; it cannot handle huge numbers of small files well. A large number of small files also overloads the NameNode, because the NameNode holds the entire HDFS namespace in memory.
      Solutions:
      HAR (Hadoop Archive) Files- HAR files deal with the small file issue by introducing a layer on top of HDFS that provides an interface for file access. We can create HAR files with the hadoop archive command, which runs a MapReduce job to pack the archived files into a smaller number of HDFS files. Reading through files in a HAR is no more efficient than reading through files in HDFS, and can in fact be slower, because each file access requires reading two index files as well as the data file; a sketch of reading through a HAR follows below.
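      As a minimal, illustrative sketch (all paths and the archive name are made up), the Java code below lists and reads files through an existing archive using the har:// scheme. It assumes the archive was already created with the hadoop archive command, for example: hadoop archive -archiveName files.har -p /user/data input /user/archives

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IOUtils;

        public class HarRead {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // har:// path to the archive on the default filesystem (hypothetical location)
            Path har = new Path("har:///user/archives/files.har");
            FileSystem harFs = har.getFileSystem(conf);

            // Listing the archive shows the original directory layout,
            // even though HDFS now stores only a few physical files.
            for (FileStatus status : harFs.listStatus(har)) {
              System.out.println(status.getPath());
            }

            // Each read goes through the HAR index files before reaching the data,
            // which is why HAR access is somewhat slower than plain HDFS.
            try (FSDataInputStream in = harFs.open(new Path(har, "input/file1.txt"))) {
              IOUtils.copyBytes(in, System.out, 4096, false);
            }
          }
        }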
      Sequence Files- SequenceFiles also deal with the small file problem. Here we use the file name as the key and the file contents as the value. For example, if we have 10,000 files of 100 KB each, we can write a program (see the sketch below) to put them into a single SequenceFile, and then process that SequenceFile in a streaming fashion.
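      A minimal sketch of such a program (the local input directory and the HDFS output path are hypothetical examples), writing one SequenceFile record per small file:

        import java.io.File;
        import java.nio.file.Files;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.BytesWritable;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;

        public class PackSmallFiles {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path output = new Path("/user/data/smallfiles.seq");  // hypothetical HDFS output
            File inputDir = new File("/local/smallfiles");        // hypothetical local input dir
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(output),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
              // One record per small file: key = file name, value = raw file contents.
              // (Assumes the input directory exists and contains only regular files.)
              for (File f : inputDir.listFiles()) {
                byte[] contents = Files.readAllBytes(f.toPath());
                writer.append(new Text(f.getName()), new BytesWritable(contents));
              }
            }
          }
        }

      A MapReduce job can then read this single file (for example with SequenceFileInputFormat), receiving one (filename, contents) record per original small file and processing the records as a stream.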
      Small file problem in MapReduce:
      The number of mappers grows with the number of files. A mapper processes one block of input at a time, so many small files mean many map tasks, each doing very little work, and this slows down processing. For example, if a client processes 80,000 small files, the job will need 80,000 mappers.

    • #5278
      DataFlair Team
      Spectator

      Hadoop is designed to process large volumes of data in batch jobs, and the default HDFS block size is 128 MB. When a large volume of data arrives as many small files, each significantly smaller than the default block size, performance degrades for two main reasons.

      1. NameNode Memory Problem:

      In the NameNode's memory, each file and each block is represented as an object of roughly 150 bytes. If we store a large volume of small files, say 10 million of them, each file occupies its own block, so the memory required to hold the metadata grows accordingly, and the number of seeks needed to retrieve the data from the DataNodes also increases.
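      As a rough, illustrative calculation (assuming about 150 bytes per namespace object and one block per file): 10,000,000 files need roughly 10,000,000 file objects plus 10,000,000 block objects, i.e. about 20,000,000 × 150 bytes ≈ 3 GB of NameNode heap for metadata alone.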

      2. MapReduce Performance Problem:

      In MapReduce, a map function processes one block of data at a time. If we schedule 10,000 small files of 10 MB each, MapReduce will launch 10,000 mapper tasks, one per file, and the job can be tens or hundreds of times slower than an equivalent job reading a single large input file.
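      For a rough comparison with the same data in one file: 10,000 × 10 MB ≈ 100 GB, which with a 128 MB block size splits into about 100 GB / 128 MB ≈ 800 blocks and therefore roughly 800 map tasks instead of 10,000, each doing far more useful work relative to its startup cost.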
