What is the problem with small files in Hadoop?

This topic has 3 replies, 1 voice, and was last updated 7 years, 10 months ago by DataFlair Team.

Viewing 3 reply threads

Author

Posts
- September 20, 2018 at 4:50 pm #5955
  
  DataFlair Team
  Spectator
  
  What is the small file problem in HDFS?
  How can the small file problem in Hadoop be resolved?
  How do you deal with the small file problem in Hadoop?
- September 20, 2018 at 4:50 pm #5957
  
  DataFlair Team
  Spectator
  
  What is Small File Problem in HDFS?
  
  A file that is significantly smaller than the HDFS block size is called a small file.
  Now, how do small files create problems?
  The small file problem is twofold:
  the small file problem in HDFS and the small file problem in MapReduce.
  
  Small File problem in HDFS
  
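  Each file, directory, and block is an object in the NameNode's memory, taking about 150 bytes each, so a very large number of small files needs more NameNode memory than is feasible.
  Reading many small files also increases the number of seeks and the amount of hopping from one DataNode to another.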
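To put a number on the memory cost, here is a rough back-of-the-envelope sketch. It uses the 150-byte rule of thumb from above and assumes each small file occupies one block, with the block counted as its own NameNode object. The function name and the file count are made up for illustration, and real NameNode heap usage will differ.

```python
# Rule-of-thumb estimate of NameNode memory used by file metadata.
# Assumption: ~150 bytes per namespace object (file or block).
BYTES_PER_OBJECT = 150


def namenode_bytes(num_files, blocks_per_file=1):
    """Estimate NameNode memory: one object per file plus one per block."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


# Hypothetical workload: 10 million small files, one block each.
estimate = namenode_bytes(10_000_000)
print(f"{estimate / 1e9:.1f} GB")  # prints "3.0 GB"
```

The estimate grows linearly with the number of files, regardless of how little data each file holds.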
  Solution:
  1) HAR (Hadoop Archive) files were introduced to deal with the small file issue. HAR adds a layer on top of HDFS that provides an interface for file access. HAR files are created with the Hadoop archive command, which runs a MapReduce job to pack the files being archived into a smaller number of HDFS files. Reading files in a HAR is no more efficient than reading files in HDFS. In fact it can be slower, since each HAR file access requires reading two index files as well as the data file.
  
  2) Sequence files also deal with the small file problem. Here we use the filename as the key and the file contents as the value. If we have 10,000 files of 100 KB each, we can write a program to put them into a single sequence file and then process them in a streaming fashion.
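The filename-as-key, contents-as-value idea can be sketched in a few lines. Note this is not the Hadoop SequenceFile format or API. It is a simplified stand-in that writes length-prefixed key/value records into one stream and reads them back one at a time. The `pack` and `stream` helpers and the record layout are assumptions made for illustration only.

```python
import io
import struct


def pack(files, out):
    """Append (name, contents) pairs to a binary stream as length-prefixed records."""
    for name, data in files:
        key = name.encode("utf-8")
        # 8-byte header: key length and value length as big-endian unsigned ints.
        out.write(struct.pack(">II", len(key), len(data)))
        out.write(key)
        out.write(data)


def stream(inp):
    """Yield (name, contents) records one at a time from the packed stream."""
    while True:
        header = inp.read(8)
        if not header:
            return  # clean end of stream
        if len(header) < 8:
            raise ValueError("truncated record header")
        key_len, val_len = struct.unpack(">II", header)
        name = inp.read(key_len).decode("utf-8")
        data = inp.read(val_len)
        yield name, data


# Hypothetical input: 1,000 tiny in-memory "files".
small_files = [(f"file-{i:05d}.txt", f"contents {i}".encode()) for i in range(1000)]

container = io.BytesIO()  # stands in for the single large file on HDFS
pack(small_files, container)

container.seek(0)
records = list(stream(container))
print(len(records), records[0][0])
```

The point of the design is that the storage layer sees one large file instead of thousands of small ones, while a reader can still recover each original file by its key.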
  
  Small File Problem in MapReduce:
  The problem with MapReduce is that the number of mappers grows with the number of files. Normally a map task processes one block of input at a time. With many small files, each file becomes at least one input, so the number of map tasks rises with the number of files. This makes the job very slow.
  For example, if a client is processing 80,000 files, it needs 80,000 mappers.
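The 80,000-mapper example can be checked with a small calculation. This is a simplified model, assuming one map task per block and at least one per file, a 128 MB block size, and 100 KB per file (the file size is an assumption, it is not stated above). Real input formats have more knobs than this sketch covers.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size: 128 MB


def map_tasks(file_sizes, block_size=BLOCK_SIZE):
    """Count map tasks: one per block, and never fewer than one per file."""
    return sum(max(1, math.ceil(size / block_size)) for size in file_sizes)


# 80,000 small files of 100 KB each (hypothetical sizes).
small = [100 * 1024] * 80_000

# The same bytes stored as a single large file.
merged = [sum(small)]

print(map_tasks(small))   # prints 80000
print(map_tasks(merged))  # prints 62
```

Under these assumptions the same data needs 80,000 map tasks as small files but only 62 as one large file, which is where the per-task overhead comes from.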
  
- September 20, 2018 at 4:50 pm #5958
  
  DataFlair Team
  Spectator
  
  A file that is smaller than the block size (64 MB/128 MB) is referred to as a small file.
  
  Firstly, storing every file in HDFS requires about 150 bytes of metadata on the NameNode. Spending this space on a large number of small files is not recommended on the commercial hardware used for the NameNode, as it does not make the best use of that hardware.
  Secondly, whenever there are multiple requests to read files, disk seek time increases rapidly because there are many data blocks, and hopping from one DataNode to another consumes time.
  Thirdly, MapReduce creates at least one task per file. With a large number of small files, there will be many tasks in the queue and a lot of overhead.
- September 20, 2018 at 4:50 pm #5960
  
  DataFlair Team
  Spectator
  
  Very small files are not the ideal battleground for HDFS.
  The 3 main issues are the following:
  1) The client needs to interact with the master a huge number of times.
  2) Each node will hold too many file chunks, putting too much pressure on the network connection.
  3) The metadata will be very large, since the master has to save all the information for all the files.