Small File Problem in Hadoop

    • #5534
      DataFlair Team
      Spectator

      Can Hadoop handle small files efficiently? What happens when we store small files in Hadoop? What is the small file problem, and how can it be resolved?

    • #5536
      DataFlair Team
      Spectator

      Hadoop is not suited to small files. HDFS cannot efficiently support random reads of many small files because of its high-capacity design. A small file is one that is significantly smaller than the HDFS block size (128 MB by default). If we store a huge number of such files, HDFS cannot handle them well, because it was designed for a small number of large files holding large datasets, not for a large number of small files.

      The main issues with a large number of small files in Hadoop are:

      1) Every file in HDFS is represented as an object in the NameNode's memory (each object takes roughly 150 bytes). A large number of small files therefore consumes a lot of the master's memory, and scaling up in this fashion is not feasible.
      2) When there is a large number of files, reading them causes many disk seeks and frequent hopping from DataNode to DataNode, which increases file read/write time.
      Solutions

      HAR (Hadoop Archive) files were introduced to deal with the small file issue. HAR adds a layer on top of HDFS that provides an interface for accessing the archived files. A HAR file is created with the hadoop archive command, which runs a MapReduce job to pack the files being archived into a smaller number of HDFS files. Reading files inside a HAR is, however, no more efficient than reading them directly from HDFS; it is actually slower, because each file access requires reading two index files in addition to the data file itself.
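
      As a rough illustration (the directory paths and archive name below are made up), a HAR can be created from a directory of small files and then listed through the har:// URI scheme:

          hadoop archive -archiveName small-files.har -p /user/data/input /user/data/archive
          hdfs dfs -ls har:///user/data/archive/small-files.har

      Inside the resulting .har directory the data is stored as part files together with _index and _masterindex files, which is why every read involves the extra index lookups mentioned above.
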
      Sequence files also address the small file problem: the file name is used as the key and the file contents as the value. If we have 10,000 files of 100 KB each, we can write a program to put them into a single sequence file and then process them in a streaming fashion, as sketched below.
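
      A minimal sketch of such a program (the class name and paths are placeholders; it assumes the standard Hadoop SequenceFile API), packing a local directory of small files into one sequence file with the file name as the key and the raw file bytes as the value:

          import java.io.File;
          import java.nio.file.Files;

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.io.BytesWritable;
          import org.apache.hadoop.io.IOUtils;
          import org.apache.hadoop.io.SequenceFile;
          import org.apache.hadoop.io.Text;

          public class SmallFilesToSequenceFile {
              public static void main(String[] args) throws Exception {
                  File inputDir = new File(args[0]);   // local directory holding the small files
                  Path output = new Path(args[1]);     // destination sequence file on HDFS

                  Configuration conf = new Configuration();
                  SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                          SequenceFile.Writer.file(output),
                          SequenceFile.Writer.keyClass(Text.class),
                          SequenceFile.Writer.valueClass(BytesWritable.class));
                  try {
                      for (File f : inputDir.listFiles()) {
                          byte[] contents = Files.readAllBytes(f.toPath());
                          // key = file name, value = file contents
                          writer.append(new Text(f.getName()), new BytesWritable(contents));
                      }
                  } finally {
                      IOUtils.closeStream(writer);
                  }
              }
          }

      The resulting sequence file can then be fed to a MapReduce job and processed block by block, instead of launching one task per tiny file.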

    • #5538
      DataFlair Team
      Spectator

      HDFS, Hadoop's distributed file system, is designed mainly for large volumes of data, with a default block size of 128 MB. When reading many small files, the system keeps jumping from one DataNode to another to retrieve each file, which is slow.

      In a MapReduce program, each map task processes one block of input at a time. If the files are very small, the input to each task is tiny and there are many files, so a large number of map tasks is launched. For example, 10,000 files of 100 KB (about 1 GB in total) produce 10,000 map tasks, whereas the same data in a single file would need only about eight map tasks with a 128 MB block size.

      There are multiple solutions available for this problem:

      1) Consolidator
      2) Using HBase storage
      3) Sequence Files
      4) HAR
