How to resolve the small file problem in Hadoop HDFS?


  • Author
    Posts
    • #5574
      DataFlair Team
      Spectator

      The small file problem is a well-known issue in Hadoop. Hadoop is not optimized to handle/store small files, but if we want to store millions of small files, how do we handle them in HDFS?
      How can we resolve the small file problem in Hadoop?

    • #5575
      DataFlair Team
      Spectator

      Problems with HDFS:

      With lots of small files, you need to maintain lots of metadata in your Namenode, whose memory is limited in size.
      The more small files you have, the longer it takes to restart the cluster.
      Every file, directory, and block occupies about 150 bytes, as a rule of thumb. Hadoop is designed for streaming access to large files; with small files there is a lot of seeking and hopping between data nodes.
      Problems with MapReduce:

      A map task processes one block of input at a time. Lots of small files lead to just as many map tasks, which slows the cluster down.

      Solution:
      We group the small files into larger files. For that, we can use HDFS's sync(), write our own program, or use one of the following methods:

      1) HAR files: HAR builds a layered filesystem on top of HDFS. The hadoop archive command creates a HAR file by running a MapReduce job that packs the files being archived into a small number of HDFS files.

      2) HBase: It is a storage layer that packs small pieces of data into larger files, so it is a good option when the access pattern of the small files involves frequent random reads and writes.

      3) Sequence File: Here we use the file name as the key and the file contents as the value. We write a program to pack lots of small files into a single SequenceFile (a minimal sketch of such a program follows this list). SequenceFiles are splittable, so MapReduce can operate on each chunk independently and in parallel.

      4) Filecrush tool: It turns many small files into fewer larger files. It can also convert files from text format to SequenceFile format.
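      A minimal sketch of the packing program mentioned in option 3, assuming a Hadoop 2.x client on the classpath; the class name SmallFilePacker and the two path arguments are illustrative, not part of any standard tool:

      // Packs every small file under an input directory into one SequenceFile:
      // the file name becomes the key, the raw bytes become the value.
      import java.io.IOException;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FSDataInputStream;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.BytesWritable;
      import org.apache.hadoop.io.IOUtils;
      import org.apache.hadoop.io.SequenceFile;
      import org.apache.hadoop.io.Text;

      public class SmallFilePacker {
        public static void main(String[] args) throws IOException {
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);
          Path inputDir = new Path(args[0]);   // directory full of small files
          Path output   = new Path(args[1]);   // single SequenceFile to write

          SequenceFile.Writer writer = SequenceFile.createWriter(conf,
              SequenceFile.Writer.file(output),
              SequenceFile.Writer.keyClass(Text.class),
              SequenceFile.Writer.valueClass(BytesWritable.class),
              SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK));
          try {
            for (FileStatus status : fs.listStatus(inputDir)) {
              if (!status.isFile()) {
                continue;                      // skip sub-directories
              }
              byte[] contents = new byte[(int) status.getLen()];
              FSDataInputStream in = fs.open(status.getPath());
              try {
                in.readFully(0, contents);     // small file, so one read is enough
              } finally {
                in.close();
              }
              writer.append(new Text(status.getPath().getName()),
                            new BytesWritable(contents));
            }
          } finally {
            IOUtils.closeStream(writer);       // flushes the last compressed block
          }
        }
      }

      It could be run with something like hadoop jar packer.jar SmallFilePacker /user/data/small-files /user/data/packed.seq (paths illustrative). BLOCK compression is chosen because it compresses groups of records together, which suits many small values.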

    • #5576
      DataFlair Team
      Spectator

      Small files are a big problem in Hadoop. A file which is much smaller than the HDFS block size (64 MB/128 MB) is termed a small file.
      The Namenode stores the metadata of every file in memory, so if you are storing lots of small files, the Namenode has to maintain all of that metadata. Each file, directory, and block object occupies about 150 bytes, so on the order of ten million files would cost around 3 GB of memory.
      Though the Namenode keeps a persistent copy of the metadata on disk, it still needs to hold it in memory for fast retrieval.
      Small files also hamper MapReduce computation, because each small file ends up as its own input split and map task.
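      As a rough worked example, assuming each small file fits in a single block: 10,000,000 files give about 20,000,000 namespace objects (one file object plus one block object each), and 20,000,000 × 150 bytes is roughly 3 GB of Namenode heap.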

      Hadoop offers a few options to resolve small files problem:

      Hadoop Archive Files (HAR):
      A HAR file is created using the hadoop archive command, which runs a MapReduce job to pack the files being archived into a small number of HDFS files.
      Using HAR is a good idea, but reading through HAR files is slower than reading files directly in HDFS, because each access also has to go through the archive's index files.
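      As a rough illustration (the archive path below is hypothetical), an existing HAR can be listed through the har:// filesystem scheme like any other FileSystem:

      // Lists the contents of an (assumed) existing HAR archive via the
      // har:// scheme; HarFileSystem consults the archive's index files
      // before reaching the packed data, which is part of the read overhead.
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class ListHar {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          // Hypothetical archive created earlier with something like:
          //   hadoop archive -archiveName files.har -p /user/data /user/archive
          Path har = new Path("har:///user/archive/files.har");
          FileSystem harFs = har.getFileSystem(conf);  // resolves to HarFileSystem
          for (FileStatus status : harFs.listStatus(har)) {
            System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
          }
        }
      }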

      Sequence Files
      Another option is to use SequenceFiles, where you use the file name as the key and the file contents as the value.
      You can write a MapReduce program to convert lots of small files into a single SequenceFile.
      SequenceFiles are splittable, so MapReduce can break them into chunks and operate on each chunk independently.
      They also support block compression, which is usually the best option.
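      For completeness, a minimal sketch of reading such a packed SequenceFile back, assuming it was written with Text keys (file names) and BytesWritable values (file contents) as described above; the class name and argument layout are illustrative:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.BytesWritable;
      import org.apache.hadoop.io.SequenceFile;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.util.ReflectionUtils;

      public class SmallFileLister {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          Path input = new Path(args[0]);  // SequenceFile produced by the packing step
          SequenceFile.Reader reader =
              new SequenceFile.Reader(conf, SequenceFile.Reader.file(input));
          try {
            // The key and value classes are recorded in the file header.
            Text key = (Text) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            BytesWritable value =
                (BytesWritable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            while (reader.next(key, value)) {
              // key = original file name, value = original file contents
              System.out.println(key + "\t" + value.getLength() + " bytes");
            }
          } finally {
            reader.close();
          }
        }
      }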
