How will we create a single mapper for small files?

    • #6072
      DataFlair Team
      Spectator

      Earlier we discussed the small file problem in Hadoop. Now suppose we have 10 small files in HDFS; they would require 10 mappers to run. How do we configure a MapReduce job so that only one mapper runs instead of 10?

      The solution would be very helpful when we are ingesting data in real time using Flume, where tons of small files (each only a few KB in size) get created. If one mapper runs for each small file, we can imagine that thousands of mappers would be required. For just a few KB of data, a whole JVM will be initialized, which is a big overhead and will degrade MapReduce performance.

      On the other hand, if we combine these files (either by merging them or by creating a Hadoop archive (*.har)), the performance is very good. If I can't combine these files beforehand, is there any way to configure the MapReduce job so that multiple files are combined and only one mapper runs for them?

    • #6074
      DataFlair Team
      Spectator

      The small file problem can be solved by creating your own ExtendedCombineFileInputFormat, which extends CombineFileInputFormat.

      Using CombineFileInputFormat, we can process multiple small files (each smaller than the HDFS block size) in a single map task.
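
      A minimal sketch of such a class is given below, assuming the new mapreduce API, plain text input read as (LongWritable, Text) records, and Hadoop 2.x, where the CombineFileRecordReaderWrapper helper is available; for plain text files the built-in CombineTextInputFormat already provides essentially the same behaviour.

      import java.io.IOException;

      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.InputSplit;
      import org.apache.hadoop.mapreduce.RecordReader;
      import org.apache.hadoop.mapreduce.TaskAttemptContext;
      import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
      import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
      import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReaderWrapper;
      import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

      // Packs many small text files into a single split; each file inside the
      // split is still read line by line through an ordinary TextInputFormat.
      public class ExtendedCombineFileInputFormat
          extends CombineFileInputFormat<LongWritable, Text> {

        @Override
        public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
          // CombineFileRecordReader iterates over the files packed into the
          // combined split and delegates each one to the wrapper class below.
          return new CombineFileRecordReader<LongWritable, Text>(
              (CombineFileSplit) split, context, TextReaderWrapper.class);
        }

        // Adapts TextInputFormat's record reader to one file (identified by
        // idx) inside the CombineFileSplit.
        public static class TextReaderWrapper
            extends CombineFileRecordReaderWrapper<LongWritable, Text> {
          public TextReaderWrapper(CombineFileSplit split,
              TaskAttemptContext context, Integer idx)
              throws IOException, InterruptedException {
            super(new TextInputFormat(), split, context, idx);
          }
        }
      }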

    • #6076
      DataFlair Team
      Spectator

      In Hadoop, dealing with a large number of small files is important, as they create overhead for job performance. If each file is smaller than the HDFS block size, then the number of blocks created will be equal to the number of files.
      And if we run a MapReduce job across those files, a large number of mappers will be created, with a separate JVM initialized for each map task, which is a performance overhead.

      How will we create a single mapper for small files?

      To overcome this problem of a large number of small files, Hadoop provides an abstract class, CombineFileInputFormat.
      We have to create our own custom input format which extends CombineFileInputFormat.
      CombineFileInputFormat packs many files (each smaller than the HDFS block size) into a single split, providing more data for a map task to process.
      Hence, a single mapper can be used to process multiple small files.

      CombineFileInputFormat takes node and rack locality into account when deciding which blocks to place in the same split, so performance is not compromised.
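
      As an illustration, here is a minimal driver sketch that uses the built-in CombineTextInputFormat (a concrete subclass of CombineFileInputFormat shipped with Hadoop 2.x) so that many small text files are packed into each split; the class name SmallFilesDriver and the 128 MB split cap are example choices, not part of the original answer.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class SmallFilesDriver {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          Job job = Job.getInstance(conf, "combine small files"); // example job name
          job.setJarByClass(SmallFilesDriver.class);

          // Use CombineTextInputFormat instead of TextInputFormat so that
          // many small files are packed into each input split.
          job.setInputFormatClass(CombineTextInputFormat.class);

          // Cap the combined split size (128 MB here, an illustrative value);
          // without a cap, blocks from the same rack may all be combined
          // into one split.
          FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));

          // Set your own mapper/reducer classes here as in any other job;
          // with the default (identity) mapper, keys are byte offsets and
          // values are the text lines produced by CombineTextInputFormat.
          job.setOutputKeyClass(LongWritable.class);
          job.setOutputValueClass(Text.class);

          System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
      }

      The same cap can also be set through the mapreduce.input.fileinputformat.split.maxsize property, which is what FileInputFormat.setMaxInputSplitSize writes under the hood.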
