How will we create a single mapper for small files?

    • #6072
      DataFlair Team
      Spectator

      Earlier we discussed the small file problem in Hadoop. Now suppose we have 10 small files in HDFS; they would require 10 mappers to run. How do we configure a MapReduce job so that only one mapper runs instead of 10?

      The solution would be very helpful when we are ingesting data in real time using Flume, where tons of small files (each only a few KB in size) get created. If one mapper runs for each small file, we can imagine that thousands of mappers would be required. For just a few KB of data, a whole JVM will be initialized, which is a big overhead and will degrade MapReduce performance.

      On the other hand, if we combine these files (either by merging them or by creating a Hadoop archive (*.har)), the performance is very good. If I can't combine these files beforehand, is there any way to configure the MapReduce job so that multiple files are combined and only one mapper runs for them?

    • #6074
      DataFlair Team
      Spectator

      The small file problem can be solved by creating your own ExtendedCombineFileInputFormat, which extends CombineFileInputFormat.

      Using CombineFileInputFormat, we can process multiple small files (each smaller than the HDFS block size) in a single map task.
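
      A minimal sketch of such a class is given below, assuming the new mapreduce API, plain text input read as (LongWritable, Text) records, and Hadoop 2.x, where the CombineFileRecordReaderWrapper helper is available; for plain text files the built-in CombineTextInputFormat already provides essentially the same behaviour.

      import java.io.IOException;

      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.InputSplit;
      import org.apache.hadoop.mapreduce.RecordReader;
      import org.apache.hadoop.mapreduce.TaskAttemptContext;
      import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
      import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
      import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReaderWrapper;
      import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

      // Packs many small text files into a single split; each file inside the
      // split is still read line by line through an ordinary TextInputFormat.
      public class ExtendedCombineFileInputFormat
          extends CombineFileInputFormat<LongWritable, Text> {

        @Override
        public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
          // CombineFileRecordReader iterates over the files packed into the
          // combined split and delegates each one to the wrapper class below.
          return new CombineFileRecordReader<LongWritable, Text>(
              (CombineFileSplit) split, context, TextReaderWrapper.class);
        }

        // Adapts TextInputFormat's record reader to one file (identified by
        // idx) inside the CombineFileSplit.
        public static class TextReaderWrapper
            extends CombineFileRecordReaderWrapper<LongWritable, Text> {
          public TextReaderWrapper(CombineFileSplit split,
              TaskAttemptContext context, Integer idx)
              throws IOException, InterruptedException {
            super(new TextInputFormat(), split, context, idx);
          }
        }
      }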

    • #6076
      DataFlair Team
      Spectator

      In Hadoop, dealing with a large number of small files is important, as they create overhead for job performance. If each file is smaller than the HDFS block size, then the number of blocks created will be equal to the number of files.
      And if we run a MapReduce job across those files, a large number of mappers will be created, with a separate JVM initialized for each map task, which is a performance overhead.

      How will we create a single mapper for small files?

      To overcome this problem of a large number of small files, Hadoop provides an abstract class, CombineFileInputFormat.
      We have to create our own custom input format which extends CombineFileInputFormat.
      CombineFileInputFormat packs many files (each smaller than the HDFS block size) into a single split, providing more data for a map task to process.
      Hence, a single mapper can be used to process multiple small files.

      CombineFileInputFormat takes node and rack locality into account when deciding which blocks to place in the same split, so performance is not compromised.
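
      As an illustration, here is a minimal driver sketch that uses the built-in CombineTextInputFormat (a concrete subclass of CombineFileInputFormat shipped with Hadoop 2.x) so that many small text files are packed into each split; the class name SmallFilesDriver and the 128 MB split cap are example choices, not part of the original answer.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class SmallFilesDriver {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          Job job = Job.getInstance(conf, "combine small files"); // example job name
          job.setJarByClass(SmallFilesDriver.class);

          // Use CombineTextInputFormat instead of TextInputFormat so that
          // many small files are packed into each input split.
          job.setInputFormatClass(CombineTextInputFormat.class);

          // Cap the combined split size (128 MB here, an illustrative value);
          // without a cap, blocks from the same rack may all be combined
          // into one split.
          FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));

          // Set your own mapper/reducer classes here as in any other job;
          // with the default (identity) mapper, keys are byte offsets and
          // values are the text lines produced by CombineTextInputFormat.
          job.setOutputKeyClass(LongWritable.class);
          job.setOutputValueClass(Text.class);

          System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
      }

      The same cap can also be set through the mapreduce.input.fileinputformat.split.maxsize property, which is what FileInputFormat.setMaxInputSplitSize writes under the hood.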
