What is file splitting and how is the number of mappers decided?


    • #5298
      DataFlair Team
      Spectator

      What is file splitting, and how is the number of mappers decided?

    • #5300
      DataFlair Team
      Spectator

      In HDFS, when a client writes a large file, let us say of size 1 GB, it is broken up into blocks whose size is pre-defined through the parameter dfs.blocksize. If the block size is defined as 128 MB, then 1 GB / 128 MB = 8 blocks will be created and stored, along with their replicas, across multiple datanodes in the cluster.
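
      The block count is just the file size divided by the block size, rounded up. Here is a minimal sketch of that arithmetic (128 MB is only the common default and is configurable per cluster):

      public class BlockCount {
          public static void main(String[] args) {
              long blockSize = 128L * 1024 * 1024;    // dfs.blocksize, 128 MB here
              long fileSize  = 1024L * 1024 * 1024;   // the 1 GB file from above
              // Round up: a partial final block still occupies a block of its own
              long blocks = (fileSize + blockSize - 1) / blockSize;
              System.out.println(blocks + " blocks"); // prints "8 blocks"
          }
      }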

      Now, imagine there are two files: file1 of size 356 MB and file2 of size 260 MB. There will be 3 blocks (128 + 128 + 100) MB for file1 and 3 blocks (128 + 128 + 4) MB for file2. For a MapReduce job, each file is actually split into chunks (not necessarily the same size as a block) according to the split-size parameters and the InputFormat class. An InputSplit represents the data processed by an individual mapper once it is further divided into records.
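
      To see where this happens in the API: the job's InputFormat exposes getSplits(), and the framework launches one mapper per returned split. A minimal sketch using the standard new-API classes (the input paths are hypothetical):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

      public class SplitDemo {
          public static void main(String[] args) throws Exception {
              Job job = Job.getInstance(new Configuration(), "split-demo");
              job.setInputFormatClass(TextInputFormat.class);
              FileInputFormat.addInputPath(job, new Path("/data/file1")); // hypothetical
              FileInputFormat.addInputPath(job, new Path("/data/file2")); // hypothetical
              // One mapper is launched per InputSplit returned here:
              int mappers = new TextInputFormat().getSplits(job).size();
              System.out.println("Mappers: " + mappers);
          }
      }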

      File splitting happens based on file offsets. A split is a logical division of a file, where the record (line) at the end of a split is taken into the split whole, whereas a block is a physical division of a file, so a single record may span two blocks. Splits are used by a MapReduce job and control the number of mappers: the number of mappers equals the number of splits. Default values are used for the split size if the user has not defined the parameters mapred.min.split.size and mapred.max.split.size in mapred-site.xml.
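
      The same bounds can also be set per job through the FileInputFormat helpers. A sketch, where the 64 MB and 256 MB values are arbitrary examples:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

      public class SplitSizeConfig {
          public static void main(String[] args) throws Exception {
              Job job = Job.getInstance(new Configuration(), "split-size-demo");
              FileInputFormat.addInputPath(job, new Path("/data/input")); // hypothetical
              // Lower and upper bounds on the split size, in bytes (example values)
              FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);  // 64 MB
              FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024); // 256 MB
          }
      }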

      Thus, for the 1 GB file and the default split size of 128 MB, each split will exactly be an HDFS block. On the other hand, for the two files there will be 3 splits for file1 of size 356 MB (same as its blocks) but only 2 splits for file2 of size 260 MB (instead of 3, as for its blocks). This happens because the 3rd block of size 4 MB remains part of the 2nd split, as controlled by the constant SPLIT_SLOP, which is 1.1 by default: the last split is allowed to exceed the split size by up to 10%, since it is quite expensive to run a map task for just a few megabytes. If the dangling remainder exceeds even the SPLIT_SLOP allowance, a split extends past its boundary only far enough to accommodate the incomplete record; the rest goes into a new split, and a new mapper has to be initialized for it.
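
      The split arithmetic itself is simple enough to sketch. The following mirrors the well-known loop in FileInputFormat.getSplits(), simplified to plain offsets and a single split size:

      import java.util.ArrayList;
      import java.util.List;

      public class SplitSketch {
          private static final double SPLIT_SLOP = 1.1; // last split may be 10% oversized

          // Returns (offset, length) pairs for one file, mirroring the
          // simplified split loop of FileInputFormat.getSplits()
          static List<long[]> computeSplits(long fileSize, long splitSize) {
              List<long[]> splits = new ArrayList<>();
              long remaining = fileSize;
              while ((double) remaining / splitSize > SPLIT_SLOP) {
                  splits.add(new long[]{fileSize - remaining, splitSize});
                  remaining -= splitSize;
              }
              if (remaining > 0) { // the dangling tail becomes the (oversized) last split
                  splits.add(new long[]{fileSize - remaining, remaining});
              }
              return splits;
          }

          public static void main(String[] args) {
              long mb = 1024L * 1024;
              // file2 from the example: 260 MB with 128 MB splits -> 2 splits (128 + 132)
              System.out.println(computeSplits(260 * mb, 128 * mb).size() + " splits");
          }
      }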
