What is file splitting and how is the number of mappers decided?


    • #5298
      DataFlair Team
      Spectator

      What is file splitting, and how is the number of mappers decided?

    • #5300
      DataFlair Team
      Spectator

      In HDFS, when a client writes a large file, let us say of size 1 GB, it is broken up into blocks whose size is pre-defined through the parameter dfs.blocksize. If the block size is defined as 128 MB, then 1 GB / 128 MB = 8 blocks will be created and stored, along with their replicas, across multiple datanodes in the cluster.
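
      The block count is just the file size divided by the block size, rounded up. Here is a minimal sketch of that arithmetic (128 MB is only the common default and is configurable per cluster):

      public class BlockCount {
          public static void main(String[] args) {
              long blockSize = 128L * 1024 * 1024;    // dfs.blocksize, 128 MB here
              long fileSize  = 1024L * 1024 * 1024;   // the 1 GB file from above
              // Round up: a partial final block still occupies a block of its own
              long blocks = (fileSize + blockSize - 1) / blockSize;
              System.out.println(blocks + " blocks"); // prints "8 blocks"
          }
      }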

      Now, imagine there are two files: file1 of size 356 MB and file2 of size 260 MB. There will be 3 blocks (128 + 128 + 100) MB for file1 and 3 blocks (128 + 128 + 4) MB for file2. For a MapReduce job, each file is actually split into chunks (not necessarily the same size as a block) according to the split-size parameters and the InputFormat class. An InputSplit represents the data processed by an individual mapper once it is further divided into records.
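
      To see where this happens in the API: the job's InputFormat exposes getSplits(), and the framework launches one mapper per returned split. A minimal sketch using the standard new-API classes (the input paths are hypothetical):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

      public class SplitDemo {
          public static void main(String[] args) throws Exception {
              Job job = Job.getInstance(new Configuration(), "split-demo");
              job.setInputFormatClass(TextInputFormat.class);
              FileInputFormat.addInputPath(job, new Path("/data/file1")); // hypothetical
              FileInputFormat.addInputPath(job, new Path("/data/file2")); // hypothetical
              // One mapper is launched per InputSplit returned here:
              int mappers = new TextInputFormat().getSplits(job).size();
              System.out.println("Mappers: " + mappers);
          }
      }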

      File splitting happens based on file offsets. A split is a logical division of a file, where the record (line) at the end of a split is taken into the split whole, whereas a block is a physical division of a file, so a single record may span two blocks. Splits are used by a MapReduce job and control the number of mappers: the number of mappers equals the number of splits. Default values are used for the split size if the user has not defined the parameters mapred.min.split.size and mapred.max.split.size in mapred-site.xml.
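
      The same bounds can also be set per job through the FileInputFormat helpers. A sketch, where the 64 MB and 256 MB values are arbitrary examples:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

      public class SplitSizeConfig {
          public static void main(String[] args) throws Exception {
              Job job = Job.getInstance(new Configuration(), "split-size-demo");
              FileInputFormat.addInputPath(job, new Path("/data/input")); // hypothetical
              // Lower and upper bounds on the split size, in bytes (example values)
              FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);  // 64 MB
              FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024); // 256 MB
          }
      }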

      Thus, for the 1 GB file and the default split size of 128 MB, each split will exactly be an HDFS block. On the other hand, for the two files there will be 3 splits for file1 of size 356 MB (same as its blocks) but only 2 splits for file2 of size 260 MB (instead of 3, as for its blocks). This happens because the 3rd block of size 4 MB remains part of the 2nd split, as controlled by the constant SPLIT_SLOP, which is 1.1 by default: the last split is allowed to exceed the split size by up to 10%, since it is quite expensive to run a map task for just a few megabytes. If the dangling remainder exceeds even the SPLIT_SLOP allowance, a split extends past its boundary only far enough to accommodate the incomplete record; the rest goes into a new split, and a new mapper has to be initialized for it.
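
      The split arithmetic itself is simple enough to sketch. The following mirrors the well-known loop in FileInputFormat.getSplits(), simplified to plain offsets and a single split size:

      import java.util.ArrayList;
      import java.util.List;

      public class SplitSketch {
          private static final double SPLIT_SLOP = 1.1; // last split may be 10% oversized

          // Returns (offset, length) pairs for one file, mirroring the
          // simplified split loop of FileInputFormat.getSplits()
          static List<long[]> computeSplits(long fileSize, long splitSize) {
              List<long[]> splits = new ArrayList<>();
              long remaining = fileSize;
              while ((double) remaining / splitSize > SPLIT_SLOP) {
                  splits.add(new long[]{fileSize - remaining, splitSize});
                  remaining -= splitSize;
              }
              if (remaining > 0) { // the dangling tail becomes the (oversized) last split
                  splits.add(new long[]{fileSize - remaining, remaining});
              }
              return splits;
          }

          public static void main(String[] args) {
              long mb = 1024L * 1024;
              // file2 from the example: 260 MB with 128 MB splits -> 2 splits (128 + 132)
              System.out.println(computeSplits(260 * mb, 128 * mb).size() + " splits");
          }
      }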
