How to calculate number of mappers in Hadoop?


    • #6280
      DataFlair Team
      Spectator

      How do we set the number of mappers for a MapReduce job?
      When we submit a MapReduce job, how many map tasks run in Hadoop?
      How many mappers run for a MapReduce job in Hadoop?

    • #6282
      DataFlair Team
      Spectator

      Let us first understand what a block is.
      Blocks: Blocks are nothing but the physical divisions of the actual data. The actual data is split into a number of blocks, and each block has the same size. By default, the size of each block is either 128 MB or 64 MB, depending on the Hadoop version. These blocks are stored in a distributed manner on the data nodes.

      Benefits of distributed blocks in terms of processing:
      Since blocks are stored in a distributed manner, Hadoop can perform operations on these blocks in parallel. Parallel operations on blocks help to reduce the data processing time.

      Now we are clear about blocks and the advantage we gain from distributing them. Let's have some more insight into how MapReduce processes these blocks.
      MapReduce, as the name indicates, is a combination of the words Map and Reduce. In the MapReduce framework, map and reduce are functions. These functions are also called the Mapper and Reducer functions.

      Now we will concentrate on the Mapper and its role.
      The Mapper, i.e. the map function, is used to perform the custom operation defined by the client on the data. Since data in HDFS is divided into blocks and stored in a distributed manner, the mapper function (map task) is called on each block, so the map tasks run in parallel on the distributed blocks.
      It simply means that the count of map tasks (mappers) is equal to the number of blocks needed to process the data:
      No. of map tasks = Data size / Block size

      Let's understand this with an example.
      Suppose 1 GB (1024 MB) of data needs to be stored and processed by Hadoop.
      While storing this 1 GB of data in HDFS, Hadoop will split it into smaller chunks. Consider that the Hadoop system has the default 128 MB block size.
      Then Hadoop will store the 1 GB of data in 8 blocks (1024 / 128 = 8). So, to process these 8 blocks, i.e. 1 GB of data, 8 mappers are required.
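
      The same arithmetic in a minimal, self-contained Java sketch (plain Java, no Hadoop API; the sizes are the illustrative values from this example):

      public class MapperCount {
          public static void main(String[] args) {
              long mb = 1024L * 1024;
              long dataSize = 1024 * mb;  // 1 GB of input data
              long blockSize = 128 * mb;  // default HDFS block size
              // No. of map tasks = data size / block size (rounded up)
              long mappers = (dataSize + blockSize - 1) / blockSize;
              System.out.println(mappers); // prints 8
          }
      }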

      Follow the link for more detail: Mappers in Hadoop

    • #6284
      DataFlair Team
      Spectator

      Calculate number of Mappers in Hadoop

      Firstly, it depends on whether the files can be split by Hadoop (splittable) or not. Most files can be split. Examples of files that cannot be split are some zipped or compressed files.

      Splittable files

      1) Calculate the total size of the input files by adding up the sizes of all the files.

      2) No. of Mappers = Total size calculated / Input split size defined in the Hadoop configuration (*NOTE 1*)

      e.g.

      Total size calculated = 1 GB (1024 MB)

      Input split size = 128 MB

      No. of Mappers = 8 (1024 / 128)

      (*NOTE 1*)

      FileInputFormat is the base class for all implementations of InputFormat that use files as their data source.

      Properties for controlling the split size are:

      mapreduce.input.fileinputformat.split.minsize – default value: 1 byte

      mapreduce.input.fileinputformat.split.maxsize – default value: Long.MAX_VALUE (about 8192 PB)

      dfs.blocksize – default value: 128 MB

      splitSize = max(minimumSize, min(maximumSize, blockSize))

      Usually minsize < blocksize < maxsize, so splitsize = blocksize:

      1 byte < 128 MB < 8192 PB =====> splitsize = 128 MB

      For example, if maxsize = 64 MB and blocksize = 128 MB,

      then splitsize will be limited to maxsize:

      minsize < maxsize < blocksize, so splitsize = maxsize

      1 byte < 64 MB < 128 MB =====> splitsize = 64 MB
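
      A minimal Java sketch of the split-size rule above (the sizes are the illustrative defaults from this note, not values read from a real cluster):

      public class SplitSizeDemo {
          // splitSize = max(minimumSize, min(maximumSize, blockSize))
          static long splitSize(long minSize, long maxSize, long blockSize) {
              return Math.max(minSize, Math.min(maxSize, blockSize));
          }

          public static void main(String[] args) {
              long mb = 1024L * 1024;
              // Defaults: minsize = 1 byte, maxsize = Long.MAX_VALUE, blocksize = 128 MB
              System.out.println(splitSize(1, Long.MAX_VALUE, 128 * mb) / mb); // prints 128
              // maxsize lowered to 64 MB: the split size is capped at maxsize
              System.out.println(splitSize(1, 64 * mb, 128 * mb) / mb);        // prints 64
          }
      }

      In a driver program, these sizes can be changed with FileInputFormat.setMinInputSplitSize(job, n) and FileInputFormat.setMaxInputSplitSize(job, n), or by setting the mapreduce.input.fileinputformat.split.* properties listed above.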

      Non-Splittable files

      1) No. of Mappers = Number of input files (*NOTE 2*)

      (*NOTE 2*)

      If the size of a non-splittable file is very large, it can become a bottleneck for the whole MapReduce job, because a single mapper must process the entire file.
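
      A minimal sketch of how the one-mapper-per-file case arises on the application side, assuming a custom input format (the class name WholeFileTextInputFormat is hypothetical) that overrides isSplitable to return false, so each input file gets exactly one split and therefore one mapper:

      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapreduce.JobContext;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

      // Never split input files: one input split (and one mapper) per file
      public class WholeFileTextInputFormat extends TextInputFormat {
          @Override
          protected boolean isSplitable(JobContext context, Path file) {
              return false;
          }
      }

      It would be registered in the driver with job.setInputFormatClass(WholeFileTextInputFormat.class).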

      Follow the link for more detail: Mappers in Hadoop

    • #6286
      DataFlair Team
      Spectator

      It depends on the number of files and on the size of each file individually.
      Assuming the files are configured to be split (i.e. the default behaviour):
      Calculate the number of splits per file using the 128 MB (default) split size. Two files of 130 MB each will give four input splits, not three, because splits are calculated per file and never span file boundaries. The total number of splits calculated this way is the number of mappers Hadoop runs for the job; a simplified sketch of this calculation follows below.
      If the file-splitting behaviour is changed to disable splitting, then there is one mapper per file.
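
      A simplified Java sketch of the per-file calculation above (it assumes each file yields ceil(fileSize / splitSize) splits; the 130 MB sizes are the illustrative values from this reply):

      public class PerFileSplitCount {
          public static void main(String[] args) {
              long mb = 1024L * 1024;
              long splitSize = 128 * mb;               // default split/block size
              long[] fileSizes = {130 * mb, 130 * mb}; // two 130 MB files
              long mappers = 0;
              for (long size : fileSizes) {
                  // splits are counted per file; a split never spans two files
                  mappers += (size + splitSize - 1) / splitSize;
              }
              System.out.println(mappers); // prints 4, not 3
          }
      }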
