Forums › Apache Hadoop › How to calculate number of mappers in Hadoop?
This topic has 3 replies, 1 voice, and was last updated 5 years, 7 months ago by DataFlair Team.
September 20, 2018 at 5:35 pm (#6280) DataFlair Team, Spectator
How to set the number of mappers for a MapReduce job?
When we submit a MapReduce job, how many map tasks run in Hadoop?
How many Mappers run for a MapReduce job in Hadoop?
September 20, 2018 at 5:35 pm (#6282) DataFlair Team, Spectator
Let's first understand what a block is.

Blocks: Blocks are the physical division of the actual data. The data is split into a number of blocks, and each block has the same size. By default, the size of each block is either 128 MB or 64 MB, depending on the Hadoop version. These blocks are stored in a distributed manner on the DataNodes.

Benefits of distributed blocks in terms of processing: since blocks are stored in a distributed manner, Hadoop can operate on them in parallel. Parallel operations on blocks help reduce data processing time.

Now that we are clear about blocks and the advantage we gain from distributing them, let's look at how MapReduce processes blocks.

MapReduce, as the name indicates, is a combination of the words Map and Reduce. In the MapReduce framework, map and reduce are functions, also called the Mapper and Reducer functions. Here we will focus on the Mapper and its role.

The Mapper, i.e. the map function, performs a custom operation defined by the client on the data. Since data in HDFS is divided into blocks and stored in a distributed manner, the map function (map task) is invoked on each block, so the map tasks run in parallel across the distributed blocks.

This simply means that the number of map tasks (mappers) equals the number of blocks needed to hold the data being processed.
No. of map task = Data size / block size.
Let’s understand this with example.
Suppose there is 1 GB (1024 MB) of data that needs to be stored and processed by Hadoop.
While storing the 1 GB of data in HDFS, Hadoop will split it into smaller chunks. Suppose the Hadoop system uses the default 128 MB split size.
Then Hadoop will store the 1 GB of data in 8 blocks (1024 / 128 = 8). So, to process these 8 blocks, i.e. the 1 GB of data, 8 mappers are required.

Follow the link for more detail: Mappers in Hadoop
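The arithmetic above can be sketched as a short calculation (a minimal illustration; the 1 GB and 128 MB figures come from the example, and `ceil` covers data sizes that are not exact multiples of the block size):

```python
import math

def num_mappers(data_size_mb, block_size_mb=128):
    # One map task per block; round up so a partial final block still gets a mapper.
    return math.ceil(data_size_mb / block_size_mb)

print(num_mappers(1024))  # 1 GB of data / 128 MB blocks -> 8 mappers
print(num_mappers(130))   # 130 MB -> 2 blocks -> 2 mappers
```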
September 20, 2018 at 5:36 pm (#6284) DataFlair Team, Spectator
Calculate number of Mappers in Hadoop
First, it depends on whether the files can be split by Hadoop (i.e., are splittable) or not. Most files can be split; examples of files that cannot be split are some zipped or compressed files (e.g., gzip).
Splittable files
1) Calculate the total size of the input by adding up the sizes of all the files.
2) No. of Mappers = Total size calculated / Input split size defined in the Hadoop configuration (*NOTE 1*)
e.g.
Total size calculated = 1 GB (1024 MB)
Input split size = 128 MB
No. of Mappers = 1024 / 128 = 8
(*NOTE 1*)
FileInputFormat is the base class for all implementations of InputFormat that use files as their data source.
Properties for controlling split size are
mapreduce.input.fileinputformat.split.minsize – default value: 1 byte
mapreduce.input.fileinputformat.split.maxsize – default value: Long.MAX_VALUE (about 8192 PB)
dfs.blocksize – default value: 128 MB
splitSize = max(minimumSize, min(maximumSize, blockSize))
Usually minsize < blocksize < maxsize, so splitsize = blocksize:
1 byte < 128 MB < 8192 PB =====> splitsize = 128 MB
For example, if maxsize = 64 MB and blocksize = 128 MB, then the split size is limited by maxsize:
minsize < maxsize < blocksize, so splitsize = maxsize
1 byte < 64 MB < 128 MB =====> splitsize = 64 MB
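The split-size rule above can be sketched directly (a minimal illustration of the max/min formula; the default values are those listed in NOTE 1, with Long.MAX_VALUE standing in for the maxsize default):

```python
def split_size(min_size, max_size, block_size):
    # FileInputFormat's rule: max(minimumSize, min(maximumSize, blockSize))
    return max(min_size, min(max_size, block_size))

MB = 1024 * 1024

# Defaults: minsize = 1 byte, maxsize = Long.MAX_VALUE, blocksize = 128 MB
print(split_size(1, 2**63 - 1, 128 * MB) // MB)  # 128 (splitsize = blocksize)

# With maxsize lowered to 64 MB, the split size is capped at maxsize
print(split_size(1, 64 * MB, 128 * MB) // MB)    # 64
```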
Non-Splittable files
1) No. of Mappers = Number of input files (*NOTE 2*)
(*NOTE 2*)
If a non-splittable file is very large, the single mapper processing it can become a bottleneck for the whole MapReduce job.
Follow the link for more detail: Mappers in Hadoop
September 20, 2018 at 5:36 pm (#6286) DataFlair Team, Spectator
It depends on the number of files and the size of each file individually.
Assuming the files are configured to be split (the default behavior):
Calculate the number of input splits by dividing each file on the 128 MB (default) boundary. Two files of 130 MB each will produce four input splits, not three, because a split never spans two files. The total number of splits calculated this way is the number of Mappers Hadoop runs for the job.
If splitting is disabled, then there is one mapper per file.
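Both cases can be sketched together (a small illustration; the two 130 MB files come from the example above, and the function name is hypothetical):

```python
import math

SPLIT_SIZE_MB = 128  # default input split size

def mappers_for(file_sizes_mb, splittable=True):
    # Splits never span files, so each file is counted separately.
    if not splittable:
        return len(file_sizes_mb)  # one mapper per non-splittable file
    return sum(math.ceil(size / SPLIT_SIZE_MB) for size in file_sizes_mb)

print(mappers_for([130, 130]))                    # 2 + 2 = 4 splits -> 4 mappers
print(mappers_for([130, 130], splittable=False))  # 2 files -> 2 mappers
```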