How to optimize MapReduce jobs in Hadoop

Viewing 1 reply thread
  • Author
    Posts
    • #5428
      DataFlair Team
      Spectator

      How to fine tune / improve the performance of MapReduce jobs?
      What are Hadoop performance tuning tools?
      What are the techniques / best practices to optimize the MapReduce job?

    • #5429
      DataFlair Team
      Spectator

      We must consider the following points when tuning MapReduce jobs:

      1) Memory Tuning- To get maximum performance from a Hadoop job, tune the memory configuration parameters by monitoring memory usage on the servers. Swap usage can be monitored with software like Ganglia, Nagios, or Cloudera Manager; other necessary metrics such as CPU, disk, and network information can be obtained from the job itself. Whenever there is excess swap utilization, optimize memory usage by reducing the amount of RAM allotted to each task through the mapred.child.java.opts property.
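      The mapred.child.java.opts property is set in mapred-site.xml; a minimal sketch, with an illustrative heap size that must be matched to your nodes:

```xml
<!-- mapred-site.xml: per-task JVM options; the 512 MB heap below is
     illustrative only - size it from the swap/memory monitoring above -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
</property>
```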

      2) Improving IO Performance- The mount points for DataNode data directories should be configured with the noatime option, so that the filesystem does not update access-time metadata every time the data is read.
      It is also recommended not to use LVM or RAID on DataNode machines, as they reduce performance.
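      The mount option meant here is noatime; an illustrative /etc/fstab entry for a DataNode data disk (device and mount point are assumptions):

```
# /etc/fstab: mount a DataNode data directory with noatime so reads
# do not trigger access-time metadata writes (device/path illustrative)
/dev/sdb1  /data/1  ext4  defaults,noatime  0 0
```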

      3) Minimizing Or Compressing Map Output- Disk IO is one of the major performance bottlenecks; here are ways to minimize it:
      Ensure that the mapper for your MapReduce job uses no more than 70% of heap memory. When the map output is very large, reduce the intermediate data size using compression codecs like LZO, bzip2, or Snappy.
      Map output is not compressed by default; to enable compression of map output, set mapreduce.map.output.compress to true.
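      A sketch of the corresponding configuration, assuming mapred-site.xml and the Snappy codec (the codec choice is illustrative; LZO or bzip2 work the same way):

```xml
<!-- mapred-site.xml: compress intermediate map output before the shuffle -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```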

      4) Tune Number Of Mappers Or Reducers-
      If the MapReduce job has more than 1 terabyte of input, increase the block size of the input dataset to 256M or 512M so that the number of tasks is smaller. The block size of existing files cannot be changed in place: set the dfs.block.size property, re-copy the data into HDFS at the new block size, and then remove the original copy.
      If the MapReduce job launches many map tasks that each complete in just a few seconds, reducing the number of maps launched for that application - without changing the cluster-wide configuration - will help optimize performance.
      As a rule of thumb, if a task takes less than 30 seconds to execute, it is better to reduce the number of tasks.
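      A sketch of the block-size setting, assuming hdfs-site.xml (recent Hadoop releases use the name dfs.blocksize; older ones use the deprecated dfs.block.size); this affects newly written files only, so existing data must be re-copied:

```xml
<!-- hdfs-site.xml: 256 MB block size for newly written files
     (268435456 = 256 * 1024 * 1024 bytes) -->
<property>
  <name>dfs.blocksize</name>
  <value>268435456</value>
</property>
```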

      5) Combiner-
      If the job performs a large shuffle in which the map output is several GBs per node, writing a combiner can help optimize performance. A combiner acts as a local optimizer for the MapReduce job: it runs on the output of the map phase to reduce the number of intermediate records passed to the reducers, which lightens the load on the reduce tasks.
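      In Hadoop itself a combiner is registered on the job (typically job.setCombinerClass(...), reusing the reducer when the reduce function is commutative and associative). The pre-aggregation it performs can be sketched in plain Java with no Hadoop dependency - the word list and class name below are illustrative:

```java
import java.util.*;

// Sketch of what a combiner does: pre-aggregate the (word, 1) pairs emitted
// by one mapper so far fewer records cross the network to the reducers.
public class CombinerSketch {
    // Collapse a mapper's intermediate (key, count) pairs into one pair per key.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> e : mapOutput) {
            combined.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        // One mapper emits a (word, 1) pair per word, as in word count.
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        for (String w : "to be or not to be".split(" ")) {
            mapOutput.add(Map.entry(w, 1));
        }
        Map<String, Integer> combined = combine(mapOutput);
        // 6 intermediate records shrink to 4; only these reach the reducers.
        System.out.println(mapOutput.size() + " -> " + combined.size()); // prints 6 -> 4
        System.out.println(combined);
    }
}
```

      On real data the shrink factor is much larger - millions of (word, 1) pairs per mapper collapse to one pair per distinct word, which is exactly the shuffle-volume reduction the tip describes.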

      6) Using Skewed Joins-
      If a huge amount of data shares a single key, one of the reducers will be held up processing the majority of the data; this is where a skewed join comes to the rescue. A skewed join computes a histogram of the join key to find out which keys are dominant, and the data for those keys is then split across multiple reducers to achieve optimal performance.
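      The histogram-based skewed join described here is the one provided by Apache Pig; a minimal Pig Latin sketch, where the relation and field names are illustrative:

```
-- Pig Latin: Pig samples the left input, builds a histogram of the join
-- key, and spreads records for hot keys across several reducers.
big    = LOAD 'clicks' AS (user_id:chararray, url:chararray);
small  = LOAD 'users'  AS (user_id:chararray, country:chararray);
joined = JOIN big BY user_id, small BY user_id USING 'skewed';
```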

      7) Speculative Execution-
      The performance of MapReduce jobs is seriously impacted when tasks take a long time to finish execution. Speculative execution is a common approach to this problem: slow tasks are backed up by duplicate attempts on alternate machines. Setting the configuration parameters mapreduce.map.speculative and mapreduce.reduce.speculative to true enables speculative execution, so that job execution time is reduced when a task's progress is slow - for example, due to memory contention on its node.
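      In Hadoop 2.x these switches are named mapreduce.map.speculative and mapreduce.reduce.speculative (both default to true); a sketch of the setting in mapred-site.xml:

```xml
<!-- mapred-site.xml: launch backup attempts for straggling tasks -->
<property>
  <name>mapreduce.map.speculative</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.reduce.speculative</name>
  <value>true</value>
</property>
```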
