- 1. Objective
- 2. Hadoop MapReduce Performance Tuning
- 3. 7 Tips for Hadoop Performance Tuning
- 3.1. Tuning Hadoop Run-time Parameters
- 3.2. Tuning Application Specific Performance
- 5. Conclusion
Performance tuning helps you get the most out of a Hadoop cluster. This tutorial on Hadoop MapReduce performance tuning describes ways to improve cluster performance and get better results from your Hadoop programs. It covers seven topics: memory tuning, map disk spill, mapper task tuning, minimizing mapper output, balancing reducer load, combiners, and speculative execution.
2. Hadoop MapReduce Performance Tuning
Hadoop performance tuning is an iterative process. Repeat the cycle below until the job reaches the performance you need.
Run Job -> Identify Bottleneck -> Address Bottleneck.
First run the Hadoop job, then identify its bottlenecks, and then address them using the methods below. Repeat these steps until the desired level of performance is achieved.
3. 7 Tips for Hadoop Performance Tuning
Here we discuss ways to improve the performance of Hadoop MapReduce. They fall into two categories:
- Hadoop run-time parameters based performance tuning.
- Hadoop application-specific performance tuning.
Let’s discuss how to improve the performance of a Hadoop cluster under each of these two categories.
3.1. Tuning Hadoop Run-time Parameters
Hadoop provides many options for tuning CPU, memory, disk, and network usage. Most Hadoop tasks are not CPU-bound, so the main concerns are optimizing memory usage and minimizing disk spills.
3.1.1. Memory Tuning
The general rule for memory tuning is: use as much memory as you can without triggering swapping. The parameter for task memory is mapred.child.java.opts, which you can set in your configuration file. In Hadoop 2 and later, mapreduce.map.java.opts and mapreduce.reduce.java.opts set the same options separately for map and reduce tasks.
You can also monitor memory usage on the servers with Ganglia, Cloudera Manager, or Nagios to confirm that the nodes are not swapping.
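As a rough illustration of the rule above, the following Python sketch estimates a per-task heap size that should keep a node out of swap. The node size, the memory reserved for the OS and Hadoop daemons, and the number of concurrent tasks are hypothetical inputs, and the 20% allowance for non-heap JVM memory is an assumption, not a Hadoop default.

```python
def task_heap_mb(node_ram_mb, reserved_mb, concurrent_tasks, jvm_overhead=0.20):
    """Estimate a per-task -Xmx value (in MB) that keeps the node out of swap.

    reserved_mb covers the OS and the Hadoop daemons running on the node.
    jvm_overhead is an assumed fraction of each task's memory used outside the heap.
    """
    if concurrent_tasks <= 0:
        raise ValueError("concurrent_tasks must be positive")
    available_mb = node_ram_mb - reserved_mb
    if available_mb <= 0:
        raise ValueError("reserved memory exceeds node memory")
    per_task_mb = available_mb // concurrent_tasks
    return int(per_task_mb * (1 - jvm_overhead))


def child_java_opts(heap_mb):
    """Format the heap size as a value for mapred.child.java.opts."""
    return f"-Xmx{heap_mb}m"


# Hypothetical node: 64 GB RAM, 8 GB reserved, 14 tasks running at once.
heap = task_heap_mb(node_ram_mb=65536, reserved_mb=8192, concurrent_tasks=14)
print(child_java_opts(heap))
```

Treat the result as a starting point only, and confirm with your monitoring tool that the nodes stay out of swap under real load.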
3.1.2. Minimize the Map Disk Spill
Disk IO is usually the performance bottleneck in Hadoop. There are several parameters you can tune to minimize spilling, such as:
- Compression of mapper output
- Usage of 70% of the mapper's heap memory for the spill buffer
But is frequent spilling acceptable?
It is strongly recommended not to spill more than once: if you spill more than once, all the spilled data must be re-read and re-written during the merge, which is 3x the IO.
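The cost of spilling more than once can be estimated with a small Python sketch. This is a simplified model, not Hadoop's actual accounting: it assumes a spill is triggered whenever the sort buffer reaches a fixed fraction of its size (70% here, following the figure above), and it ignores multi-pass merges and compression. The function names and sample sizes are hypothetical.

```python
import math


def estimate_spills(map_output_mb, sort_buffer_mb, spill_fraction=0.70):
    """Estimate how many times one map task spills to disk.

    A spill starts when the sort buffer reaches spill_fraction of its size,
    so each spill holds roughly sort_buffer_mb * spill_fraction of output.
    """
    if map_output_mb <= 0:
        return 0
    spill_capacity_mb = sort_buffer_mb * spill_fraction
    return math.ceil(map_output_mb / spill_capacity_mb)


def disk_io_mb(map_output_mb, spills):
    """Approximate map-side disk traffic for the given number of spills.

    One spill: the output is written once.
    More than one: the spill files are written, read back, and rewritten
    as a single merged file, about three times the output size.
    """
    if spills <= 1:
        return map_output_mb
    return 3 * map_output_mb


# Hypothetical map task producing 150 MB of output.
for buffer_mb in (100, 256):
    spills = estimate_spills(150, buffer_mb)
    print(buffer_mb, "MB buffer:", spills, "spill(s),", disk_io_mb(150, spills), "MB of disk IO")
```

In this model, enlarging the sort buffer so that the whole map output fits in a single spill cuts map-side disk traffic to a third.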
3.1.3. Tuning Mapper Tasks
The number of mapper tasks is set implicitly, unlike the number of reducer tasks. The most common way to tune mappers is to control the number of mappers and the size of the input each one processes. When dealing with large files, Hadoop splits the file into smaller chunks so that mappers can process them in parallel. However, initializing a new mapper task usually takes a few seconds, which is overhead to be minimized. Below are some suggestions:
- Reuse the task JVM.
- Aim for map tasks that run 1-3 minutes each. If the average mapper running time is less than one minute, increase mapred.min.split.size so that fewer mappers are allocated, which reduces the mapper initialization overhead.
- Use CombineFileInputFormat for a large number of small files.
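To see how the minimum split size affects the number of mappers, here is a minimal Python sketch of the split-size rule used by FileInputFormat, max(minSize, min(maxSize, blockSize)). The file and block sizes are hypothetical, and the estimate ignores details such as a slightly larger final split and non-splittable compressed files.

```python
import math


def split_size(block_size, min_split, max_split):
    """Split size rule: max(min_split, min(max_split, block_size))."""
    return max(min_split, min(max_split, block_size))


def estimated_mappers(file_size, block_size, min_split=1, max_split=2**63 - 1):
    """Approximate the number of map tasks for one splittable file."""
    if file_size <= 0:
        return 0
    return math.ceil(file_size / split_size(block_size, min_split, max_split))


MB = 1024 * 1024
file_size = 10 * 1024 * MB   # hypothetical 10 GB input file
block = 128 * MB             # hypothetical HDFS block size

# With the default minimum, the split size equals the block size.
print(estimated_mappers(file_size, block))
# Raising the minimum split size to 512 MB gives larger splits and fewer mappers.
print(estimated_mappers(file_size, block, min_split=512 * MB))
```

Fewer, longer-running mappers spend proportionally less time on task startup, which is the effect the 1-3 minute guideline aims for.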
3.2. Tuning Application Specific Performance
Let’s now discuss tips for improving application-specific performance in Hadoop.
3.2.1. Minimize your Mapper Output
Minimizing the mapper output can improve overall performance considerably, because mapper output affects disk IO, network IO, and memory usage in the shuffle phase.
To achieve this, below are some suggestions:
- Filter the records on the mapper side instead of the reducer side.
- Use minimal data to form your map output key and map output value in MapReduce.
- Compress mapper output
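The first two suggestions can be sketched as a mapper written in Python, in the style of a Hadoop Streaming script. The record layout ("user_id,status,bytes_sent,user_agent") and the filtering rule are hypothetical; the point is only that unwanted records are dropped and unused fields are discarded before anything reaches the shuffle.

```python
def mapper(lines):
    """Emit (user_id, bytes_sent) only for successful requests.

    Input lines are assumed to look like "user_id,status,bytes_sent,user_agent".
    """
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if len(fields) < 3:
            continue                      # skip malformed records
        user_id, status, bytes_sent = fields[0], fields[1], fields[2]
        if status != "200":
            continue                      # filter on the map side, not in the reducer
        yield user_id, int(bytes_sent)    # keep only the fields the reducer needs


sample = [
    "u1,200,512,Mozilla/5.0",
    "u2,404,0,curl/7.68",
    "u1,200,2048,Mozilla/5.0",
    "bad-line",
]
print(list(mapper(sample)))
```

Of the four input records, only two small key-value pairs are emitted, so less data is sorted, spilled, and sent over the network.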
3.2.2. Balancing Reducer Load
Unbalanced reducer tasks create another performance issue: some reducers receive most of the mapper output and run far longer than the others.
Below are methods to balance the load:
- Implement a better hash function in Partitioner class.
- Write a preprocessing job to separate keys using MultipleOutputs. Then use another MapReduce job to process the special keys that cause the problem.
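The effect of the partitioning function on reducer load can be checked offline before changing the Partitioner. The Python sketch below compares two hypothetical partition functions on a made-up key set: one that looks only at the first character of the key, and one that hashes every character in the style of Java's String.hashCode. Neither is taken from Hadoop's source; they only illustrate how to measure skew.

```python
from collections import Counter


def first_char_partition(key, num_reducers):
    """A poor partitioner: looks only at the first character of the key."""
    return ord(key[0]) % num_reducers


def string_hash_partition(key, num_reducers):
    """Hash every character (31 * h + c), kept as a non-negative 31-bit value."""
    h = 0
    for ch in key:
        h = (31 * h + ord(ch)) & 0x7FFFFFFF
    return h % num_reducers


def reducer_load(keys, partition, num_reducers):
    """Count how many records each reducer would receive."""
    load = Counter(partition(k, num_reducers) for k in keys)
    return [load.get(r, 0) for r in range(num_reducers)]


keys = [f"user_{i}" for i in range(1000)]   # hypothetical key set
print(reducer_load(keys, first_char_partition, 4))
print(reducer_load(keys, string_hash_partition, 4))
```

Because every key starts with the same character, the first partitioner sends all 1000 records to a single reducer, while hashing the whole key spreads them across reducers. A better hash does not help when one key alone carries most of the records; that case calls for the separate preprocessing job.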
3.2.3. Reduce Intermediate data with Combiner in Hadoop
Implement a combiner to reduce the amount of intermediate data, which makes the transfer from mappers to reducers faster.
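A word-count example shows how much a combiner can shrink the intermediate data. The Python sketch below is a plain simulation, not Hadoop API code: the function names are hypothetical, and it assumes the reduce operation (a sum) is associative and commutative, which is what makes it safe to apply on the map side.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combiner step: sum counts per key locally, before the shuffle."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


line = "to be or not to be"
map_output = map_words(line)
combined = combine(map_output)
print(len(map_output), "records before combining")
print(len(combined), "records after combining:", combined)
```

Here six map output records become four, and the saving grows with the number of repeated keys in each mapper's output.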
3.2.4. Speculative Execution
When a few tasks take a long time to finish, they delay the whole MapReduce job. Speculative execution addresses this by launching backup copies of slow tasks on other machines.
To enable speculative execution, set the configuration parameters ‘mapreduce.map.speculative’ and ‘mapreduce.reduce.speculative’ to true (in older releases, ‘mapred.map.tasks.speculative.execution’ and ‘mapred.reduce.tasks.speculative.execution’). This can reduce the job execution time when a task is progressing slowly because of a problem on its node, such as memory unavailability.
There are several performance tuning tips and tricks for a Hadoop cluster, and we have highlighted some of the important ones. For more ways to improve Hadoop cluster performance, see Job optimization techniques in Big data Hadoop.
If you liked this blog or have any questions about Hadoop MapReduce performance tuning, leave a comment in the comment box. We will be glad to answer them.