Hadoop Optimization | Job Optimization & Performance Tuning
1. Hadoop Optimization Tutorial: Objective
This Hadoop Optimization tutorial explains Hadoop cluster optimization and MapReduce job optimization techniques that help you get the best performance from your Hadoop cluster.
2. 6 Hadoop Optimization or Job Optimization Techniques
There are several ways to optimize Hadoop jobs. Let’s discuss each of them one by one.
i. Proper configuration of your cluster
- Mount the DFS and MapReduce storage with the noatime option (for example, mount -o noatime). This disables access-time updates and can improve I/O performance.
- Avoid RAID on TaskTracker and DataNode machines; it generally reduces performance.
- Make sure you have configured mapred.local.dir and dfs.data.dir to point to one directory on each of your disks to ensure that all of your I/O capacity is used.
- Ensure that you have smart monitoring of the health status of your disk drives. This is one of the best practices for Hadoop MapReduce performance tuning. MapReduce jobs are fault tolerant, but dying disks can degrade performance because tasks must be re-executed.
- Monitor graphs of swap usage and network usage with software such as Ganglia, using the Hadoop monitoring metrics. If you see swap being used, reduce the amount of RAM allocated to each task in mapred.child.java.opts.
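The directory and memory settings in the list above can be sketched in a few lines of plain Java. Everything here is illustrative: the class and method names, the mount points, and the memory figures are assumptions, not values from a real cluster. The sketch builds a comma-separated directory list with one directory per disk, and derives a rough upper bound for the per-task heap so that all task slots fit in RAM without swapping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ClusterConfigSketch {

    // Build a comma-separated value of the kind used for mapred.local.dir
    // or dfs.data.dir: one directory on each physical disk.
    static String dirList(List<String> diskMounts, String subDir) {
        List<String> dirs = new ArrayList<>();
        for (String mount : diskMounts) {
            dirs.add(mount + "/" + subDir);
        }
        return String.join(",", dirs);
    }

    // Rough upper bound for the per-task heap (-Xmx in mapred.child.java.opts):
    // the RAM left after the OS and daemons, divided by the number of task slots.
    // A JVM also uses memory outside the heap, so treat this as a ceiling.
    static int maxHeapMbPerTask(int nodeRamMb, int reservedMb, int mapSlots, int reduceSlots) {
        return (nodeRamMb - reservedMb) / (mapSlots + reduceSlots);
    }

    public static void main(String[] args) {
        // Hypothetical mount points, one per disk.
        List<String> disks = Arrays.asList("/data/1", "/data/2", "/data/3");
        System.out.println("mapred.local.dir=" + dirList(disks, "mapred/local"));
        System.out.println("dfs.data.dir=" + dirList(disks, "dfs/data"));

        // Hypothetical node: 32 GB RAM, 8 GB reserved, 8 map slots and 4 reduce slots.
        int heapMb = maxHeapMbPerTask(32768, 8192, 8, 4);
        System.out.println("mapred.child.java.opts=-Xmx" + heapMb + "m");
    }
}
```

If monitoring still shows swap activity with a value computed this way, lower the heap further or reduce the number of slots per node.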
ii. LZO compression usage
This is nearly always a good idea for intermediate data. Almost every Hadoop job that generates a non-negligible amount of map output will benefit from intermediate data compression with LZO. Although LZO adds a little CPU overhead, it saves time by reducing the amount of disk I/O during the shuffle.
To enable compression of map output, set mapred.compress.map.output to true. The codec itself is selected with mapred.map.output.compression.codec, which must name an LZO codec class for LZO to be used. This is one of the most important Hadoop optimization techniques.
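To see why compressing map output reduces shuffle I/O, here is a minimal sketch. LZO is not part of the JDK, so this example swaps in java.util.zip.Deflater at its fastest setting purely as a stand-in; it is not LZO and its ratios and speed differ. The class name, method names, and the sample word-count records are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class MapOutputCompressionSketch {

    // Generate repetitive "word<TAB>1" records, similar to word-count map output.
    static byte[] sampleMapOutput(int records) {
        String[] words = {"hadoop", "map", "reduce", "shuffle"};
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < records; i++) {
            sb.append(words[i % words.length]).append('\t').append(1).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Compress with Deflater (a stand-in for LZO) and return the compressed byte count.
    static int compressedSize(byte[] raw) {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(raw);
        deflater.finish();
        byte[] buffer = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        byte[] raw = sampleMapOutput(10000);
        System.out.println("raw bytes:        " + raw.length);
        System.out.println("compressed bytes: " + compressedSize(raw));
    }
}
```

The fewer bytes a mapper spills and the reducers fetch, the less time the shuffle spends on disk and network, which is the trade the section describes: a little CPU for less I/O.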
iii. Proper tuning of the number of MapReduce tasks
- If each task takes less than 30-40 seconds, reduce the number of tasks. Starting a mapper or reducer involves several steps: the JVM must be started (loaded into memory) and initialized, and after the mapper or reducer finishes, it must be shut down. All of these JVM steps are costly. Consider a mapper that runs for only 20-30 seconds: starting, initializing, and stopping the JVM can take a considerable share of that time. It is recommended that each task run for at least 1 minute.
- If a job has more than 1 TB of input, consider increasing the block size of the input dataset to 256 MB or even 512 MB so that the number of tasks is smaller. You can change the block size of existing files with the command hadoop distcp -Ddfs.block.size=$[256*1024*1024] /path/to/inputdata /path/to/inputdata-with-largeblocks
- So long as each task runs for at least 30-40 seconds, you should increase the number of mapper tasks to some multiple of the number of mapper slots in the cluster.
- Don’t schedule too many reduce tasks. For most jobs, the number of reduce tasks should be equal to or a bit less than the number of reduce slots in the cluster.
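The effect of block size on the number of map tasks is simple arithmetic. The sketch below assumes roughly one map task per HDFS block, which holds only for splittable input; the class and method names and the slot count are hypothetical.

```java
public class TaskCountSketch {

    static final long MB = 1024L * 1024L;
    static final long TB = 1024L * 1024L * MB;

    // Approximate number of map tasks, assuming one map task per block.
    static long mapTasks(long inputBytes, long blockBytes) {
        return (inputBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    // Round a task count up to the next multiple of the cluster's map slots,
    // so that the tasks run in full waves.
    static long roundUpToSlots(long tasks, int mapSlots) {
        return ((tasks + mapSlots - 1) / mapSlots) * mapSlots;
    }

    public static void main(String[] args) {
        for (long blockMb : new long[] {64, 256, 512}) {
            long tasks = mapTasks(TB, blockMb * MB);
            System.out.println(blockMb + " MB blocks: " + tasks + " map tasks for 1 TB");
        }
        // Hypothetical cluster with 100 map slots.
        System.out.println("4096 tasks rounded to slots: " + roundUpToSlots(4096, 100));
    }
}
```

Going from 64 MB to 256 MB blocks cuts the task count by a factor of four, so each task runs longer and the JVM startup cost is paid far less often.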
iv. Combiner between mapper and reducer
If your algorithm involves computing aggregates of any sort, use a Combiner to perform some of the aggregation before the data reaches the reducer. The MapReduce framework runs the combiner intelligently to reduce the amount of data that must be written to disk and transferred between the Map and Reduce stages of computation.
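As a rough illustration of what a combiner saves, the sketch below simulates local aggregation for a word count in plain Java. It does not use the Hadoop API; the class name, method name, and sample words are hypothetical, and a real combiner would be a reducer-style class registered on the job.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CombinerSketch {

    // Stand-in for a combiner: sum the counts of identical keys emitted by one
    // mapper, so that only one record per distinct key leaves the map side.
    static Map<String, Integer> combine(List<String> mapOutputKeys) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (String key : mapOutputKeys) {
            combined.merge(key, 1, Integer::sum); // each map output record carries a count of 1
        }
        return combined;
    }

    public static void main(String[] args) {
        List<String> mapOutput = Arrays.asList("to", "be", "or", "not", "to", "be");
        Map<String, Integer> combined = combine(mapOutput);
        System.out.println("records before combine: " + mapOutput.size());
        System.out.println("records after combine:  " + combined.size());
        System.out.println("combined output: " + combined);
    }
}
```

This works for a sum because the operation is associative and commutative, so partial sums computed per mapper give the same final result in the reducer.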
v. Usage of most appropriate and compact writable type for data
Users who are new to big data, or who are switching from Hadoop Streaming to Java MapReduce, often use the Text writable type unnecessarily. Although Text can be convenient, converting numeric data to and from UTF-8 strings is inefficient and can account for a significant portion of CPU time. Whenever you deal with non-textual data, consider using binary Writables such as IntWritable and FloatWritable.
vi. Reuse of Writables
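The difference between a binary and a textual encoding can be sketched without Hadoop on the classpath. IntWritable serializes its value as a 4-byte binary int, while Text writes a length prefix followed by UTF-8 bytes. The example below mimics only the payloads with JDK classes; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class WritableSizeSketch {

    // Fixed 4-byte big-endian encoding, the same layout IntWritable writes.
    static byte[] asBinary(int value) {
        return ByteBuffer.allocate(Integer.BYTES).putInt(value).array();
    }

    // UTF-8 digits of the number, i.e. the payload a Text value would carry
    // (Text additionally writes a length prefix, which is not counted here).
    static byte[] asText(int value) {
        return Integer.toString(value).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        int value = Integer.MAX_VALUE;
        System.out.println("binary bytes: " + asBinary(value).length);
        System.out.println("text bytes:   " + asText(value).length);

        // Reading the value back: the binary form is a direct read,
        // while the text form has to be decoded and parsed.
        int fromBinary = ByteBuffer.wrap(asBinary(value)).getInt();
        int fromText = Integer.parseInt(new String(asText(value), StandardCharsets.UTF_8));
        System.out.println("round trip ok: " + (fromBinary == value && fromText == value));
    }
}
```

Besides the size difference, the text path formats and parses a string for every record, which is where the CPU time mentioned above goes.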
One of the common mistakes that many MapReduce users make is to allocate a new Writable object for every output from a mapper or reducer. For example, to implement a word-count mapper:
[php]public void map(…) {
  …
  for (String word : words) {
    // Allocates a new Text and a new IntWritable for every word
    output.collect(new Text(word), new IntWritable(1));
  }
}[/php]
This implementation allocates thousands of short-lived objects. While the Java garbage collector does a reasonable job of dealing with them, it is more efficient to write:
[php]class MyMapper … {
  // Allocated once and reused for every output record
  Text wordText = new Text();
  IntWritable one = new IntWritable(1);

  public void map(…) {
    …
    for (String word : words) {
      wordText.set(word);
      output.collect(wordText, one);
    }
  }
}[/php]
Reusing Writables in this way is another Hadoop job optimization technique that applies as data flows through MapReduce.
3. Conclusion
To conclude this Hadoop Optimization tutorial, there are various job optimization techniques that help you optimize MapReduce jobs, such as using a combiner between the mapper and reducer, using LZO compression, properly tuning the number of MapReduce tasks, and reusing Writables. If you know of any other MapReduce job optimization technique, please let us know by leaving a comment in the section below.
Hello,
Can you please tell what is the -noatime in the first point.
Thanks