
Hadoop Optimization | Job Optimization & Performance Tuning

1. Hadoop Optimization Tutorial: Objective

This tutorial on Hadoop Optimization explains Hadoop cluster optimization and MapReduce job optimization techniques that will help you tune MapReduce job performance and get the best performance out of your Hadoop cluster.


2. 6 Hadoop Optimization or Job Optimization Techniques

There are various techniques for Hadoop optimization. Let's discuss each of them one by one:

i. Proper configuration of your cluster

ii. LZO compression usage

Compressing intermediate data is almost always a good idea. Almost every Hadoop job that generates a non-negligible amount of map output benefits from intermediate data compression with LZO. Although LZO adds a little CPU overhead, it saves time by reducing the amount of disk IO during the shuffle.
To enable LZO compression, set mapred.compress.map.output to true. This is one of the most important Hadoop optimization techniques.
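
As a minimal sketch using the classic mapred API (assuming the hadoop-lzo library is installed on the cluster; the driver class name is illustrative), the flag and the matching codec property can be set in the job configuration:
[java]
JobConf conf = new JobConf(MyJob.class);  // MyJob is an illustrative driver class
// Compress intermediate map output (the property named above)
conf.setBoolean("mapred.compress.map.output", true);
// Use the LZO codec shipped with the hadoop-lzo library
conf.set("mapred.map.output.compression.codec",
         "com.hadoop.compression.lzo.LzoCodec");
[/java]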

iii. Proper tuning of the number of MapReduce tasks
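
As a rough sketch, assuming the classic mapred API used elsewhere in this tutorial: the number of map tasks is largely driven by the number of input splits, while the number of reduce tasks can be set explicitly in the job driver and tuned to the cluster's reduce capacity (the value below is purely illustrative):
[java]
JobConf conf = new JobConf(MyJob.class);  // illustrative driver class
// Map-task count mostly follows the input splits; reduce-task count is explicit
conf.setNumReduceTasks(16);  // illustrative value; tune to your cluster's capacity
[/java]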

iv. Combiner between mapper and reducer

If your algorithm involves computing aggregates of any sort, use a Combiner to perform partial aggregation before the data reaches the reducer. The MapReduce framework runs the combiner intelligently to reduce the amount of data that has to be written to disk and transferred between the Map and Reduce stages of computation.
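
For example (a sketch with the classic mapred API and illustrative class names): in word count, the reducer's sum is commutative and associative, so the reducer class itself can be registered as the combiner:
[java]
JobConf conf = new JobConf(WordCount.class);     // illustrative class names
conf.setMapperClass(WordCountMapper.class);
conf.setCombinerClass(WordCountReducer.class);   // partial sums on the map side
conf.setReducerClass(WordCountReducer.class);    // final sums after the shuffle
[/java]
Note that a reducer can double as a combiner only when the reduce function is commutative and associative, as summing counts is.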

v. Usage of the most appropriate and compact Writable type for data

New Big Data users, or users switching from Hadoop Streaming to Java MapReduce, often use the Text writable type unnecessarily. Although Text can be convenient, converting numeric data to and from UTF8 strings is inefficient and can actually consume a significant portion of CPU time. Whenever dealing with non-textual data, consider using binary Writables such as IntWritable, FloatWritable, etc.
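
As a minimal illustration, emitting a count as an IntWritable avoids the numeric-to-UTF8 round trip that a Text value would require:
[java]
int count = 42;
// Inefficient: the number is round-tripped through a UTF8 string
Text asText = new Text(Integer.toString(count));
// Compact: a fixed-size binary Writable, no string conversion
IntWritable asInt = new IntWritable(count);
[/java]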

vi. Reuse of Writables

One of the common mistakes that many MapReduce users make is to allocate a new Writable object for every output from a mapper or reducer. For example, to implement a word-count mapper:
[java]
public void map(…) {
  for (String word : words) {
    // A new Text and a new IntWritable are allocated for every record
    output.collect(new Text(word), new IntWritable(1));
  }
}
[/java]
This implementation causes the allocation of thousands of short-lived objects. While the Java garbage collector does a reasonable job of dealing with this, it is more efficient to write:
[java]
class MyMapper … {
  // Allocate the Writables once and refill them for each record
  Text wordText = new Text();
  IntWritable one = new IntWritable(1);

  public void map(…) {
    …
    for (String word : words) {
      wordText.set(word);
      output.collect(wordText, one);
    }
  }
}
[/java]
This works because output.collect() serializes the key and value as soon as it is called, so the same objects can safely be refilled on the next iteration. Reusing Writables in this way is another useful Hadoop job optimization technique for data flowing through MapReduce.

3. Conclusion

In conclusion of the Hadoop Optimization tutorial, we can say that various job optimization techniques help you optimize MapReduce jobs: using a combiner between the mapper and reducer, using LZO compression, properly tuning the number of MapReduce tasks, and reusing Writables. If you know of any other MapReduce job optimization technique, please let us know by leaving a comment in the section below.