How to optimize a MapReduce job?

This topic has 3 replies, 1 voice, and was last updated 5 years, 7 months ago by DataFlair Team.

- September 20, 2018 at 4:28 pm #5868
  
  DataFlair Team
  Spectator
  
  My MapReduce job is taking a very long time to finish. What can be done to optimize it?
- September 20, 2018 at 4:29 pm #5870
  
  DataFlair Team
  Spectator
  
  There is no single answer to this question, but one can try the following things to optimize MapReduce jobs:
  
  Write a combiner along with the map and reduce functions where possible. It pre-aggregates map output locally, so less data has to be shuffled to the reducers.
  For very large inputs (in the terabyte range), try a larger block size such as 256 MB or even 512 MB, so that the input is divided among fewer map tasks.
  When sending more than one value for a given key, use a purpose-built Writable type instead of appending the values to a buffer and sending that.
  Keep the number of mappers and reducers at a reasonable value. Starting a mapper or reducer process involves starting a JVM (loading it into memory), initializing it, and tearing it down after the task finishes. All of these JVM steps are costly.
  Consider a mapper that runs for just 20-30 seconds: starting, initializing, and stopping the JVM can then take a considerable share of the total time. As a rule of thumb, each task should run for at least 1 minute.
  Compress map output. Compressing the intermediate output reduces the amount of data that has to be shuffled between the mapper nodes and the reducer nodes.
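
To make the combiner tip above concrete, here is a minimal sketch in plain Java (no Hadoop classes) that simulates a word-count map task and counts how many records would enter the shuffle with and without map-side combining. The class and method names are hypothetical and only for illustration. In a real job the combiner is a Reducer class registered with `Job.setCombinerClass`, and it is only safe when the reduce operation is associative and commutative (a sum is, an average is not).

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of map-side combining; all names here are hypothetical.
final class CombinerSketch {

    // Emulates a word-count map task: emits one (word, 1) pair per input token.
    static List<Map.Entry<String, Integer>> mapPhase(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Emulates a combiner: sums the values per key before anything is shuffled.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            combined.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> raw = mapPhase("to be or not to be");
        Map<String, Integer> combined = combine(raw);
        System.out.println("records shuffled without combiner: " + raw.size());      // 6
        System.out.println("records shuffled with combiner:    " + combined.size()); // 4
        System.out.println(combined); // {to=2, be=2, or=1, not=1}
    }
}
```

The saving grows with the number of repeated keys per map task; if almost every key is unique within a task, a combiner adds work without reducing the shuffle.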
  To learn more about how to optimize a MapReduce job, see: MapReduce Job Optimization
- September 20, 2018 at 4:29 pm #5874
  
  DataFlair Team
  Spectator
  
  In Hadoop, there are many ways to optimize jobs so that they run faster on your data. The following rules of thumb are worth checking first, before spending more time on tuning:
  Number of mappers:
  How long are your mappers running for? If they run for only a few seconds on average, see whether you can use fewer mappers and make each of them run longer, a minute or so as a rule of thumb. The extent to which this is possible depends on the input format you are using.
  Number of reducers:
  For maximum performance, the number of reducers should be slightly less than the number of reduce slots in the cluster. This lets the reducers finish in one wave and keeps the cluster fully utilized during the reduce phase.
  Combiner:
  A MapReduce job can take advantage of a combiner to reduce the amount of data passing through the shuffle.
  Intermediate compression:
  Compress the map output so that less data is written to disk and transferred to the reducers.
  Custom serialization:
  If you are using your own custom Writable objects, or custom comparators, then make sure you have implemented RawComparator.
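
The point of a raw comparator is that keys can be ordered during the sort directly from their serialized bytes, without deserializing an object for every comparison. The sketch below illustrates the idea in plain Java for a 4-byte big-endian int key; it is not Hadoop code. The class and helper names are hypothetical, and only the parameter layout of `compare` mirrors Hadoop's `RawComparator.compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2)`.

```java
import java.nio.ByteBuffer;

// Plain-Java illustration of byte-level key comparison; names are hypothetical.
final class RawIntComparatorSketch {

    // Serializes an int as 4 big-endian bytes (ByteBuffer's default byte order).
    static byte[] serialize(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Reads a big-endian int straight out of a buffer, with no object allocation.
    static int readInt(byte[] buffer, int start) {
        return ((buffer[start] & 0xff) << 24)
             | ((buffer[start + 1] & 0xff) << 16)
             | ((buffer[start + 2] & 0xff) << 8)
             |  (buffer[start + 3] & 0xff);
    }

    // Compares two serialized keys in place. The length arguments are unused
    // here because this sketch assumes fixed-width 4-byte keys.
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return Integer.compare(readInt(b1, s1), readInt(b2, s2));
    }

    public static void main(String[] args) {
        byte[] a = serialize(-5);
        byte[] b = serialize(42);
        // Prints a negative number: -5 sorts before 42.
        System.out.println(compare(a, 0, a.length, b, 0, b.length));
    }
}
```

For a custom Writable with several fields, the same approach applies field by field: decode only the bytes needed to decide the ordering and stop at the first difference.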
  
  Follow the link for more detail: MapReduce Job Optimization
- September 20, 2018 at 4:29 pm #5876
  
  DataFlair Team
  Spectator
  
  The following actions can help optimize MapReduce jobs:
  
  Combiner: Using a combiner reduces the amount of data transferred to each of the reducers, since the combiner merges the output on the mapper side.
  
  Number of reducers: Choose a suitable number of reducers. If the data size is huge, a single reducer is not a good idea. Setting the number of reducers very high is not a good idea either, since the number of reducers also determines the number of partitions on the mapper side.
  
  Compress mapper output: It is recommended to compress the mapper output (controlled by the configuration property mapreduce.map.output.compress) so that less data is written to disk and transferred to the reducers.
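
As a rough illustration of why compressing map output helps, the sketch below builds some fake intermediate key/value lines and gzips them with the JDK's `GZIPOutputStream`. This is plain Java, not Hadoop code: the class name, the record layout, and the choice of GZIP are assumptions made for the example. On a cluster the feature is switched on with `mapreduce.map.output.compress=true`, and the codec is selected with `mapreduce.map.output.compress.codec`.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Plain-Java illustration of compressing intermediate data; names are hypothetical.
final class MapOutputCompressionSketch {

    // Builds fake map output: tab-separated (key, 1) lines over 50 repeating keys.
    static byte[] fakeMapOutput(int records) {
        StringBuilder lines = new StringBuilder();
        for (int i = 0; i < records; i++) {
            lines.append("user").append(i % 50).append('\t').append(1).append('\n');
        }
        return lines.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Compresses a byte array with GZIP, standing in for a Hadoop compression codec.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The stream is closed at this point, so the GZIP trailer has been written.
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = fakeMapOutput(10_000);
        byte[] packed = gzip(raw);
        System.out.println("uncompressed bytes: " + raw.length);
        System.out.println("compressed bytes:   " + packed.length);
    }
}
```

Real map output is rarely this repetitive, so actual ratios will be lower. Because compression costs CPU time on both the map and the reduce side, it pays off mainly when the shuffle is limited by disk or network rather than by CPU.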
  
  Follow the link for more detail: MapReduce Job Optimization