How to optimize a MapReduce job?

Viewing 3 reply threads
  • Author
    Posts
    • #5868
      DataFlair Team
      Spectator

      My MapReduce job is taking a very long time to finish. What are the mechanisms to optimize a MapReduce job?

    • #5870
      DataFlair Team
      Spectator

      There is no single answer to this question, but one can try the following things to optimize MapReduce jobs:

      Try to write a Combiner along with the map and reduce functions where possible. This reduces the amount of data shuffled, so the MapReduce job is optimized (see the sketch below).
      For huge files (in the TB range), one can try keeping the block size at 256 MB or even 512 MB.
      Use specific Writable types when sending more than one value for a given key, instead of appending the values to a text buffer and parsing them on the reducer side.
      Keep the number of mappers and reducers at a reasonable value. Starting a mapper or reducer process involves the following: starting the JVM (loading it into memory), initializing it, and de-initializing it after processing. All of these JVM tasks are costly.
      Now, consider a case where a mapper runs for just 20-30 seconds: we still have to start/initialize/stop a JVM, which can take a considerable amount of time. It is recommended that each task run for at least one minute.
      Compress the map output. Compressing the intermediate output reduces the amount of data that needs to be shuffled between the mapper node and the reducer node.
      To learn more about how to optimize a MapReduce job, follow: MapReduce Job Optimization
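
      As an illustration of the combiner point above, here is a minimal word-count driver sketch, assuming the Hadoop 2.x MapReduce Java API; the class names (WordCountWithCombiner, TokenMapper, SumReducer) are placeholders for illustration. Reusing the reducer as the combiner is only safe here because summing counts is commutative and associative.

      import java.io.IOException;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class WordCountWithCombiner {

        // Emits (word, 1) for every token in the input line.
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
          private static final IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();

          @Override
          protected void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
              if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
              }
            }
          }
        }

        // Sums partial counts; usable as a combiner because addition is
        // commutative and associative.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
          private final IntWritable result = new IntWritable();

          @Override
          protected void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
              sum += value.get();
            }
            result.set(sum);
            context.write(key, result);
          }
        }

        public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "word count with combiner");
          job.setJarByClass(WordCountWithCombiner.class);
          job.setMapperClass(TokenMapper.class);
          // The combiner pre-aggregates map output locally, shrinking shuffle traffic.
          job.setCombinerClass(SumReducer.class);
          job.setReducerClass(SumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
      }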

    • #5874
      DataFlair Team
      Spectator

      In Hadoop, there are many ways to optimize jobs to make them run faster on your data. But the following are the rules of thumb that should be checked before spending more time on it:
      Number of mappers:
      How long are your mappers running for? If they are only running for a few seconds on average, then you should see if there’s a way to have fewer mappers and make them all run longer, a minute or so, as a rule of thumb. The extent to which this is possible depends on the input format you are using.
      Number of reducers:
      The number of reducers should be slightly less than the number of reduce slots in the cluster to achieve maximum performance. This permits the reducers to finish in one wave and fully utilizes the cluster during the reduce phase.
      Combiner:
      A MapReduce job can take advantage of a combiner to reduce the amount of data passing through the shuffle.
      Intermediate compression:
      Compressing the intermediate map output reduces the amount of data written to local disk and shuffled across the network, and can noticeably shorten job execution time.
      Custom serialization:
      If you are using your own custom Writable objects or custom comparators, then make sure you have implemented a RawComparator (see the sketch at the end of this reply).

      Follow the link for more detail: MapReduce Job Optimization
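
      To make the custom serialization point concrete, here is a minimal sketch of a custom key type with a registered RawComparator, assuming the Hadoop Writable API; the key layout (a single long timestamp) and the class name TimestampKey are assumptions for illustration. The raw comparator compares the serialized bytes directly, so the sort during the shuffle does not need to deserialize keys.

      import java.io.DataInput;
      import java.io.DataOutput;
      import java.io.IOException;

      import org.apache.hadoop.io.WritableComparable;
      import org.apache.hadoop.io.WritableComparator;

      public class TimestampKey implements WritableComparable<TimestampKey> {
        private long timestamp;

        public TimestampKey() { }

        public TimestampKey(long timestamp) {
          this.timestamp = timestamp;
        }

        @Override
        public void write(DataOutput out) throws IOException {
          out.writeLong(timestamp);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
          timestamp = in.readLong();
        }

        @Override
        public int compareTo(TimestampKey other) {
          return Long.compare(timestamp, other.timestamp);
        }

        @Override
        public boolean equals(Object o) {
          return o instanceof TimestampKey && ((TimestampKey) o).timestamp == timestamp;
        }

        @Override
        public int hashCode() {
          // Used by the default HashPartitioner to assign keys to reducers.
          return Long.hashCode(timestamp);
        }

        // Compares the 8 serialized bytes directly, avoiding object creation and
        // deserialization during the sort phase.
        public static class Comparator extends WritableComparator {
          public Comparator() {
            super(TimestampKey.class);
          }

          @Override
          public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            return Long.compare(readLong(b1, s1), readLong(b2, s2));
          }
        }

        static {
          // Register the raw comparator so Hadoop uses it for this key type.
          WritableComparator.define(TimestampKey.class, new Comparator());
        }
      }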

    • #5876
      DataFlair Team
      Spectator

      The following actions can help optimize MapReduce jobs:

      Combiner: Using a combiner reduces the amount of data transferred to each of the reducers, since the combiner merges the output on the mapper side.

      Number of reducers: Choose an optimal number of reducers. If the data size is huge, then a single reducer is not a good idea. Setting the number of reducers too high is also not a good idea, since the number of reducers determines the number of partitions on the mapper side. A minimal sketch follows.
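
      A minimal driver sketch for choosing the reducer count, assuming Hadoop's Job API; the capacity figure used here (40 reduce containers) is an assumed value for illustration and should be replaced with your own cluster's reduce capacity.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.mapreduce.Job;

      public class ReducerCountExample {
        public static void main(String[] args) throws Exception {
          // Assumed cluster reduce capacity; check your ResourceManager for the real value.
          int clusterReduceCapacity = 40;
          // Slightly fewer reducers than capacity lets them finish in a single wave.
          int numReducers = (int) (clusterReduceCapacity * 0.95);

          Job job = Job.getInstance(new Configuration(), "reducer count example");
          job.setNumReduceTasks(numReducers);
          // ... set mapper/reducer classes and input/output paths, then submit the job.
        }
      }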

      Compress Mapper Output: It is recommended to compress the mapper output (controlled by the configuration property mapreduce.map.output.compress), so that less data gets written to disk and transferred to the reducers; a minimal sketch follows.
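
      A minimal sketch of enabling map output compression in the job driver, assuming the Snappy codec is available on the cluster; the class name is a placeholder for illustration.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.io.compress.CompressionCodec;
      import org.apache.hadoop.io.compress.SnappyCodec;
      import org.apache.hadoop.mapreduce.Job;

      public class MapOutputCompressionExample {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          // Compress the intermediate (map) output so less data is written to local
          // disk and shuffled to the reducers.
          conf.setBoolean("mapreduce.map.output.compress", true);
          conf.setClass("mapreduce.map.output.compress.codec",
                        SnappyCodec.class, CompressionCodec.class);

          Job job = Job.getInstance(conf, "map output compression example");
          // ... set mapper/reducer classes and input/output paths, then submit the job.
        }
      }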

      Follow the link for more detail: MapReduce Job Optimization
