Explain reduceByKey() operation

    • #5051
      DataFlair Team
      Spectator

      Explain reduceByKey() operation

    • #5053
      DataFlair Team
      Spectator

      reduceByKey() is a transformation that operates on a pair RDD (an RDD of key/value tuples).
      > Because each element is a (key, value) tuple, the function we pass is applied to the values that share a key, not to each element on its own.
      > It merges the values for each key using an associative and commutative reduce function.
      > It is a wide transformation because data may be shuffled across partitions.
      > It merges values locally within each partition before sending data across partitions, which reduces the amount of data shuffled.
      > It takes as input a function with two parameters of the same type (two values associated with the same key) and returns a single value of that same type.
      > It has three overloaded forms:

      reduceByKey(function)
      reduceByKey(function, numPartitions)
      reduceByKey(partitioner, function)
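
      As a rough illustration of the three forms listed above, here is a minimal sketch that runs the same per-key sum through each overload. The object name ReduceByKeyOverloads, the helper sumThreeWays, the sample data, the partition counts (4 and 2) and the local[2] master are all assumptions made for the example, not part of the original answer.

      ```scala
      import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

      object ReduceByKeyOverloads {
        // Hypothetical helper: runs the same per-key sum through each overload.
        def sumThreeWays(sc: SparkContext): Seq[Map[String, Int]] = {
          // Sample data chosen for illustration only.
          val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4), ("c", 5)))
          val add = (x: Int, y: Int) => x + y

          // 1. Function only: Spark picks the partitioning of the result.
          val byDefault = pairs.reduceByKey(add)
          // 2. Function plus the number of partitions of the result (4 is arbitrary).
          val byCount = pairs.reduceByKey(add, 4)
          // 3. Explicit partitioner plus function (HashPartitioner with 2 partitions is arbitrary).
          val byPartitioner = pairs.reduceByKey(new HashPartitioner(2), add)

          Seq(byDefault, byCount, byPartitioner).map(_.collect().toMap)
        }

        def main(args: Array[String]): Unit = {
          // Local master so the sketch can run without a cluster.
          val conf = new SparkConf().setAppName("reduceByKey-overloads").setMaster("local[2]")
          val sc = new SparkContext(conf)
          try sumThreeWays(sc).foreach(println)
          finally sc.stop()
        }
      }
      ```

      All three calls produce the same value per key (a -> 4, b -> 6, c -> 5); only the partitioning of the resulting RDD differs.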

      From :
      http://data-flair.training/blogs/rdd-transformations-actions-apis-apache-spark/#210_ReduceByKey
      It uses an associative reduce function to merge the values of each key. It can be used only with RDDs of key/value pairs. It is a wide operation that shuffles data from multiple partitions and creates another RDD. It merges data locally using the associative function to optimize the shuffle. The result of the combination (e.g. a sum) has the same type as the values, and the operation used to combine results from different partitions is the same as the one used to combine values inside a partition.
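
      To make the "merge locally, then merge across partitions" idea concrete, here is a small sketch in plain Scala collections. It does not use Spark and is not how Spark implements the shuffle; it only models the two steps with the same function. The names LocalCombineSketch, combineLocally and reduceByKeySketch are hypothetical, and the way the data is split into two partitions is an assumption.

      ```scala
      object LocalCombineSketch {
        // Merge the values of each key inside one partition (the "local" step).
        def combineLocally[K, V](partition: Seq[(K, V)], f: (V, V) => V): Map[K, V] =
          partition.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
            // First value for a key is kept as-is; later values are merged with f.
            acc.updated(k, acc.get(k).map(prev => f(prev, v)).getOrElse(v))
          }

        // Combine each partition locally, then merge the partial results
        // from all partitions using the very same function f.
        def reduceByKeySketch[K, V](partitions: Seq[Seq[(K, V)]], f: (V, V) => V): Map[K, V] = {
          val partials = partitions.map(p => combineLocally(p, f))
          combineLocally(partials.flatMap(_.toSeq), f)
        }

        def main(args: Array[String]): Unit = {
          // Assumed split of the sample data into two partitions.
          val partitions = Seq(
            Seq((5, 10), (5, 15), (4, 8)),
            Seq((4, 12), (5, 20), (10, 50))
          )
          println(reduceByKeySketch[Int, Int](partitions, _ + _))
        }
      }
      ```

      With this assumed split, the first partition is reduced to (5,25) and (4,8) before anything is exchanged, so five partial records cross partitions instead of the original six. Because the same function is applied inside a partition and across partitions, its result must have the same type as the values.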

      Example :

      val rdd1 = sc.parallelize(Seq((5,10),(5,15),(4,8),(4,12),(5,20),(10,50)))
      val rdd2 = rdd1.reduceByKey((x,y)=>x+y)

      OR

      rdd2.collect()


      Output:
      Array[(Int, Int)] = Array((4,20),(10,50),(5,45))

    • #5054
      DataFlair Team
      Spectator

      The OR part is missing from the example above; it is as follows:

      val rdd2 = rdd1.reduceByKey(_+_)
