reduceByKey() is a transformation that operates on a pair RDD (an RDD of key/value pairs).
> A pair RDD contains tuples, so we pass a function that operates on the values of a tuple rather than on each element.
> It merges the values for each key using an associative reduce function.
> It is a wide operation because data may be shuffled across multiple partitions.
> It merges data locally (a map-side combine) before sending data across partitions, which optimizes the shuffle.
> It takes a function with two parameters of the same type (values associated with the same key) that returns a single value of that same type.
> We can say that it has three overloaded variants:
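As a sketch, the three overloads in Spark's `PairRDDFunctions` differ in how the resulting partitioning is controlled (exact signatures as in the Spark core Scaladoc):

```scala
// From org.apache.spark.rdd.PairRDDFunctions[K, V]:
def reduceByKey(func: (V, V) => V): RDD[(K, V)]                      // default parallelism
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]  // fixed partition count
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] // custom partitioner
```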
From : http://data-flair.training/blogs/rdd-transformations-actions-apis-apache-spark/#210_ReduceByKey
reduceByKey() uses an associative reduce function to merge the values of each key. It can be used only on an RDD of key/value pairs. It is a wide operation: it shuffles data across multiple partitions and creates a new RDD. Before the shuffle, it merges data locally using the same associative function, which reduces the amount of data moved. The result of the combination (e.g. a sum) must be of the same type as the values, and the function must produce the same result whether it combines values inside a partition or partial results from different partitions.
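Because the output type must match the value type, an average cannot be computed directly with reduceByKey. A common sketch (assuming `sc` is an existing SparkContext) is to carry (sum, count) pairs as the value type and divide afterwards:

```scala
// Per-key average: make the value a (sum, count) tuple so the
// reduce function stays of type ((Double, Int), (Double, Int)) => (Double, Int).
val pairs = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 4.0)))
val sumCounts = pairs
  .mapValues(v => (v, 1))                             // value -> (sum, count)
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))  // associative merge
val averages = sumCounts.mapValues { case (sum, n) => sum / n }
```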
val rdd1 = sc.parallelize(Seq((5,10),(5,15),(4,8),(4,12),(5,20),(10,50)))
val rdd2 = rdd1.reduceByKey((x, y) => x + y)
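Collecting the result shows the per-key sums (5 → 10+15+20 = 45, 4 → 8+12 = 20, 10 → 50); note the ordering of the output pairs is not guaranteed:

```scala
rdd2.collect()
// e.g. Array((4,20), (5,45), (10,50)) -- order may vary
```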