Free Online Certification Courses – Learn Today. Lead Tomorrow. › Forums › Apache Spark › Explain reduceByKey() operation
- This topic has 2 replies, 1 voice, and was last updated 6 years ago by DataFlair Team.
-
AuthorPosts
-
-
September 20, 2018 at 2:01 pm #5051DataFlair TeamSpectator
Explain reduceByKey() operation
-
September 20, 2018 at 2:01 pm #5053DataFlair TeamSpectator
reduceByKey() is transformation which operate on pairRDD (which contains Key/Value).
> PairRDD contains tuple, hence we need to pass the function that operator on tuple instead of each element.
> It merges the values with the same key using associative reduce function.
> It is wide operation because data shuffles may happen across multiple partitions.
> It merges data locally before sending data across partitions for optimize data shuffling.
> It takes function as an input which has two parameter of the same type (values associated with same key) and one element output of the input type(value)
> We can say that it has three overloaded functions :
reduceBykey(function)
reduceByKey(function, numberofpartition)
reduceBykey(partitioner, function)
From :
http://data-flair.training/blogs/rdd-transformations-actions-apis-apache-spark/#210_ReduceByKey
It uses associative reduce function, where it merges value of each key. It can be used with Rdd only in key value pair. It’s wide operation which shuffles data from multiple partitions/divisions and creates another RDD. It merges data locally using associative function for optimized data shuffling. Result of the combination (e.g. a sum) is of the same type that the values, and that the operation when combined from different partitions is also the same as the operation when combining values inside a partition.Example :
val rdd1 = sc.parallelize(Seq(5,10),(5,15),(4,8),(4,12),(5,20),(10,50)))
val rdd2 = rdd1.reduceByKey((x,y)=>x+y)
OR
rdd2.collect()
Output:
Array[(Int, Int)] = Array((4,20),(10,50),(5,45)) -
September 20, 2018 at 2:01 pm #5054DataFlair TeamSpectator
OR part is missing, which is as follows:
val rdd2 = rdd1.reduceByKey(_+_)
-
-
AuthorPosts
- You must be logged in to reply to this topic.