> groupByKey() is a transformation that operates on a pair RDD (an RDD of key/value tuples).
> A pair RDD contains tuples, so functions passed to its operations work on a whole tuple rather than on a single element.
> It groups the values that share the same key into a new RDD.
> It is a wide operation because it shuffles data across multiple partitions.
From:
http://data-flair.training/blogs/rdd-transformations-actions-apis-apache-spark/#29_GroupBy
It works on key-value pairs and returns a new dataset of grouped items: a new RDD made up of each key (the group) and the list of items in that group. The order of elements within a group may differ when you apply the same operation to the same RDD repeatedly. It is a wide operation, as it shuffles data from multiple partitions and creates another RDD.
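To make the "key plus list of items in that group" result concrete, here is a minimal sketch that mimics the grouping with plain Scala collections. It involves no Spark, so there is no shuffle. The object `GroupByKeySketch` and the helper `groupByKeyLocal` are hypothetical names used only for illustration; they are not part of the Spark API.

```scala
object GroupByKeySketch {
  // Hypothetical helper: mimics the result of groupByKey on a local Seq.
  // Group the tuples by their first element, then keep only the values.
  def groupByKeyLocal[K, V](pairs: Seq[(K, V)]): Map[K, Seq[V]] =
    pairs.groupBy(_._1).map { case (key, kvs) => key -> kvs.map(_._2) }

  def main(args: Array[String]): Unit = {
    val pairs = Seq((5, 10), (5, 15), (4, 8), (4, 12), (5, 20), (10, 50))
    val grouped = groupByKeyLocal(pairs)

    // Sort by key only to make the printed order stable.
    grouped.toSeq.sortBy(_._1).foreach { case (key, values) =>
      println(s"$key -> ${values.mkString(", ")}")
    }
    // prints:
    // 4 -> 8, 12
    // 5 -> 10, 15, 20
    // 10 -> 50
  }
}
```

Unlike this local version, the real groupByKey() has to move all values for a key onto the same partition, which is why it counts as a wide operation.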
Example:
val rdd1 = sc.parallelize(Seq((5,10),(5,15),(4,8),(4,12),(5,20),(10,50)))
val rdd2 = rdd1.groupByKey()
rdd2.collect()
Output:
Array[(Int, Iterable[Int])] = Array((4,CompactBuffer(8,12)), (10,CompactBuffer(50)), (5,CompactBuffer(10,15,20)))
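Since neither the order of keys nor the order of values within a group is guaranteed, the output above can look different from run to run. If a stable result is needed (for a test, say), one option is to sort the values and the keys explicitly. The following is a sketch under stated assumptions: it runs as a standalone program with a local Spark session instead of in spark-shell, and the object name `GroupByKeyDeterministic`, the method `groupedSorted` and the app name are made up for illustration.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object GroupByKeyDeterministic {
  // Group by key, then sort values within each group and sort the keys,
  // so the collected result no longer depends on shuffle order.
  def groupedSorted(sc: SparkContext): Array[(Int, List[Int])] = {
    val rdd1 = sc.parallelize(Seq((5, 10), (5, 15), (4, 8), (4, 12), (5, 20), (10, 50)))
    rdd1.groupByKey()
      .mapValues(_.toList.sorted) // Iterable[Int] -> sorted List[Int]
      .sortByKey()                // ascending key order
      .collect()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode; in spark-shell a SparkContext `sc` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("groupByKey-sketch")
      .getOrCreate()

    groupedSorted(spark.sparkContext).foreach(println)
    // prints:
    // (4,List(8, 12))
    // (5,List(10, 15, 20))
    // (10,List(50))

    spark.stop()
  }
}
```

Note that sortByKey() triggers a second shuffle, so this normalization is only worth it when a deterministic order actually matters.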