Explain GroupByKey() operation

This topic has 0 replies, 1 voice, and was last updated 5 years, 7 months ago by DataFlair Team.

Author

Posts
- September 20, 2018 at 11:53 am #4710
  DataFlair Team
  Spectator
  > groupByKey() is a transformation that operates on a pair RDD (an RDD whose elements are key/value tuples).
  > Because each element of a pair RDD is a tuple, any function applied to such an RDD works on the whole tuple rather than on a single value; groupByKey() itself takes no function argument.
  > It groups the values that share the same key into a new RDD.
  > It is a wide operation because it shuffles data across multiple partitions.
  From:
  http://data-flair.training/blogs/rdd-transformations-actions-apis-apache-spark/#29_GroupBy
  It works on key/value pairs and returns a new dataset of grouped items. The new RDD is made up of each key (which identifies a group) paired with the collection of values in that group. The order of elements within a group is not guaranteed and may differ when you apply the same operation to the same RDD repeatedly. It is a wide operation because it shuffles data from multiple partitions to create the new RDD.
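
  To make the "wide operation" point concrete, here is a minimal Spark-free sketch of the shuffle using plain Scala collections. It is only a simplified model, not how Spark is implemented: the names inputPartitions, numOutputPartitions, targetPartition and shuffled are made up for illustration, and the hash-modulo routing is an assumption standing in for whatever partitioner is actually in use.

```scala
// Simplified, Spark-free model of the shuffle behind groupByKey().
// All names here are illustrative; none of them are Spark APIs.
val inputPartitions: Seq[Seq[(Int, Int)]] = Seq(
  Seq((5, 10), (4, 8), (10, 50)), // input partition 0
  Seq((5, 15), (4, 12), (5, 20))  // input partition 1
)
val numOutputPartitions = 2

// Route a key to an output partition with a non-negative hash modulo
// (an assumed stand-in for a hash-based partitioner).
def targetPartition(key: Int): Int =
  Math.floorMod(key.hashCode, numOutputPartitions)

// "Shuffle": every record moves to the partition chosen by its key,
// then values with the same key are collected together there.
val shuffled: Map[Int, Map[Int, Seq[Int]]] =
  inputPartitions.flatten
    .groupBy { case (k, _) => targetPartition(k) }
    .map { case (p, records) =>
      p -> records.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    }
```

  In this model the values for key 5 start out in both input partitions and end up together in one output partition, which is why the grouping cannot be done without moving data between partitions.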
  
  Example :
```
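// Build a pair RDD of (key, value) tuples; `sc` is the SparkContext provided by spark-shell
val rdd1 = sc.parallelize(Seq((5,10),(5,15),(4,8),(4,12),(5,20),(10,50)))
// Group all values that share the same key (this triggers a shuffle)
val rdd2 = rdd1.groupByKey()
rdd2.collect()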
```
  Output:
  Array[(Int, Iterable[Int])] = Array((4,CompactBuffer(8,12)), (10,CompactBuffer(50)), (5,CompactBuffer(10,15,20)))
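
  Since the order of keys and of values within a group is not guaranteed, it can help to normalize the result before comparing it with an expected value. The sketch below reproduces the same grouping with plain Scala collections, so it runs without a cluster. groupByKeyLocal is a hypothetical helper written for this illustration and is not part of Spark.

```scala
// Spark-free sketch: groupByKeyLocal is a hypothetical helper, not a Spark API.
def groupByKeyLocal[K, V](pairs: Seq[(K, V)]): Map[K, Seq[V]] =
  pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

val pairs = Seq((5, 10), (5, 15), (4, 8), (4, 12), (5, 20), (10, 50))
val grouped = groupByKeyLocal(pairs)

// Sort keys and values so results can be compared regardless of ordering.
val normalized: Seq[(Int, Seq[Int])] =
  grouped.toSeq.sortBy(_._1).map { case (k, vs) => (k, vs.sorted) }
// normalized == Seq((4, Seq(8, 12)), (5, Seq(10, 15, 20)), (10, Seq(50)))
```

  The same normalization idea should carry over to the collected Spark result, for example by sorting the array by key and converting each group to a sorted list before comparing.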