Explain the groupByKey() operation

    • #4710
      DataFlair Team
      Spectator

      > groupByKey() is a transformation that operates on a pair RDD (an RDD of key/value tuples).
      > Because a pair RDD contains tuples, functions passed to pair RDD operations work on whole (key, value) tuples rather than on individual elements.
      > It groups the values that share the same key into a new RDD.
      > It is a wide operation because it shuffles data across multiple partitions.
      From:
      http://data-flair.training/blogs/rdd-transformations-actions-apis-apache-spark/#29_GroupBy
      It works on key/value pairs and returns a new dataset of grouped items: a new RDD in which each element is a key (the group) paired with an Iterable of the values that belong to that group. The order of elements within a group is not guaranteed and may differ each time you apply the same operation to the same RDD. It’s a wide operation because it shuffles data from multiple partitions to create the new RDD.
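
      To make the grouping and the shuffle more concrete, here is a minimal sketch in plain Scala collections, with no Spark involved. It only models the idea: each record is routed to a target partition by hashing its key, and values are then grouped per key inside that partition. The names partitions, numPartitions, partitionFor, shuffled and grouped are illustrative assumptions, not Spark APIs, and Spark's real implementation differs in many details.

      ```scala
      object GroupByKeySketch {
        // Hypothetical input: (key, value) records spread over two "partitions".
        val partitions: Seq[Seq[(Int, Int)]] = Seq(
          Seq((5, 10), (5, 15), (4, 8)),
          Seq((4, 12), (5, 20), (10, 50))
        )

        // Illustrative number of target partitions.
        val numPartitions = 2

        // Route a key to a target partition: non-negative modulo of its hash code.
        def partitionFor(key: Int): Int =
          ((key.hashCode % numPartitions) + numPartitions) % numPartitions

        // "Shuffle": move every record to the partition that owns its key,
        // so all values for one key end up in the same place.
        val shuffled: Map[Int, Seq[(Int, Int)]] =
          partitions.flatten.groupBy { case (k, _) => partitionFor(k) }

        // Group the values by key within each target partition.
        val grouped: Map[Int, Seq[Int]] =
          shuffled.values.flatMap { records =>
            records.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
          }.toMap

        def main(args: Array[String]): Unit =
          grouped.toSeq.sortBy(_._1).foreach(println)
      }
      ```

      In this model, key 5 starts out in both input partitions, so its records have to move before they can be grouped together. That movement is the part that makes the operation wide.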

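      Example:

      // Build a pair RDD of (key, value) tuples; sc is assumed to be the SparkContext provided by spark-shell.
      val rdd1 = sc.parallelize(Seq((5,10),(5,15),(4,8),(4,12),(5,20),(10,50)))
      // Group all values that share the same key.
      val rdd2 = rdd1.groupByKey()
      // Bring the grouped result back to the driver.
      rdd2.collect()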


      Output (the order of keys, and of values within a group, may vary between runs):
      Array[(Int, Iterable[Int])] = Array((4,CompactBuffer(8,12)), (10,CompactBuffer(50)), (5,CompactBuffer(10,15,20)))
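
      Since the order returned by collect() is not guaranteed, it can help to sort the grouped result before inspecting it. The following is a sketch of one way to do that as a standalone program, assuming Spark is on the classpath and a local master is acceptable. The object name GroupByKeyExample and the helpers groupedSorted and sumPerKey are made up for illustration; in spark-shell you would reuse the existing sc instead of creating a context.

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}

      object GroupByKeyExample {
        // Same sample data as in the example above.
        val data: Seq[(Int, Int)] = Seq((5, 10), (5, 15), (4, 8), (4, 12), (5, 20), (10, 50))

        // Group by key, then sort keys and values so the result is stable across runs.
        def groupedSorted(sc: SparkContext): Array[(Int, List[Int])] =
          sc.parallelize(data)
            .groupByKey()               // RDD[(Int, Iterable[Int])]
            .mapValues(_.toList.sorted) // fix the order inside each group
            .sortByKey()                // fix the order of the keys
            .collect()

        // Aggregate each group after grouping, here by summing its values.
        def sumPerKey(sc: SparkContext): Array[(Int, Int)] =
          sc.parallelize(data).groupByKey().mapValues(_.sum).sortByKey().collect()

        def main(args: Array[String]): Unit = {
          // Local context for illustration only.
          val conf = new SparkConf().setAppName("GroupByKeyExample").setMaster("local[2]")
          val sc = new SparkContext(conf)
          try {
            groupedSorted(sc).foreach(println)
            sumPerKey(sc).foreach(println)
          } finally {
            sc.stop()
          }
        }
      }
      ```

      With the sample data, the per-key sums are 20 for key 4, 45 for key 5 and 50 for key 10.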
