groupByKey vs reduceByKey in Apache Spark

Viewing 1 reply thread
  • Author
    Posts
    • #6040
      DataFlair Team
      Moderator

      What is the difference between groupByKey and reduceByKey in Spark?
      Which of groupByKey and reduceByKey is a transformation and which is an action?
      While processing an RDD, which is better: groupByKey or reduceByKey?

    • #6045
      DataFlair Team
      Moderator

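      Both groupByKey and reduceByKey are transformations, not actions. Applying groupByKey() to a dataset of (K, V) pairs shuffles the data by key K and produces a new RDD of (K, Iterable<V>) pairs. Because every pair is sent across the network as-is, this transformation causes a lot of unnecessary data transfer.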

      Spark can spill data to disk when more data is shuffled onto a single executor than fits in memory.

      Example:

      val data = spark.sparkContext.parallelize(Array(('k',5),('s',3),('s',4),('p',7),('p',5),('t',8),('k',6)),3)
      val group = data.groupByKey().collect()
      group.foreach(println)
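
      A common follow-up is to aggregate the grouped values. The sketch below is a standalone variant of the example above that sums the values per key after groupByKey. The object name GroupByKeySum, the sumsByGroup helper, and the local[*] master are assumptions for illustration, not part of the original post.

      ```scala
      import org.apache.spark.sql.SparkSession

      object GroupByKeySum {
        // Hypothetical helper: groups the pairs by key, then sums each group's values.
        def sumsByGroup(spark: SparkSession): Map[Char, Int] = {
          val data = spark.sparkContext.parallelize(
            Array(('k',5),('s',3),('s',4),('p',7),('p',5),('t',8),('k',6)), 3)
          // Every (K, V) pair is shuffled before any value is added up.
          data.groupByKey().mapValues(_.sum).collect().toMap
        }

        def main(args: Array[String]): Unit = {
          // local[*] is assumed here so the sketch runs without a cluster.
          val spark = SparkSession.builder().appName("GroupByKeySum").master("local[*]").getOrCreate()
          try sumsByGroup(spark).foreach(println)
          finally spark.stop()
        }
      }
      ```

      The per-key totals should come out as k = 11, s = 7, p = 12, and t = 8; the print order is not guaranteed.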

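      Applying reduceByKey to a dataset of (K, V) pairs first combines the pairs that share a key on the same machine, and only then shuffles the data. Fewer records cross the network, so reduceByKey is usually the better choice when the goal is a per-key aggregation.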

      Example:

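      val words = Array("one","two","two","four","five","six","six","eight","nine","ten")
      // Pair each word with 1, then sum the counts per word.
      val data = spark.sparkContext.parallelize(words).map(w => (w,1)).reduceByKey(_+_)
      // collect() brings the results to the driver so they print in the driver console.
      data.collect().foreach(println)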
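
      To see why the local combine step matters, here is a plain-Scala sketch (no Spark required) that imitates the two strategies and counts the records that would cross the shuffle. The two-partition layout and the names ShuffleSketch, shuffledWithoutCombine, shuffledWithCombine, and merge are assumptions for illustration; real Spark decides the partitioning itself.

      ```scala
      object ShuffleSketch {
        // Assumed layout: the ten words from the example split across two partitions.
        val partitions: Seq[Seq[String]] = Seq(
          Seq("one", "two", "two", "four", "five"),
          Seq("six", "six", "eight", "nine", "ten")
        )

        // groupByKey style: every (word, 1) pair leaves its partition unchanged.
        def shuffledWithoutCombine: Seq[(String, Int)] =
          partitions.flatMap(_.map(w => (w, 1)))

        // reduceByKey style: each partition first sums its own pairs per key,
        // so at most one record per key per partition is sent.
        def shuffledWithCombine: Seq[(String, Int)] =
          partitions.flatMap { part =>
            part.map(w => (w, 1))
              .groupBy(_._1)
              .map { case (k, vs) => (k, vs.map(_._2).sum) }
              .toSeq
          }

        // Reduce side: merge whatever arrived into final per-key totals.
        def merge(records: Seq[(String, Int)]): Map[String, Int] =
          records.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

        def main(args: Array[String]): Unit = {
          println(s"records shuffled without combine: ${shuffledWithoutCombine.size}")
          println(s"records shuffled with combine: ${shuffledWithCombine.size}")
          // Both strategies must agree on the final counts.
          println(merge(shuffledWithCombine) == merge(shuffledWithoutCombine))
        }
      }
      ```

      With this assumed layout, 10 records are shuffled without the combine step and 8 with it. The saving grows as keys repeat more often within a partition.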

      For more operations on Apache Spark, read: RDD Transformation and Action.
