groupByKey vs reduceByKey in Apache Spark

      What is the difference between groupByKey and reduceByKey in Spark?
      Which of groupByKey and reduceByKey is a transformation and which is an action?
      When processing an RDD, which is better: groupByKey or reduceByKey?

      Applying groupByKey() to a dataset of (K, V) pairs shuffles the data by key K into a new RDD of (K, Iterable[V]) pairs. Every pair is sent over the network unchanged, so this transformation can cause a lot of unnecessary data transfer.

      Spark can spill data to disk when more data is shuffled onto a single executor machine than fits in memory.


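      // Build an RDD of (key, value) pairs spread over 3 partitions
      val data = spark.sparkContext.parallelize(Array(('k',5),('s',3),('s',4),('p',7),('p',5),('t',8),('k',6)),3)
      // groupByKey is a lazy transformation; collect() is the action that runs the job
      val group = data.groupByKey().collect()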
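
      The snippet above can be extended into a small standalone sketch that sums the grouped values and marks which calls are transformations and which is the action. It assumes Spark is on the classpath and runs with a local master; the names GroupByKeySketch and groupedSums are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sketch only: object and method names are hypothetical.
object GroupByKeySketch {

  // Sums the values per key using groupByKey, on the same pairs as the example above.
  def groupedSums(spark: SparkSession): Map[Char, Int] = {
    val data = spark.sparkContext.parallelize(
      Seq(('k', 5), ('s', 3), ('s', 4), ('p', 7), ('p', 5), ('t', 8), ('k', 6)), 3)

    // Transformation (lazy): every (key, value) pair is shuffled as-is.
    val grouped = data.groupByKey()

    // Transformation (lazy): the values are only summed after the shuffle.
    val summed = grouped.mapValues(_.sum)

    // Action: this is the call that actually triggers the job.
    summed.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master with 2 threads is enough for this toy data.
    val spark = SparkSession.builder()
      .appName("groupByKey-sketch")
      .master("local[2]")
      .getOrCreate()
    try {
      // Totals per key: k = 11, s = 7, p = 12, t = 8 (print order may vary)
      println(groupedSums(spark))
    } finally {
      spark.stop()
    }
  }
}
```

      Nothing is computed until collect() is called, which is why groupByKey counts as a transformation and not an action.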

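      Applying reduceByKey to a dataset of (K, V) pairs combines the pairs that share a key on the same machine before the data is shuffled, so much less data crosses the network. Like groupByKey, it is a transformation, and it is usually the better choice when the values for each key are being aggregated.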


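      val words = Array("one","two","two","four","five","six","six","eight","nine","ten")
      // Map each word to (word, 1), then sum the counts per word.
      // reduceByKey is lazy: call an action such as collect() to run it.
      val counts = spark.sparkContext.parallelize(words).map(w => (w,1)).reduceByKey(_+_)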
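
      To make the difference in shuffle volume concrete, here is a rough simulation in plain Scala collections (no Spark involved). It is only a model of the idea, not how Spark implements the shuffle: each inner Seq stands in for one partition, and the partition layout and all names (ShuffleSketch, shuffledWithoutCombine, shuffledWithCombine, finalReduce) are invented for this illustration.

```scala
// Plain-Scala model of the shuffle; not Spark's actual implementation.
object ShuffleSketch {
  type Pair = (String, Int)

  // Hypothetical layout: each inner Seq plays the role of one partition.
  val partitions: Seq[Seq[Pair]] = Seq(
    Seq("two" -> 1, "two" -> 1, "one" -> 1),
    Seq("six" -> 1, "six" -> 1, "four" -> 1),
    Seq("two" -> 1, "six" -> 1)
  )

  // groupByKey-style: every record leaves its partition unchanged.
  def shuffledWithoutCombine(parts: Seq[Seq[Pair]]): Seq[Pair] =
    parts.flatten

  // reduceByKey-style: combine values per key inside each partition first,
  // so at most one record per key leaves each partition.
  def shuffledWithCombine(parts: Seq[Seq[Pair]])(f: (Int, Int) => Int): Seq[Pair] =
    parts.flatMap { part =>
      part.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(f) }.toSeq
    }

  // Reduce side: merge whatever records arrived for each key.
  def finalReduce(records: Seq[Pair])(f: (Int, Int) => Int): Map[String, Int] =
    records.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(f) }

  def main(args: Array[String]): Unit = {
    val plain = shuffledWithoutCombine(partitions)
    val combined = shuffledWithCombine(partitions)(_ + _)

    println(s"records shuffled without combine: ${plain.size}")   // 8
    println(s"records shuffled with combine:    ${combined.size}") // 6

    // Both routes end with the same totals per key.
    println(finalReduce(plain)(_ + _) == finalReduce(combined)(_ + _)) // true
  }
}
```

      In this toy layout the saving is small (8 records versus 6), but it grows with the number of repeated keys per partition, since at most one record per key leaves each partition.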

      For more operations on Apache Spark, read: RDD Transformation and Action.

