groupByKey vs reduceByKey in Apache Spark

This topic has 1 reply, 1 voice, and was last updated 7 years, 10 months ago by DataFlair Team.

Author

Posts
- September 20, 2018 at 5:00 pm #6040
  
  DataFlair Team
  Spectator
  
  What is the difference between groupByKey and reduceByKey in Spark?
  Which of groupByKey and reduceByKey is a transformation, and which is an action?
  When processing an RDD, which is better: groupByKey or reduceByKey?
- September 20, 2018 at 5:00 pm #6045
  DataFlair Team
  Spectator
  When groupByKey() is applied to a dataset of (K, V) pairs, the data is shuffled by key K into a new RDD of (K, Iterable[V]) pairs. Because every pair is sent as-is, with no combining beforehand, this transformation moves a lot of data over the network unnecessarily.
  
  Spark can spill data to disk when more data is shuffled onto a single executor machine than fits in memory.
  
  Example:
```
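// Seven (Char, Int) pairs spread across 3 partitions
val data = spark.sparkContext.parallelize(Array(('k',5),('s',3),('s',4),('p',7),('p',5),('t',8),('k',6)),3)
// groupByKey() shuffles every pair; collect() brings the grouped result to the driver
val group = data.groupByKey().collect()
group.foreach(println)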
```
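  As a rough sketch of what the grouped result looks like: groupByKey() is a lazy transformation that returns an RDD of (key, Iterable of values), and nothing is computed until an action such as collect() runs. The snippet below reuses the same pairs and sums each group. The object name, the `local[*]` master and the app name are placeholders chosen for illustration, not part of the original post.

```scala
import org.apache.spark.sql.SparkSession

object GroupByKeySketch {
  // Groups the sample pairs by key, then sums the values inside each group.
  def totalsByGroup(spark: SparkSession): Map[Char, Int] = {
    val data = spark.sparkContext.parallelize(
      Seq(('k', 5), ('s', 3), ('s', 4), ('p', 7), ('p', 5), ('t', 8), ('k', 6)), 3)

    // Transformation: only describes the shuffle, nothing runs yet.
    val grouped = data.groupByKey() // RDD[(Char, Iterable[Int])]

    // mapValues is also a transformation; collect() is the action that triggers the job.
    grouped.mapValues(_.sum).collect().toMap
  }

  def main(args: Array[String]): Unit = {
    // Assumed local session for trying this outside spark-shell.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("groupByKey-sketch")
      .getOrCreate()
    try println(totalsByGroup(spark))
    finally spark.stop()
  }
}
```

  Note that every individual value still crosses the network here before it is summed.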
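  When reduceByKey is applied to a dataset of (K, V) pairs, pairs with the same key on the same machine are combined before the data is shuffled, so far less data crosses the network. Like groupByKey, it is a transformation rather than an action, and for aggregations such as sums or counts it is usually the better choice.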
  
  Example:
```
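val words = Array("one","two","two","four","five","six","six","eight","nine","ten")
// Map each word to (word, 1), then sum the counts per key; values are combined locally before the shuffle
val data = spark.sparkContext.parallelize(words).map(w => (w,1)).reduceByKey(_+_)
// collect() returns the results to the driver so they print there rather than on the executors
data.collect().foreach(println)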
```
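  The effect of combining before the shuffle can be modeled in plain Scala, without Spark. This is only an illustrative sketch: the three partitions and their contents are made up, and the helper names are hypothetical. It counts how many records would cross the network with and without a local combine step and checks that both routes give the same totals.

```scala
// Plain Scala model of a shuffle; no Spark required.
object ShuffleSketch {
  type Pair = (String, Int)

  // Three hypothetical input partitions of (word, 1) pairs.
  val partitions: Seq[Seq[Pair]] = Seq(
    Seq("two" -> 1, "two" -> 1, "one" -> 1),
    Seq("six" -> 1, "six" -> 1, "two" -> 1),
    Seq("six" -> 1, "ten" -> 1)
  )

  // groupByKey-style: every pair is shuffled unchanged.
  def shuffledWithoutCombine(parts: Seq[Seq[Pair]]): Seq[Pair] =
    parts.flatten

  // reduceByKey-style: each partition first merges its own pairs per key.
  def shuffledWithCombine(parts: Seq[Seq[Pair]]): Seq[Pair] =
    parts.flatMap { part =>
      part.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }.toSeq
    }

  // Reduce side: merge whatever arrived into one total per key.
  def totals(shuffled: Seq[Pair]): Map[String, Int] =
    shuffled.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    val plain = shuffledWithoutCombine(partitions)
    val combined = shuffledWithCombine(partitions)
    println(s"records shuffled without combine: ${plain.size}")
    println(s"records shuffled with combine: ${combined.size}")
    println(s"same totals: ${totals(plain) == totals(combined)}")
  }
}
```

  With this made-up input, 8 records are shuffled without the combine step and 6 with it; the gap widens as keys repeat more often within a partition.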
  For more operations on Apache Spark, read: RDD Transformation and Action.