groupByKey vs reduceByKey in Apache Spark
September 20, 2018 at 5:00 pm, #6040, DataFlair Team (Spectator)
What is the difference between groupByKey and reduceByKey in Spark?
Which of groupByKey and reduceByKey is a transformation, and which is an action?
When processing an RDD, which is better: groupByKey or reduceByKey?
September 20, 2018 at 5:00 pm, #6045, DataFlair Team (Spectator)
Both groupByKey and reduceByKey are transformations, not actions. Applying groupByKey() to a dataset of (K, V) pairs shuffles the data by key K into a new RDD of (K, Iterable[V]) pairs. Because no values are combined before the shuffle, every pair is sent across the network, which causes a lot of unnecessary data transfer.
Spark can spill data to disk when more data is shuffled onto a single executor machine than can fit in memory.
Example:
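// Build a pair RDD with 3 partitions
val data = spark.sparkContext.parallelize(Array(('k',5),('s',3),('s',4),('p',7),('p',5),('t',8),('k',6)), 3)
// Group all values by key, then bring the result back to the driver
val group = data.groupByKey().collect()
group.foreach(println)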
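To make the shape of the result easier to see, here is a minimal sketch that models groupByKey with plain Scala collections instead of Spark. The object name GroupByKeyShape and the use of a local Seq are assumptions made for illustration only; nothing is actually shuffled here. It shows that grouping keeps every value unreduced under its key, so any aggregation has to happen as a separate step afterwards.

```scala
object GroupByKeyShape {
  // Same pairs as in the example above.
  val pairs: Seq[(Char, Int)] =
    Seq(('k', 5), ('s', 3), ('s', 4), ('p', 7), ('p', 5), ('t', 8), ('k', 6))

  // Local stand-in for groupByKey: every value is kept, unreduced, under its key.
  val grouped: Map[Char, Seq[Int]] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

  // Aggregation is a separate step applied after grouping.
  val sums: Map[Char, Int] =
    grouped.map { case (k, vs) => k -> vs.sum }

  def main(args: Array[String]): Unit = {
    grouped.toSeq.sortBy(_._1).foreach(println)
    sums.toSeq.sortBy(_._1).foreach(println)
  }
}
```

In Spark itself, the grouped values arrive as an Iterable per key, and the full list for a key has to be moved to one place before it can be summed.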
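With reduceByKey on a dataset of (K, V) pairs, pairs with the same key in the same partition are combined before the data is shuffled, so far less data crosses the network. For aggregations, reduceByKey is therefore usually the better choice.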
Example:
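val words = Array("one","two","two","four","five","six","six","eight","nine","ten")
// Map each word to (word, 1), then sum the counts per key
val data = spark.sparkContext.parallelize(words).map(w => (w,1)).reduceByKey(_+_)
// collect() brings the results to the driver so they print there, not on the executors
data.collect().foreach(println)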
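The difference in shuffle volume can be sketched without Spark by counting how many records would leave each partition. This is a rough model in plain Scala, not Spark's actual shuffle: the object name ShuffleVolume and the three-way split of the words are hypothetical, and the real partition layout depends on how parallelize distributes the data.

```scala
object ShuffleVolume {
  // Hypothetical layout: the words from the example above split across three partitions.
  val partitions: Seq[Seq[String]] = Seq(
    Seq("one", "two", "two", "four"),
    Seq("five", "six", "six"),
    Seq("eight", "nine", "ten")
  )

  // groupByKey style: every (word, 1) pair leaves its partition unchanged.
  val shuffledWithoutCombine: Int = partitions.map(_.size).sum

  // reduceByKey style: counts are summed per key inside each partition first.
  val combined: Seq[Map[String, Int]] =
    partitions.map(_.groupBy(identity).map { case (w, ws) => w -> ws.size })

  // Only one record per distinct key per partition is shuffled.
  val shuffledWithCombine: Int = combined.map(_.size).sum

  // Final merge after the modeled shuffle: add up the partial counts per key.
  val counts: Map[String, Int] =
    combined.flatten.groupBy(_._1).map { case (w, kvs) => w -> kvs.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    println(s"records shuffled without combining: $shuffledWithoutCombine")
    println(s"records shuffled with combining: $shuffledWithCombine")
    counts.toSeq.sortBy(_._1).foreach(println)
  }
}
```

With this small input the saving is modest, since only "two" and "six" repeat within a partition. The gap grows with the number of duplicate keys per partition.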
For more operations in Apache Spark, read: RDD Transformation and Action.