Free Online Certification Courses – Learn Today. Lead Tomorrow. › Forums › Apache Spark › Explain the repartition() operation
- This topic has 1 reply, 1 voice, and was last updated 6 years ago by DataFlair Team.
-
AuthorPosts
-
-
September 20, 2018 at 2:24 pm #5151DataFlair TeamSpectator
Explain the repartition() operation
-
September 20, 2018 at 2:26 pm #5162DataFlair TeamSpectator
> repartition() is a transformation.
> This function changes the number of partitions mentioned in parameter numPartitions(numPartitions : Int)
> It’s in package org.apache.spark.rdd.ShuffledRDD
def repartition(numPartitions: Int)(implicit ord: Ordering[(K, C)] = null): RDD[(K, C)]
Return a new RDD that has exactly numPartitions partitions.
Can increase or decrease the level of parallelism in this RDD. Internally, this uses a shuffle to redistribute data.
If you are decreasing the number of partitions in this RDD, consider using coalesce, which can avoid performing a shuffle.From :
http://data-flair.training/blogs/rdd-transformations-actions-apis-apache-spark/
Repartition will reshuffle the data in your RDD to produce the final number of partitions you request. it may reduce or increase the number of partitions and shuffles data all over the network.Example :
val rdd1 = sc.parallelize(1 to 100, 3) rdd1.getNumPartitions val rdd2 = rdd1.repartition(6) rdd2.getNumPartitions
Output :
Int = 3
Int = 6
-
-
AuthorPosts
- You must be logged in to reply to this topic.