> repartition() is a transformation.
> This function changes the number of partitions mentioned in parameter numPartitions(numPartitions : Int)
> It’s in package org.apache.spark.rdd.ShuffledRDD
def repartition(numPartitions: Int)(implicit ord: Ordering[(K, C)] = null): RDD[(K, C)]
Return a new RDD that has exactly numPartitions partitions.
Can increase or decrease the level of parallelism in this RDD. Internally, this uses a shuffle to redistribute data.
If you are decreasing the number of partitions in this RDD, consider using coalesce, which can avoid performing a shuffle.
From :
http://data-flair.training/blogs/rdd-transformations-actions-apis-apache-spark/
Repartition will reshuffle the data in your RDD to produce the final number of partitions you request. it may reduce or increase the number of partitions and shuffles data all over the network.
Example :
val rdd1 = sc.parallelize(1 to 100, 3)
rdd1.getNumPartitions
val rdd2 = rdd1.repartition(6)
rdd2.getNumPartitions
Output :
Int = 3
Int = 6