Explain the repartition() operation

This topic has 1 reply, 1 voice, and was last updated 5 years, 7 months ago by DataFlair Team.

Viewing 1 reply thread

Author

Posts
- September 20, 2018 at 2:24 pm #5151
  
  DataFlair Team
  Spectator
  
  Explain the repartition() operation
- September 20, 2018 at 2:26 pm #5162
  DataFlair Team
  Spectator
  > repartition() is a transformation.
  > This function changes the number of partitions mentioned in parameter numPartitions(numPartitions : Int)
  > It’s in package org.apache.spark.rdd.ShuffledRDD
  
  def repartition(numPartitions: Int)(implicit ord: Ordering[(K, C)] = null): RDD[(K, C)]
  Return a new RDD that has exactly numPartitions partitions.
  Can increase or decrease the level of parallelism in this RDD. Internally, this uses a shuffle to redistribute data.
  If you are decreasing the number of partitions in this RDD, consider using coalesce, which can avoid performing a shuffle.
  
  From :
  http://data-flair.training/blogs/rdd-transformations-actions-apis-apache-spark/
  Repartition will reshuffle the data in your RDD to produce the final number of partitions you request. it may reduce or increase the number of partitions and shuffles data all over the network.
  
  Example :
```
val rdd1 = sc.parallelize(1 to 100, 3)
rdd1.getNumPartitions
val rdd2 = rdd1.repartition(6)
rdd2.getNumPartitions
```
  Output :
  Int = 3
  Int = 6
Author

Posts

Viewing 1 reply thread

You must be logged in to reply to this topic.