Explain the repartition() operation

Viewing 1 reply thread
  • Author
    • #5151
      DataFlair Team

      Explain the repartition() operation

    • #5162
      DataFlair Team

      > repartition() is a transformation.
      > This function changes the number of partitions mentioned in parameter numPartitions(numPartitions : Int)
      > It’s in package org.apache.spark.rdd.ShuffledRDD

      def repartition(numPartitions: Int)(implicit ord: Ordering[(K, C)] = null): RDD[(K, C)]

      Return a new RDD that has exactly numPartitions partitions.
      Can increase or decrease the level of parallelism in this RDD. Internally, this uses a shuffle to redistribute data.
      If you are decreasing the number of partitions in this RDD, consider using coalesce, which can avoid performing a shuffle.

      From :
      Repartition will reshuffle the data in your RDD to produce the final number of partitions you request. it may reduce or increase the number of partitions and shuffles data all over the network.

      Example :

      val rdd1 = sc.parallelize(1 to 100, 3)
      val rdd2 = rdd1.repartition(6)

      Output :
      Int = 3
      Int = 6

Viewing 1 reply thread
  • You must be logged in to reply to this topic.