Explain the repartition() operation

    • #5151
      DataFlair Team
      Spectator

      Explain the repartition() operation

    • #5162
      DataFlair Team
      Spectator

      > repartition() is a transformation.
      > It returns a new RDD whose number of partitions is the value passed in its parameter, numPartitions: Int.
      > It is defined on the RDD class (org.apache.spark.rdd.RDD); ShuffledRDD is the RDD type used internally to carry out the shuffle.


      def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]

      Return a new RDD that has exactly numPartitions partitions.
      Can increase or decrease the level of parallelism in this RDD. Internally, this uses a shuffle to redistribute data.
      If you are decreasing the number of partitions in this RDD, consider using coalesce, which can avoid performing a shuffle.
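
      The difference between the two calls can be sketched as follows. This is a minimal illustration, not part of the original answer: the app name and the local[2] master are assumptions made for the example, and in spark-shell SparkContext.getOrCreate simply returns the existing sc.

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}

      // Assumed local setup for illustration; in spark-shell the existing `sc` is reused.
      val conf = new SparkConf().setAppName("repartition-sketch").setMaster("local[2]")
      val sc = SparkContext.getOrCreate(conf)

      val rdd = sc.parallelize(1 to 100, 6)

      // Decreasing: coalesce merges existing partitions and, by default, does not shuffle.
      val fewer = rdd.coalesce(2)
      println(fewer.getNumPartitions)   // 2

      // Increasing requires a shuffle; repartition(n) behaves like coalesce(n, shuffle = true).
      val more = rdd.repartition(12)
      println(more.getNumPartitions)    // 12

      // Without a shuffle, coalesce cannot increase the partition count.
      val unchanged = rdd.coalesce(12)
      println(unchanged.getNumPartitions) // 6
      ```

      So coalesce is the cheaper choice when only reducing the partition count, while repartition is needed to increase it or to rebalance data evenly.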

      From:
      http://data-flair.training/blogs/rdd-transformations-actions-apis-apache-spark/
      Repartition reshuffles the data in your RDD to produce the number of partitions you request. It can either reduce or increase the number of partitions, and in both cases it shuffles data across the network.

      Example:

      val rdd1 = sc.parallelize(1 to 100, 3)  // create an RDD with 3 partitions
      rdd1.getNumPartitions
      val rdd2 = rdd1.repartition(6)          // shuffle the data into 6 partitions
      rdd2.getNumPartitions

      Output:
      Int = 3
      Int = 6
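
      To see how the elements are spread across partitions before and after the call, one option is to count the elements per partition with glom(). This is a hedged sketch rather than part of the original answer: the local context setup is an assumption for illustration, and the exact sizes after the shuffle can vary, so none are shown.

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}

      // Assumed local setup for illustration; in spark-shell the existing `sc` is reused.
      val conf = new SparkConf().setAppName("repartition-sizes").setMaster("local[2]")
      val sc = SparkContext.getOrCreate(conf)

      val rdd1 = sc.parallelize(1 to 100, 3)
      val rdd2 = rdd1.repartition(6)

      // glom() turns each partition into an array, so its length is the partition size.
      val sizesBefore = rdd1.glom().map(_.length).collect()
      val sizesAfter = rdd2.glom().map(_.length).collect()

      println(s"before: ${sizesBefore.mkString(", ")}")
      println(s"after:  ${sizesAfter.mkString(", ")}")
      ```

      The partition count changes from 3 to 6, while the total number of elements stays at 100.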
