Explain the mapPartitions() and mapPartitionsWithIndex()

Free Online Certification Courses – Learn Today. Lead Tomorrow. Forums Apache Spark Explain the mapPartitions() and mapPartitionsWithIndex()

Viewing 4 reply threads
  • Author
    Posts
    • #5739
      DataFlair TeamDataFlair Team
      Spectator

      Explain the mapPartitions() and mapPartitionsWithIndex()

    • #5742
      DataFlair TeamDataFlair Team
      Spectator

      mapPartitions() and mapPartitionsWithIndex() are both transformation.


      mapPartitions() :

      > mapPartitions() can be used as an alternative to map() and foreach() .
      > mapPartitions() can be called for each partitions while map() and foreach() is called for each elements in an RDD
      > Hence one can do the initialization on per-partition basis rather than each element basis


      mapPartitions() :

      > mapPartitionsWithIndex is similar to mapPartitions() but it provides second parameter index which keeps the track of partition.

      From :
      http://data-flair.training/blogs/rdd-transformations-actions-apis-apache-spark/#24_MapPartitions

      2.4. MapPartitions:

      It runs one at a time on each partition or block of the Rdd, so function must be of type iterator<T>. It improves performance by reducing creation of object in map function.
      2.5. MappartionwithIndex:

      It is similar to MapPartition but with one difference that it takes two parameters, the first parameter is the index and second is an iterator through all items within this partition (Int, Iterator<t>).

    • #5745
      DataFlair TeamDataFlair Team
      Spectator

      From :
      http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD

      mapPartitions() syntax

      def mapPartitions[U](f: (Iterator[T]) ⇒ Iterator[U], preservesPartitioning: Boolean = false)(implicit arg0: ClassTag[U]): RDD[U]
      Permalink
      Return a new RDD by applying a function to each partition of this RDD.

    • #5752
      DataFlair TeamDataFlair Team
      Spectator

      From :
      http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD


      mapPartitionsWithIndex syntax

      def mapPartitionsWithIndex[U](f: (Int, Iterator[T]) ⇒ Iterator[U], preservesPartitioning: Boolean = false)(implicit arg0: ClassTag[U]): RDD[U]

      Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition.

    • #5756
      DataFlair TeamDataFlair Team
      Spectator

      Mappartitions is a transformation that is similar to Map.

      In Map, a function is applied to each and every element of an RDD and returns each and every other element of the resultant RDD. In the case of mapPartitions, instead of each element, the function is applied to each partition of RDD and returns multiple elements of the resultant RDD. In mapPartitions transformation, the performance is improved since the object creation is eliminated for each and every element as in map transformation.

      Since the mapPartitions transformation works on each partition, it takes an iterator of string or int values as an input for a partition.

      Consider the following example:

      val data = sc.parallelize(List(1,2,3,4,5,6,7,8), 2)
      
      Map:
      
      def sumfuncmap(numbers : Int) : Int =
      {
      var sum = 1
      
      return sum + numbers
      }
      
      data.map(sumfuncmap).collect
      
      returns Array[Int] = Array(2, 3, 4, 5, 6, 7, 8, 9)         //Applied to each and every element

      MapPartitions:

      def sumfuncpartition(numbers : Iterator[Int]) : Iterator[Int] =
      {
      var sum = 1
      while(numbers.hasNext)
      {
      sum = sum + numbers.next()
      }
      return Iterator(sum)
      }
      
      data.mapPartitions(sumfuncpartition).collect
      
      returns
      
      Array[Int] = Array(11, 27)         // Applied to each and every element partition-wise

      MapPartitionsWithIndex is similar to mapPartitions, except that it takes one more argument as input, which is the index of the partition.

Viewing 4 reply threads
  • You must be logged in to reply to this topic.