Free Online Certification Courses – Learn Today. Lead Tomorrow. › Forums › Apache Spark › Explain the mapPartitions() and mapPartitionsWithIndex()
This topic has 4 replies, 1 voice, and was last updated 6 years ago by DataFlair Team.
September 20, 2018 at 4:06 pm · #5739 · DataFlair Team (Spectator)
Explain the mapPartitions() and mapPartitionsWithIndex()
September 20, 2018 at 4:06 pm · #5742 · DataFlair Team (Spectator)
mapPartitions() and mapPartitionsWithIndex() are both transformations.
mapPartitions():
> mapPartitions() can be used as an alternative to map() and foreach().
> mapPartitions() is called once for each partition, while map() and foreach() are called for each element in an RDD.
> Hence one can do initialization on a per-partition basis rather than a per-element basis.
mapPartitionsWithIndex():
> mapPartitionsWithIndex() is similar to mapPartitions(), but its function takes an additional index parameter that keeps track of which partition is being processed.
From:
http://data-flair.training/blogs/rdd-transformations-actions-apis-apache-spark/#24_MapPartitions

2.4. MapPartitions:
It runs one at a time on each partition (block) of the RDD, so the function must be of type Iterator[T] => Iterator[U]. It improves performance by reducing object creation in the map function.
2.5. MapPartitionsWithIndex:
It is similar to mapPartitions, but with one difference: its function takes two parameters, where the first is the partition index and the second is an iterator through all items within that partition (Int, Iterator[T]).
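The per-partition initialization mentioned above can be sketched as follows. This is an untested sketch, assuming a SparkContext `sc` is in scope; `createConnection()` and `lookup()` are hypothetical stand-ins for whatever expensive per-partition resource your job actually needs:

```scala
val rdd = sc.parallelize(Seq("a", "b", "c", "d"), 2)

val looked = rdd.mapPartitions { iter =>
  // Hypothetical helper: built once per partition, not once per element.
  val conn = createConnection()
  iter.map { elem =>
    conn.lookup(elem)   // reuse the same connection for every element
  }
}
```

With plain map(), createConnection() would run once per element; here it runs only twice, once per partition.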
September 20, 2018 at 4:06 pm · #5745 · DataFlair Team (Spectator)
From:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
mapPartitions() syntax
def mapPartitions[U](f: (Iterator[T]) ⇒ Iterator[U], preservesPartitioning: Boolean = false)(implicit arg0: ClassTag[U]): RDD[U]
Return a new RDD by applying a function to each partition of this RDD.
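A minimal usage sketch of this signature (assuming a SparkContext `sc` is in scope). The preservesPartitioning flag only matters for pair RDDs: if the function does not change the keys, you can tell Spark that the existing partitioner still applies:

```scala
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
  .partitionBy(new HashPartitioner(2))

// Keys are untouched, so it is safe to keep the partitioner.
val incremented = pairs.mapPartitions(
  iter => iter.map { case (k, v) => (k, v + 1) },
  preservesPartitioning = true
)
```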
September 20, 2018 at 4:07 pm · #5752 · DataFlair Team (Spectator)
From:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
mapPartitionsWithIndex syntax
def mapPartitionsWithIndex[U](f: (Int, Iterator[T]) ⇒ Iterator[U], preservesPartitioning: Boolean = false)(implicit arg0: ClassTag[U]): RDD[U]
Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition.
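To illustrate the tracked index, the sketch below (again assuming a SparkContext `sc` is in scope) tags every element with the partition it lives in:

```scala
val data = sc.parallelize(1 to 8, 2)

val tagged = data.mapPartitionsWithIndex { (index, iter) =>
  iter.map(x => s"partition $index -> element $x")
}
tagged.collect().foreach(println)
```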
September 20, 2018 at 4:07 pm · #5756 · DataFlair Team (Spectator)
mapPartitions is a transformation that is similar to map.
In map, a function is applied to every element of an RDD and produces one element of the resulting RDD per input element. In the case of mapPartitions, the function is instead applied to each partition of the RDD and may return any number of elements for the resulting RDD. mapPartitions improves performance because the per-element object creation of the map transformation is avoided.
Since the mapPartitions transformation works on whole partitions, it takes an iterator over the partition's values (for example Iterator[Int] or Iterator[String]) as its input.
Consider the following example:
val data = sc.parallelize(List(1,2,3,4,5,6,7,8), 2)

Map:

def sumfuncmap(numbers : Int) : Int = {
  var sum = 1
  return sum + numbers
}

data.map(sumfuncmap).collect
// returns Array[Int] = Array(2, 3, 4, 5, 6, 7, 8, 9) -- applied to each and every element
MapPartitions:
def sumfuncpartition(numbers : Iterator[Int]) : Iterator[Int] = {
  var sum = 1
  while(numbers.hasNext) {
    sum = sum + numbers.next()
  }
  return Iterator(sum)
}

data.mapPartitions(sumfuncpartition).collect
// returns Array[Int] = Array(11, 27) -- applied to each partition
mapPartitionsWithIndex is similar to mapPartitions, except that its function takes one more argument as input: the index of the partition.
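Continuing the example above, a per-partition sum that also reports which partition produced it might look like the following sketch. With the same data RDD, the two partitions would sum to 10 and 26 (no extra starting value of 1 this time):

```scala
def sumfuncpartitionwithindex(index : Int, numbers : Iterator[Int]) : Iterator[(Int, Int)] = {
  Iterator((index, numbers.sum))
}

data.mapPartitionsWithIndex(sumfuncpartitionwithindex).collect
// e.g. Array((0,10), (1,26))
```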