Free Online Certification Courses – Learn Today. Lead Tomorrow. › Forums › Apache Spark › Explain the mapPartitions() and mapPartitionsWithIndex()
This topic has 4 replies, 1 voice, and was last updated 6 years ago by DataFlair Team.
September 20, 2018 at 4:06 pm · #5739 · DataFlair Team (Spectator)
Explain the mapPartitions() and mapPartitionsWithIndex()
September 20, 2018 at 4:06 pm · #5742 · DataFlair Team (Spectator)
mapPartitions() and mapPartitionsWithIndex() are both transformations.
mapPartitions():
> mapPartitions() can be used as an alternative to map() and foreach().
> mapPartitions() is called once for each partition, while map() and foreach() are called for each element in an RDD.
> Hence one can do initialization on a per-partition basis rather than a per-element basis.
mapPartitionsWithIndex():
> mapPartitionsWithIndex() is similar to mapPartitions(), but its function takes an additional index parameter that keeps track of which partition is being processed.
From:
http://data-flair.training/blogs/rdd-transformations-actions-apis-apache-spark/#24_MapPartitions

2.4. MapPartitions:
It runs one at a time on each partition (block) of the RDD, so the function must be of type Iterator[T] => Iterator[U]. It improves performance by reducing object creation in the map function.
2.5. MapPartitionsWithIndex:
It is similar to mapPartitions, but with one difference: its function takes two parameters, where the first is the partition index and the second is an iterator through all items within that partition (Int, Iterator[T]).
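The per-partition initialization mentioned above can be sketched as follows. This is an untested sketch, assuming a SparkContext `sc` is in scope; `createConnection()` and `lookup()` are hypothetical stand-ins for whatever expensive per-partition resource your job actually needs:

```scala
val rdd = sc.parallelize(Seq("a", "b", "c", "d"), 2)

val looked = rdd.mapPartitions { iter =>
  // Hypothetical helper: built once per partition, not once per element.
  val conn = createConnection()
  iter.map { elem =>
    conn.lookup(elem)   // reuse the same connection for every element
  }
}
```

With plain map(), createConnection() would run once per element; here it runs only twice, once per partition.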
September 20, 2018 at 4:06 pm · #5745 · DataFlair Team (Spectator)
From:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
mapPartitions() syntax
def mapPartitions[U](f: (Iterator[T]) ⇒ Iterator[U], preservesPartitioning: Boolean = false)(implicit arg0: ClassTag[U]): RDD[U]
Return a new RDD by applying a function to each partition of this RDD.
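A minimal usage sketch of this signature (assuming a SparkContext `sc` is in scope). The preservesPartitioning flag only matters for pair RDDs: if the function does not change the keys, you can tell Spark that the existing partitioner still applies:

```scala
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
  .partitionBy(new HashPartitioner(2))

// Keys are untouched, so it is safe to keep the partitioner.
val incremented = pairs.mapPartitions(
  iter => iter.map { case (k, v) => (k, v + 1) },
  preservesPartitioning = true
)
```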
September 20, 2018 at 4:07 pm · #5752 · DataFlair Team (Spectator)
From:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
mapPartitionsWithIndex syntax
def mapPartitionsWithIndex[U](f: (Int, Iterator[T]) ⇒ Iterator[U], preservesPartitioning: Boolean = false)(implicit arg0: ClassTag[U]): RDD[U]
Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition.
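To illustrate the tracked index, the sketch below (again assuming a SparkContext `sc` is in scope) tags every element with the partition it lives in:

```scala
val data = sc.parallelize(1 to 8, 2)

val tagged = data.mapPartitionsWithIndex { (index, iter) =>
  iter.map(x => s"partition $index -> element $x")
}
tagged.collect().foreach(println)
```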
September 20, 2018 at 4:07 pm · #5756 · DataFlair Team (Spectator)
mapPartitions is a transformation that is similar to map.
In map, a function is applied to every element of an RDD and produces one element of the resulting RDD per input element. In the case of mapPartitions, the function is instead applied to each partition of the RDD and may return any number of elements for the resulting RDD. mapPartitions improves performance because the per-element object creation of the map transformation is avoided.
Since the mapPartitions transformation works on whole partitions, it takes an iterator over the partition's values (for example Iterator[Int] or Iterator[String]) as its input.
Consider the following example:
val data = sc.parallelize(List(1,2,3,4,5,6,7,8), 2)

Map:

def sumfuncmap(numbers : Int) : Int = {
  var sum = 1
  return sum + numbers
}

data.map(sumfuncmap).collect
// returns Array[Int] = Array(2, 3, 4, 5, 6, 7, 8, 9) -- applied to each and every element
MapPartitions:
def sumfuncpartition(numbers : Iterator[Int]) : Iterator[Int] = {
  var sum = 1
  while(numbers.hasNext) {
    sum = sum + numbers.next()
  }
  return Iterator(sum)
}

data.mapPartitions(sumfuncpartition).collect
// returns Array[Int] = Array(11, 27) -- applied to each partition
mapPartitionsWithIndex is similar to mapPartitions, except that its function takes one more argument as input: the index of the partition.
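Continuing the example above, a per-partition sum that also reports which partition produced it might look like the following sketch. With the same data RDD, the two partitions would sum to 10 and 26 (no extra starting value of 1 this time):

```scala
def sumfuncpartitionwithindex(index : Int, numbers : Iterator[Int]) : Iterator[(Int, Int)] = {
  Iterator((index, numbers.sum))
}

data.mapPartitionsWithIndex(sumfuncpartitionwithindex).collect
// e.g. Array((0,10), (1,26))
```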