What is aggregate() function. How it can be used with spark RDD?

Live instructor-led & Self-paced Online Certification Training Courses (Big Data, Hadoop, Spark) Forums Apache Spark What is aggregate() function. How it can be used with spark RDD?

Viewing 1 reply thread
  • Author
    Posts
    • #6500
      DataFlair Team
      Moderator

      Anyone can please let you explain aggregate() action by example.

    • #6501
      DataFlair Team
      Moderator

      aggregate() function

      In order to get data type different from the input type, this function gives us the flexibility. Basically, to get the final result, the aggregate() takes two functions. Form first one, we combine the element from our RDD along with the accumulator, and from the second one, it combines the accumulator. Thus, we supply the initial zero value of the type which we want to return, in aggregate.

      For Example:

      scala> val inputrdd = sc.parallelize(
      List(
      (“maths”, 21),
      (“english”, 22),
      (“science”, 31)
      ),
      3
      )
      inputrdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[14] at parallelize at :21

      scala> inputrdd.partitions.size
      res18: Int = 3

      scala> val result = inputrdd.aggregate(3) (
      /*
      * This is a seqOp for merging T into a U
      * ie (String, Int) in into Int
      * (we take (String, Int) in ‘value’ & return Int)
      * Arguments :
      * acc : Reprsents the accumulated result
      * value : Represents the element in ‘inputrdd’
      * In our case this of type (String, Int)
      * Return value
      * We are returning an Int
      */
      (acc, value) => (acc + value._2),

      /*
      * This is a combOp for mergining two U’s
      * (ie 2 Int)
      */
      (acc1, acc2) => (acc1 + acc2)
      )
      result: Int = 86

      The result is calculated as follows,

      Partition 1 : Sum(all Elements) + 3 (Zero value)
      Partition 2 : Sum(all Elements) + 3 (Zero value)
      Partition 3 : Sum(all Elements) + 3 (Zero value)
      Result = Partition1 + Partition2 + Partition3 + 3(Zero value)

      So we get 21 + 22 + 31 + (4 * 3) = 86.

      Learn more Spark RDD Operations-Transformation & Action with Example, follow the link: Spark RDD Operations-Transformation & Action with Example

Viewing 1 reply thread
  • You must be logged in to reply to this topic.