Explain distinct(), union(), intersection() and subtract() transformations in Spark


  • Author
    Posts
    • #6371
      DataFlair Team
      Moderator

      distinct() transformation

      • If you want only the unique elements of an RDD, use d1.distinct(), where d1 is an RDD.

      Example

      val d1 = sc.parallelize(List("c","c","p","m","t"))
      val result = d1.distinct()
      result.foreach{println}

      Output (order may vary between runs):
      p
      t
      m
      c

      To learn all the transformation operations with examples, refer to: Spark RDD Operations-Transformation & Action with Example

    • #6372
      DataFlair Team
      Moderator

      union() transformation

      • It is the simplest set operation.
      • rdd1.union(rdd2) returns an RDD that contains the data from both sources.
      • If the input RDDs contain duplicates, the output of union() contains them too; apply distinct() to the result to remove them.

      Example

      val u1 = sc.parallelize(List("c","c","p","m","t"))
      val u2 = sc.parallelize(List("c","m","k"))
      val result = u1.union(u2)
      result.foreach{println}

      Output:
      c
      c
      p
      m
      t
      c
      m
      k
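
      The duplicate removal mentioned in the bullets above can be sketched by chaining distinct() onto the union. This is a minimal sketch, assuming Spark is on the classpath and a local master is acceptable; the app name and the name merged are illustrative only. SparkContext.getOrCreate reuses an already running context (for example the one in spark-shell) instead of creating a second one.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode, for illustration only.
// getOrCreate returns the existing context if one is already running.
val conf = new SparkConf().setAppName("union-distinct-sketch").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

val u1 = sc.parallelize(List("c","c","p","m","t"))
val u2 = sc.parallelize(List("c","m","k"))

// union() keeps every element of both RDDs, duplicates included;
// distinct() then leaves a single copy of each element.
val merged = u1.union(u2).distinct()

// collect() brings the result to the driver; sorting makes the printed order stable.
val mergedSorted = merged.collect().sorted
mergedSorted.foreach(println)
```

      Sorted on the driver, this should print c, k, m, p, t, one per line. Collecting is only reasonable here because the data is tiny.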

    • #6373
      DataFlair Team
      Moderator

      intersection() transformation

      • intersection(anotherrdd) returns the elements that are present in both RDDs.
      • intersection(anotherrdd) removes all duplicates, including those within a single RDD.

      Example

      val is1 = sc.parallelize(List("c","c","p","m","t"))
      val is2 = sc.parallelize(List("c","m","k"))
      val result = is1.intersection(is2)
      result.foreach{println}

      Output:
      m
      c

    • #6374
      DataFlair Team
      Moderator

      subtract() transformation

      • subtract(anotherrdd).
      • It returns an RDD that contains only the values present in the first RDD and not in the second RDD.

      Example

      val s1 = sc.parallelize(List("c","c","p","m","t"))
      val s2 = sc.parallelize(List("c","m","k"))
      val result = s1.subtract(s2)
      result.foreach{println}

      Output:
      t
      p
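
      One difference from intersection() is worth a quick check: subtract() does not deduplicate, so an element that is repeated in the first RDD and absent from the second is kept as many times as it occurs. A minimal sketch, assuming Spark is on the classpath and a local master; the names left, right, onlyLeft and common are illustrative and not from the posts above.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode, for illustration only.
val conf = new SparkConf().setAppName("subtract-sketch").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

val left  = sc.parallelize(List("p","p","c","t"))
val right = sc.parallelize(List("c","k"))

// subtract() drops every element that also occurs in `right`;
// surviving elements keep their repeats, so "p" appears twice.
val onlyLeft = left.subtract(right).collect().sorted

// intersection() returns each common element exactly once.
val common = left.intersection(right).collect().sorted

println(onlyLeft.mkString(","))
println(common.mkString(","))
```

      If exactly one copy of each remaining value is needed, distinct() can be applied to the result of subtract().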

      For more transformations in Apache Spark, refer to Transformation and Action.

    • #6375
      DataFlair Team
      Moderator

      Adding one more point about the distinct() transformation:

      The distinct() transformation is an expensive operation, as it requires shuffling all the data over the network to ensure that we receive only one copy of each element.
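
      One way to see the shuffle is to print the lineage of the resulting RDD with toDebugString. This is a hedged sketch, assuming Spark is on the classpath and a local master. The exact lineage text differs between Spark versions, so no output is shown here. The partition count of 2 is an arbitrary illustrative value passed to the distinct(numPartitions) overload.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode, for illustration only.
val conf = new SparkConf().setAppName("distinct-shuffle-sketch").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

val d1 = sc.parallelize(List("c","c","p","m","t"))

// distinct(numPartitions) sets how many partitions the deduplicated RDD has.
val result = d1.distinct(2)

// toDebugString describes the lineage; for an RDD without a partitioner,
// it is expected to include a shuffle step.
val lineage = result.toDebugString
println(lineage)
println(result.getNumPartitions)
```

      Since the cost grows with the amount of data shuffled, it is usually worth calling distinct() only where duplicates actually matter.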
