Explain distinct(), union(), intersection() and subtract() transformations in Spark
- This topic has 4 replies, 1 voice, and was last updated 5 years, 6 months ago by DataFlair Team.
September 20, 2018 at 9:26 pm #6371 DataFlair Team (Spectator)
distinct() transformation
- To keep only the unique elements of an RDD, call d1.distinct(), where d1 is an RDD.
Example
val d1 = sc.parallelize(List("c","c","p","m","t"))
val result = d1.distinct()
// collect() brings the elements to the driver so they are printed in the local console
result.collect().foreach(println)
Output (order may vary):
p
t
m
c
To learn all transformation operations with examples, see: Spark RDD Operations-Transformation & Action with Example
September 20, 2018 at 9:26 pm #6372 DataFlair Team (Spectator)
union() transformation
- It is the simplest set operation.
- rdd1.union(rdd2) returns an RDD containing the elements of both source RDDs.
- If the input RDDs contain duplicates, the output of union() keeps them; apply distinct() afterwards to remove them.
Example
val u1 = sc.parallelize(List("c","c","p","m","t"))
val u2 = sc.parallelize(List("c","m","k"))
val result = u1.union(u2)
// collect() brings the elements to the driver so they are printed in the local console
result.collect().foreach(println)
Output:
c
c
p
m
t
c
m
k
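The point about removing duplicates after union() can be sketched as below. This is a minimal script-style sketch, assuming a local Spark setup; the app name and the `local[*]` master are assumptions for illustration, and in spark-shell `getOrCreate` simply reuses the context that already exists.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; getOrCreate reuses an existing context (e.g. inside spark-shell).
val conf = new SparkConf().setAppName("union-distinct-sketch").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

val u1 = sc.parallelize(List("c", "c", "p", "m", "t"))
val u2 = sc.parallelize(List("c", "m", "k"))

// union() keeps every element of both inputs, duplicates included.
val merged = u1.union(u2)
val mergedCount = merged.count()

// distinct() afterwards drops the repeated elements.
// collect() brings the result to the driver; sorting makes the printed order stable.
val unique = merged.distinct().collect().sorted

println(mergedCount)
println(unique.mkString(","))
```

With these inputs the union has 8 elements, and the deduplicated result has 5: c, k, m, p, t.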
September 20, 2018 at 9:26 pm #6373 DataFlair Team (Spectator)
intersection() transformation
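- intersection(anotherrdd) returns the elements that are present in both RDDs.
- intersection(anotherrdd) removes all duplicates from its output, including elements that are duplicated within a single RDD.
Example
val is1 = sc.parallelize(List("c","c","p","m","t"))
val is2 = sc.parallelize(List("c","m","k"))
val result = is1.intersection(is2)
// collect() brings the elements to the driver so they are printed in the local console
result.collect().foreach(println)
Output (order may vary):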
m
c
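To make the duplicate-removal behaviour of intersection() visible, here is a small sketch in which "c" appears twice in both inputs. It assumes a local Spark setup; the app name and `local[*]` master are illustrative assumptions, not part of the original posts.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; getOrCreate reuses an existing context (e.g. inside spark-shell).
val conf = new SparkConf().setAppName("intersection-sketch").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

// "c" is duplicated in both RDDs.
val is1 = sc.parallelize(List("c", "c", "p", "m", "t"))
val is2 = sc.parallelize(List("c", "c", "m", "k"))

// intersection() keeps only elements found in both RDDs, one copy of each.
val common = is1.intersection(is2).collect().sorted

println(common.mkString(","))
```

Even though "c" occurs twice on each side, the result holds a single "c" next to "m".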
September 20, 2018 at 9:27 pm #6374 DataFlair Team (Spectator)
subtract() transformation
- subtract(anotherrdd).
- It returns an RDD containing only the values that are present in the first RDD and not in the second RDD.
Example
val s1 = sc.parallelize(List("c","c","p","m","t"))
val s2 = sc.parallelize(List("c","m","k"))
val result = s1.subtract(s2)
// collect() brings the elements to the driver so they are printed in the local console
result.collect().foreach(println)
Output:
t
p
For more transformations in Apache Spark, refer to Transformation and Action
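One detail the example above does not show: unlike intersection(), subtract() does not remove duplicates from the first RDD. A minimal sketch, assuming a local Spark setup (the app name and `local[*]` master are illustrative assumptions), with "p" duplicated in the first input:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; getOrCreate reuses an existing context (e.g. inside spark-shell).
val conf = new SparkConf().setAppName("subtract-sketch").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

// "p" appears twice in the first RDD and never in the second.
val s1 = sc.parallelize(List("c", "c", "p", "p", "t"))
val s2 = sc.parallelize(List("c", "m", "k"))

// subtract() drops every "c" (it exists in s2) but keeps both copies of "p".
val remaining = s1.subtract(s2).collect().sorted

// Chain distinct() if only one copy of each remaining value is wanted.
val remainingUnique = s1.subtract(s2).distinct().collect().sorted

println(remaining.mkString(","))
println(remainingUnique.mkString(","))
```

Here `remaining` holds p, p, t, while `remainingUnique` holds p, t.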
September 20, 2018 at 9:27 pm #6375 DataFlair Team (Spectator)
Adding one more point about distinct() transformation:
distinct() is an expensive operation, because it requires shuffling all the data over the network to ensure that only one copy of each element is kept.
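The shuffle can be made visible with a rough sketch. It compares distinct() with a hand-written map / reduceByKey formulation that gives the same result, and prints the lineage with toDebugString. The local setup, app name, and variable names are assumptions for illustration, and the exact lineage text may differ between Spark versions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; getOrCreate reuses an existing context (e.g. inside spark-shell).
val conf = new SparkConf().setAppName("distinct-shuffle-sketch").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

val d1 = sc.parallelize(List("c", "c", "p", "m", "t"))

// Built-in deduplication.
val viaDistinct = d1.distinct()

// An equivalent formulation: key every element by itself, then keep one
// value per key. reduceByKey is what moves equal elements to the same
// partition, which is the shuffle step.
val viaReduce = d1.map(x => (x, null)).reduceByKey((x, _) => x).map(_._1)

// The lineage of the distinct() result should list a shuffled stage.
val lineage = viaDistinct.toDebugString
println(lineage)

val a = viaDistinct.collect().sorted
val b = viaReduce.collect().sorted
println(a.mkString(","))
println(b.mkString(","))
```

Both variants return c, m, p, t. The cost comes from the reduceByKey-style step, since equal elements from different partitions have to meet on one partition before duplicates can be dropped.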