Explain distinct(), union(), intersection() and subtract() transformations in Spark

This topic has 4 replies, 1 voice, and was last updated 5 years, 6 months ago by DataFlair Team.

- September 20, 2018 at 9:26 pm #6371
  DataFlair Team
  Spectator
  distinct() transformation
  - If one wants only the unique elements of an RDD, one can use d1.distinct(), where d1 is an RDD.
  Example
```
val d1 = sc.parallelize(List("c","c","p","m","t"))
val result = d1.distinct()
result.foreach{println}
```
  Output:
  p
  t
  m
  c
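  The printed order above is not guaranteed, because foreach runs on the partitions in parallel (and on a cluster it prints on the executors, not the driver). A minimal sketch of a repeatable variant follows; the app name and the local[2] master are assumptions for running outside spark-shell, where sc already exists.
```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the spark-shell context if one exists; otherwise starts a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("distinct-sketch").setMaster("local[2]"))

val d1 = sc.parallelize(List("c", "c", "p", "m", "t"))

// collect() brings the result to the driver; sorting makes the order repeatable.
val unique = d1.distinct().collect().sorted
println(unique.mkString(", "))  // c, m, p, t
```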
  
  To learn all the transformation operations with examples, refer to: Spark RDD Operations-Transformation & Action with Example
- September 20, 2018 at 9:26 pm #6372
  DataFlair Team
  Spectator
  union() transformation
  - It is the simplest set operation.
  - rdd1.union(rdd2) returns an RDD that contains the data from both sources.
  - If the input RDDs contain duplicates, the output of union() will contain them as well; this can be fixed by applying distinct() to the result.
  Example
```
val u1 = sc.parallelize(List("c","c","p","m","t"))
val u2 = sc.parallelize(List("c","m","k"))
val result = u1.union(u2)
result.foreach{println}
```
  Output:
  c
  c
  p
  m
  t
  c
  m
  k
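  The distinct() fix mentioned in the last bullet could look roughly like this. It is a sketch only; the app name and the local[2] master are assumptions for running outside spark-shell.
```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the spark-shell context if one exists; otherwise starts a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("union-sketch").setMaster("local[2]"))

val u1 = sc.parallelize(List("c", "c", "p", "m", "t"))
val u2 = sc.parallelize(List("c", "m", "k"))

// union() keeps every element from both inputs, duplicates included.
val merged = u1.union(u2)
println(merged.count())  // 8

// Chaining distinct() removes the duplicates.
val deduped = merged.distinct().collect().sorted
println(deduped.mkString(", "))  // c, k, m, p, t
```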
- September 20, 2018 at 9:26 pm #6373
  DataFlair Team
  Spectator
  intersection() transformation
  - intersection(anotherrdd) returns the elements that are present in both RDDs.
  - intersection(anotherrdd) removes all duplicates, including duplicates within a single RDD.
  Example
```
val is1 = sc.parallelize(List("c","c","p","m","t"))
val is2 = sc.parallelize(List("c","m","k"))
val result = is1.intersection(is2)
result.foreach{println}
```
  Output:
  m
  c
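  To make the duplicate handling concrete, here is a small sketch in which "c" is repeated in both inputs and still appears only once in the result. The RDD names and the local[2] master are assumptions for illustration.
```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the spark-shell context if one exists; otherwise starts a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("intersection-sketch").setMaster("local[2]"))

val left  = sc.parallelize(List("c", "c", "p", "m", "t"))
val right = sc.parallelize(List("c", "c", "c", "m", "k"))

// Each common element is returned once, however often it occurs in either input.
val common = left.intersection(right).collect().sorted
println(common.mkString(", "))  // c, m
```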
- September 20, 2018 at 9:27 pm #6374
  DataFlair Team
  Spectator
  subtract() transformation
  - Usage: rdd1.subtract(anotherrdd).
  - It returns an RDD that contains only the values present in the first RDD and not in the second RDD.
  Example
```
val s1 = sc.parallelize(List("c","c","p","m","t"))
val s2 = sc.parallelize(List("c","m","k"))
val result = s1.subtract(s2)
result.foreach{println}
```
  Output:
  t
  p
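  Two details are easy to miss: unlike intersection(), subtract() does not deduplicate the values it keeps, and the result depends on which RDD is the receiver. The sketch below illustrates both; the extra "p" in the input and the local[2] master are assumptions added for the demonstration.
```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the spark-shell context if one exists; otherwise starts a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("subtract-sketch").setMaster("local[2]"))

val s1 = sc.parallelize(List("c", "c", "p", "p", "m", "t"))
val s2 = sc.parallelize(List("c", "m", "k"))

// Every "c" and "m" is dropped; both copies of "p" survive.
val onlyInFirst = s1.subtract(s2).collect().sorted
println(onlyInFirst.mkString(", "))  // p, p, t

// Swapping the operands gives a different result.
val onlyInSecond = s2.subtract(s1).collect().sorted
println(onlyInSecond.mkString(", "))  // k
```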
  
  For more transformations in Apache Spark, refer to Transformation and Action
- September 20, 2018 at 9:27 pm #6375
  
  DataFlair Team
  Spectator
  
  Adding one more point about distinct() transformation:
  
  The distinct() transformation is an expensive operation, as it requires shuffling all the data over the network to ensure that we receive only one copy of each element.
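
  One way to see the shuffle is to print the RDD lineage. The sketch below also builds a hand-written pipeline that behaves roughly like distinct() for an RDD without a partitioner; it is an illustration, not a claim about Spark's exact internals, and the app name and local[2] master are assumptions.
```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the spark-shell context if one exists; otherwise starts a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("distinct-shuffle-sketch").setMaster("local[2]"))

val d1 = sc.parallelize(List("c", "c", "p", "m", "t"))

// The lineage of distinct() on this RDD includes a shuffled RDD.
val lineage = d1.distinct().toDebugString
println(lineage)

// Hand-written approximation: key by element, keep one value per key, drop the value.
val manual = d1.map(x => (x, 1)).reduceByKey((a, _) => a).map(_._1)
println(manual.collect().sorted.mkString(", "))  // c, m, p, t
```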