What is FlatMap in Apache Spark?

Viewing 1 reply thread
    • #5511
      DataFlair Team
      Spectator

      What is FlatMap transformation operation in Apache Spark?
      What is the need for the FlatMap when we already have Map Operation?
      What processing can be done in the FlatMap in Spark explain with example?

    • #5512
      DataFlair Team
      Spectator

      FlatMap is a transformation operation in Apache Spark that creates a new RDD from an existing RDD. It takes one element from an RDD and can produce 0, 1, or many outputs for it, based on the business logic. It is similar to the Map operation, but Map produces a one-to-one output: if we perform Map on an RDD of length N, the output RDD is also of length N. With FlatMap, the output RDD can have a different length.

      Map Operation       FlatMap Operation
      X ——— A             x ——— a
      Y ——— B             y ——— b, c
      Z ——— C             z ——— d, e, f
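      The 0-, 1-, or many-outputs behaviour in the diagram can be sketched on plain Scala collections, without a SparkContext (the letters here are just the illustrative values from the diagram, not real data):

      ```scala
      object FlatMapShapes {
        def main(args: Array[String]): Unit = {
          // Map: exactly one output per input element (X -> A, Y -> B, Z -> C)
          val mapped = List("X", "Y", "Z").map(_.toLowerCase)

          // FlatMap: each element may emit 0, 1, or many outputs,
          // mirroring x -> a, y -> (b, c), z -> (d, e, f) above
          val flatMapped = List("x", "y", "z").flatMap {
            case "x" => List("a")
            case "y" => List("b", "c")
            case _   => List("d", "e", "f")
          }

          println(mapped)     // 3 elements in, 3 elements out
          println(flatMapped) // 3 elements in, 6 elements out
        }
      }
      ```

      RDD.map and RDD.flatMap follow the same semantics; the only difference is that Spark distributes the work across partitions.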

      We can also say that flatMap transforms an RDD of length N into N collections, then flattens them into a single RDD of results.

      If we observe the example below, the data1 RDD, which is the output of the Map operation, has the same number of elements as the data RDD, but the data2 RDD does not. We can also see that data2 is the flattened output of data1.

      pranshu@pranshu-virtual-machine:~$ cat pk.txt
      1 2 3 4
      5 6 7 8 9
      10 11 12
      13 14 15 16 17
      18 19 20

      scala> val data = sc.textFile("/home/pranshu/pk.txt")
      17/05/17 07:08:20 WARN SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
      data: org.apache.spark.rdd.RDD[String] = /home/pranshu/pk.txt MapPartitionsRDD[1] at textFile at <console>:24

      scala> data.collect
      res0: Array[String] = Array(1 2 3 4, 5 6 7 8 9, 10 11 12, 13 14 15 16 17, 18 19 20)

      scala>

      scala> val data1 = data.map(line => line.split(" "))
      data1: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[2] at map at <console>:26

      scala>

      scala> val data2 = data.flatMap(line => line.split(" "))
      data2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at flatMap at <console>:26

      scala>

      scala> data1.collect
      res1: Array[Array[String]] = Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8, 9), Array(10, 11, 12), Array(13, 14, 15, 16, 17), Array(18, 19, 20))

      scala>

      scala> data2.collect
      res2: Array[String] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
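      The "map, then flatten" description can be checked directly on plain Scala collections, using the same split logic as the shell session above (a local sketch, no SparkContext required):

      ```scala
      object MapThenFlatten {
        def main(args: Array[String]): Unit = {
          val lines = List("1 2 3 4", "5 6 7 8 9", "10 11 12")

          // map produces one Array per line: a collection of collections
          val nested: List[Array[String]] = lines.map(_.split(" "))

          // flatMap produces the flattened result directly
          val flat: List[String] = lines.flatMap(_.split(" "))

          // flatMap is equivalent to map followed by flatten
          assert(nested.map(_.toList).flatten == flat)
          println(flat) // List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
        }
      }
      ```

      This is exactly why data2.collect above has 20 flat elements while data1.collect has 5 nested arrays.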

      For more details, refer: http://data-flair.training/blogs/map-vs-flatmap-operation-in-apache-spark/

      For shell commands to process the data, refer: http://data-flair.training/blogs/apache-spark-shell-commands-beginners-tutorial/
