What is FlatMap in Apache Spark?
September 20, 2018 at 3:29 pm #5511 | DataFlair Team (Spectator)
What is the FlatMap transformation operation in Apache Spark?
Why do we need FlatMap when we already have the Map operation?
What processing can be done with FlatMap in Spark? Explain with an example.
September 20, 2018 at 3:29 pm #5512 | DataFlair Team (Spectator)
FlatMap is a transformation operation in Apache Spark that creates a new RDD from an existing RDD. It takes one element from an RDD and can produce 0, 1, or many outputs based on business logic. It is similar to the Map operation, but Map produces one-to-one output: if we perform a Map operation on an RDD of length N, the output RDD will also be of length N. With FlatMap, the output RDD can be of a different length, depending on the business logic.
Map Operation          FlatMap Operation
X ----> A              x ----> a
Y ----> B              y ----> b, c
Z ----> C              z ----> d, e, f
We can also say that flatMap transforms an RDD of length N into N collections, then flattens them into a single RDD of results.
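As a minimal sketch of this 0/1/many behaviour (the values and variable names here are hypothetical, assuming an active SparkContext sc as in the spark-shell):

scala> val nums = sc.parallelize(Seq(1, 2, 3))
scala> // map: exactly one output per input, so output length equals input length
scala> nums.map(n => n * 10).collect          // Array(10, 20, 30)
scala> // flatMap: each input may emit 0, 1, or many elements, so the length can change
scala> nums.flatMap {
     |   case 1 => Seq.empty[Int]             // 0 outputs
     |   case 2 => Seq(2)                     // 1 output
     |   case n => Seq(n, n * 10, n * 100)    // many outputs
     | }.collect                              // Array(2, 3, 30, 300)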
In the example below, the data1 RDD, which is the output of a Map operation, has the same number of elements as the data RDD, but the data2 RDD does not. We can also observe that data2 is the flattened output of data1.

pranshu@pranshu-virtual-machine:~$ cat pk.txt
1 2 3 4
5 6 7 8 9
10 11 12
13 14 15 16 17
18 19 20

scala> val data = sc.textFile("/home/pranshu/pk.txt")
17/05/17 07:08:20 WARN SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
data: org.apache.spark.rdd.RDD[String] = /home/pranshu/pk.txt MapPartitionsRDD[1] at textFile at <console>:24

scala> data.collect
res0: Array[String] = Array(1 2 3 4, 5 6 7 8 9, 10 11 12, 13 14 15 16 17, 18 19 20)
scala> val data1 = data.map(line => line.split(" "))
data1: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[2] at map at <console>:26

scala> val data2 = data.flatMap(line => line.split(" "))
data2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at flatMap at <console>:26
scala> data1.collect
res1: Array[Array[String]] = Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8, 9), Array(10, 11, 12), Array(13, 14, 15, 16, 17), Array(18, 19, 20))
scala> data2.collect
res2: Array[String] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)

For more details, refer: http://data-flair.training/blogs/map-vs-flatmap-operation-in-apache-spark/
For shell commands to process the data, refer: http://data-flair.training/blogs/apache-spark-shell-commands-beginners-tutorial/
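As a usage sketch building on the session above (a hypothetical continuation, not part of the original output), the classic word-count pattern shows why flatMap matters: map alone would leave an RDD of arrays, while flatMap yields a flat RDD of tokens that reduceByKey can aggregate:

scala> // count occurrences of each token; in pk.txt every number appears once, so each count would be 1
scala> val counts = data.flatMap(line => line.split(" ")).map(token => (token, 1)).reduceByKey(_ + _)
scala> counts.collect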