What is FlatMap in Apache Spark?

Viewing 1 reply thread
    • #5511
      DataFlair Team
      Spectator

      What is FlatMap transformation operation in Apache Spark?
      What is the need for the FlatMap when we already have Map Operation?
      What processing can be done in the FlatMap in Spark explain with example?

    • #5512
      DataFlair Team
      Spectator

      FlatMap is a transformation operation in Apache Spark that creates a new RDD from an existing RDD. It takes one element from an RDD and can produce 0, 1, or many outputs for it, based on the business logic. It is similar to the Map operation, but Map produces a one-to-one output: if we perform Map on an RDD of length N, the output RDD is also of length N. With FlatMap, the output RDD can have a different length.

      Map Operation       FlatMap Operation
      X ——— A             x ——— a
      Y ——— B             y ——— b, c
      Z ——— C             z ——— d, e, f
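      The 0-, 1-, or many-outputs behaviour in the diagram can be sketched on plain Scala collections, without a SparkContext (the letters here are just the illustrative values from the diagram, not real data):

      ```scala
      object FlatMapShapes {
        def main(args: Array[String]): Unit = {
          // Map: exactly one output per input element (X -> A, Y -> B, Z -> C)
          val mapped = List("X", "Y", "Z").map(_.toLowerCase)

          // FlatMap: each element may emit 0, 1, or many outputs,
          // mirroring x -> a, y -> (b, c), z -> (d, e, f) above
          val flatMapped = List("x", "y", "z").flatMap {
            case "x" => List("a")
            case "y" => List("b", "c")
            case _   => List("d", "e", "f")
          }

          println(mapped)     // 3 elements in, 3 elements out
          println(flatMapped) // 3 elements in, 6 elements out
        }
      }
      ```

      RDD.map and RDD.flatMap follow the same semantics; the only difference is that Spark distributes the work across partitions.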

      We can also say that flatMap transforms an RDD of length N into N collections, then flattens them into a single RDD of results.

      If we observe the example below, the data1 RDD, which is the output of the Map operation, has the same number of elements as the data RDD, but the data2 RDD does not. We can also see that data2 is the flattened output of data1.

      pranshu@pranshu-virtual-machine:~$ cat pk.txt
      1 2 3 4
      5 6 7 8 9
      10 11 12
      13 14 15 16 17
      18 19 20

      scala> val data = sc.textFile("/home/pranshu/pk.txt")
      17/05/17 07:08:20 WARN SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
      data: org.apache.spark.rdd.RDD[String] = /home/pranshu/pk.txt MapPartitionsRDD[1] at textFile at <console>:24

      scala> data.collect
      res0: Array[String] = Array(1 2 3 4, 5 6 7 8 9, 10 11 12, 13 14 15 16 17, 18 19 20)

      scala>

      scala> val data1 = data.map(line => line.split(" "))
      data1: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[2] at map at <console>:26

      scala>

      scala> val data2 = data.flatMap(line => line.split(" "))
      data2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at flatMap at <console>:26

      scala>

      scala> data1.collect
      res1: Array[Array[String]] = Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8, 9), Array(10, 11, 12), Array(13, 14, 15, 16, 17), Array(18, 19, 20))

      scala>

      scala> data2.collect
      res2: Array[String] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
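      The "map, then flatten" description can be checked directly on plain Scala collections, using the same split logic as the shell session above (a local sketch, no SparkContext required):

      ```scala
      object MapThenFlatten {
        def main(args: Array[String]): Unit = {
          val lines = List("1 2 3 4", "5 6 7 8 9", "10 11 12")

          // map produces one Array per line: a collection of collections
          val nested: List[Array[String]] = lines.map(_.split(" "))

          // flatMap produces the flattened result directly
          val flat: List[String] = lines.flatMap(_.split(" "))

          // flatMap is equivalent to map followed by flatten
          assert(nested.map(_.toList).flatten == flat)
          println(flat) // List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
        }
      }
      ```

      This is exactly why data2.collect above has 20 flat elements while data1.collect has 5 nested arrays.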

      For more details, refer: http://data-flair.training/blogs/map-vs-flatmap-operation-in-apache-spark/

      For shell commands to process the data, refer: http://data-flair.training/blogs/apache-spark-shell-commands-beginners-tutorial/
