Adding a few more points on the lineage graph:
You can check the lineage of an RDD using toDebugString, for example rdd0.toDebugString. It returns the lineage graph from the current RDD back through all of the RDDs it depends on; see the example below. Whenever you see the "+-" symbol in the toDebugString output, it marks a shuffle boundary: the RDDs listed under it are computed in a separate, earlier stage. This makes it easy to identify how many stages will be created.
scala> val rdd0 = sc.parallelize(List("Ashok Vengala","Ashok Vengala","DataFlair"))
rdd0: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[10] at parallelize at <console>:31
scala> val count = rdd0.flatMap(rec => rec.split(" ")).map(word => (word,1)).reduceByKey(_+_)
count: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[13] at reduceByKey at <console>:33
scala> count.toDebugString
res24: String =
(2) ShuffledRDD[13] at reduceByKey at <console>:33 []
 +-(2) MapPartitionsRDD[12] at map at <console>:33 []
    |  MapPartitionsRDD[11] at flatMap at <console>:33 []
    |  ParallelCollectionRDD[10] at parallelize at <console>:31 []
Reading from bottom to top: the last three rows (parallelize, flatMap and map) will be performed in stage 0, and the first row (ShuffledRDD, from reduceByKey) will be performed in stage 1.
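To make the "+-" rule concrete, here is a minimal sketch in plain Scala that counts the shuffle markers in a toDebugString output and derives a rough stage count from them. The object name LineageStages and its methods are hypothetical helpers written for this post, not part of the Spark API, and the "boundaries + 1" estimate is only a rule of thumb for a simple lineage like the one above.

```scala
object LineageStages {
  // Sample text copied from the toDebugString output shown above.
  val sample: String = Seq(
    "(2) ShuffledRDD[13] at reduceByKey at <console>:33 []",
    " +-(2) MapPartitionsRDD[12] at map at <console>:33 []",
    "    |  MapPartitionsRDD[11] at flatMap at <console>:33 []",
    "    |  ParallelCollectionRDD[10] at parallelize at <console>:31 []"
  ).mkString("\n")

  // Hypothetical helper: counts the lines that start with "+-"
  // (ignoring leading whitespace), i.e. the shuffle boundaries.
  def shuffleBoundaries(debugString: String): Int =
    debugString.split("\n").count(_.trim.startsWith("+-"))

  // Rule of thumb (an assumption): one stage per shuffle boundary,
  // plus the final stage that produces the result.
  def estimatedStages(debugString: String): Int =
    shuffleBoundaries(debugString) + 1

  def main(args: Array[String]): Unit = {
    println(s"shuffle boundaries: ${shuffleBoundaries(sample)}") // 1
    println(s"estimated stages: ${estimatedStages(sample)}")     // 2
  }
}
```

In spark-shell you could pass the real output instead of the sample text, for example LineageStages.estimatedStages(count.toDebugString). The Spark UI remains the reliable place to confirm the actual stages once an action has run.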
In the toDebugString output, we see names such as ParallelCollectionRDD, MapPartitionsRDD and ShuffledRDD. These are all implementations of the abstract RDD class.