RDD lineage in Spark: ToDebugString Method
In Spark, all the dependencies between RDDs are recorded in a graph, rather than the actual data. This is what we call the lineage graph. This document covers the concept of RDD lineage and the logical execution plan in Spark. Moreover, we will see in detail how to get the RDD lineage graph with the toDebugString method. Before all that, let's briefly review Spark RDDs.
2. Introduction to Spark RDD
RDD stands for "Resilient Distributed Dataset". It is the fundamental data structure of Apache Spark. To be very specific, an RDD is an immutable collection of objects that is split into partitions, so that it can be computed on different nodes of the cluster.
On decomposing the name of Spark RDD:
Resilient: This means fault-tolerant. By using the RDD lineage graph (DAG), we can recompute partitions that are missing or damaged due to node failures.
Distributed: It means the data resides on multiple nodes.
Dataset: It is the collection of records you work with. A user can load the dataset from an external source, for example a JSON file, CSV file, text file, or a database via JDBC, with no specific data structure required.
3. Introduction to RDD Lineage
Evaluation of an RDD is lazy in nature. That is, when a series of transformations is applied to an RDD, none of them is evaluated immediately. Computation starts only when an action is called.
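To make the lazy behavior concrete, here is a minimal sketch that counts how often a map function runs before and after an action. The object name, the local[2] master, and the accumulator are illustrative assumptions, not part of the original tutorial.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvaluationSketch {
  // Returns (map calls before the action, elements counted, map calls after the action).
  def observe(sc: SparkContext): (Long, Long, Long) = {
    // Accumulator used only to observe when the map function actually runs.
    val calls = sc.longAccumulator("map calls")

    val numbers = sc.parallelize(0 to 9)
    // Transformation only: this records lineage, it does not run the function.
    val doubled = numbers.map { n => calls.add(1); n * 2 }
    val before = calls.sum

    // Action: this is what triggers the computation.
    val counted = doubled.count()
    (before, counted, calls.sum)
  }

  def main(args: Array[String]): Unit = {
    // App name and master are hypothetical choices for a local run.
    val conf = new SparkConf().setAppName("lazy-evaluation-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (before, counted, after) = observe(sc)
      println(s"map calls before action: $before")
      println(s"elements counted: $counted")
      println(s"map calls after action: $after")
    } finally {
      sc.stop()
    }
  }
}
```

Before the action the counter should still be zero, since map only extends the lineage. After count it should equal the number of elements, provided no task was retried.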
When we create a new RDD from an existing one, the new RDD carries a pointer to its parent RDD. In this way all the dependencies between RDDs are recorded in a graph, rather than the actual data. This is what we call the lineage graph.
RDD lineage is the graph of all the parent RDDs of an RDD. We also call it the RDD operator graph or RDD dependency graph. To be very specific, it is the result of applying transformations to RDDs, and it forms the logical execution plan.
The physical execution plan, or execution DAG, on the other hand, is known as the DAG of stages.
Let's start with an example of Spark RDD lineage that uses cartesian and zip. We could equally use other operators to build an RDD graph in Spark.
The above figure depicts an RDD graph, which is the result of the following series of transformations:
val r00 = sc.parallelize(0 to 9)
val r01 = sc.parallelize(0 to 90 by 10)
val r10 = r00 cartesian r01
val r11 = r00.map(n => (n, n))
val r12 = r00 zip r01
val r13 = r01.keyBy(_ / 20)
val r20 = Seq(r11, r12, r13).foldLeft(r10)(_ union _)
The lineage graph records which transformations need to be executed once an action has been called.
In other words, whenever we create new RDDs on the basis of existing RDDs, Spark manages these dependencies using the lineage graph. Each RDD maintains a pointer to one or more parents, along with metadata about what type of relationship it has with each parent.
For example, suppose we derive an RDD with val b = a.map(...). Then RDD b keeps a reference to its parent RDD a. That is a simple form of RDD lineage.
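The parent pointer and the relationship type can be inspected through the public dependencies method of an RDD. The following is a small sketch; the object name, the describe helper, and the sample data are assumptions made for illustration.

```scala
import org.apache.spark.{Dependency, NarrowDependency, OneToOneDependency, ShuffleDependency}
import org.apache.spark.{SparkConf, SparkContext}

object ParentPointerSketch {
  // Hypothetical helper: names the kind of relationship a dependency represents.
  def describe(dep: Dependency[_]): String = dep match {
    case _: OneToOneDependency[_]      => "one-to-one (narrow)"
    case _: NarrowDependency[_]        => "narrow"
    case _: ShuffleDependency[_, _, _] => "shuffle (wide)"
    case _                             => "other"
  }

  def main(args: Array[String]): Unit = {
    // App name and master are illustrative choices for a local run.
    val conf = new SparkConf().setAppName("parent-pointer-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val a = sc.parallelize(1 to 10)
      val b = a.map(_ + 1)                              // child of a
      val c = b.map(n => (n % 3, n)).reduceByKey(_ + _) // requires a shuffle

      // b's single dependency points back to the very same object a.
      println(s"b's parent is a: ${b.dependencies.head.rdd eq a}")
      println(s"b -> parent: ${b.dependencies.map(describe).mkString(", ")}")
      println(s"c -> parent: ${c.dependencies.map(describe).mkString(", ")}")
    } finally {
      sc.stop()
    }
  }
}
```

Here map yields a narrow, one-to-one dependency, while reduceByKey introduces a shuffle dependency. This distinction is the metadata that each RDD keeps about its parents.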
4. Logical Execution Plan for RDD Lineage
The logical execution plan starts with the earliest RDDs. Earliest RDDs are those that do not depend on any other RDD, or that reference cached data. It ends with the RDD that produces the result of the action that has been called to execute.
We can also say that it is a DAG that is executed when SparkContext is requested to run a Spark job.
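As a rough illustration of where a logical plan starts, the sketch below walks the dependencies of an RDD back to the RDDs that have no parents. The roots helper and the sample data are hypothetical, and the sketch ignores cached data for simplicity.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object LineageRootsSketch {
  // Hypothetical helper: collects the earliest RDDs, i.e. those without dependencies.
  def roots(rdd: RDD[_]): Seq[RDD[_]] =
    if (rdd.dependencies.isEmpty) Seq(rdd)
    else rdd.dependencies.flatMap(dep => roots(dep.rdd)).distinct

  def main(args: Array[String]): Unit = {
    // App name and master are illustrative choices for a local run.
    val conf = new SparkConf().setAppName("lineage-roots-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val left = sc.parallelize(Seq("a b", "b c"))
      val right = sc.parallelize(Seq("c d"))
      // The plan ends with this RDD once an action is called on it.
      val counts = left.union(right).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

      // Both parallelized collections should show up as starting points.
      roots(counts).foreach(r => println(s"root: $r"))
    } finally {
      sc.stop()
    }
  }
}
```

The plan for counts starts at the two parallelized collections and ends at the RDD returned by reduceByKey.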
5. ToDebugString Method to get RDD Lineage Graph in Spark
There are several ways to get the RDD lineage graph in Spark, and one of them is the toDebugString method.
With the help of this method, we can inspect the lineage graph of a Spark RDD.
scala> val wordCount1 = sc.textFile("README.md").flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
wordCount1: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD at reduceByKey at <console>:24
scala> wordCount1.toDebugString
res13: String =
(2) ShuffledRDD at reduceByKey at <console>:24
 +-(2) MapPartitionsRDD at map at <console>:24
    |  MapPartitionsRDD at flatMap at <console>:24
    |  README.md MapPartitionsRDD at textFile at <console>:24
    |  README.md HadoopRDD at textFile at <console>:24
Here, the toDebugString method uses indentation to indicate a shuffle boundary.
The numbers in round brackets show the level of parallelism at each stage, for example (2) in the above output. This matches the number of partitions of the RDD:
scala> wordCount1.getNumPartitions
res14: Int = 2
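Outside the shell, the same inspection can be done from a small program. The sketch below assumes an in-memory sample instead of README.md and fixes the number of partitions at 2, so that the bracketed number is predictable. The object and method names are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object DebugStringSketch {
  // Builds a small word count whose lineage contains exactly one shuffle.
  def wordCount(sc: SparkContext): RDD[(String, Int)] =
    sc.parallelize(Seq("to be or", "not to be"), 2) // 2 input partitions
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _, 2)                        // 2 partitions after the shuffle

  def main(args: Array[String]): Unit = {
    // App name and master are illustrative choices for a local run.
    val conf = new SparkConf().setAppName("debug-string-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val counts = wordCount(sc)
      // Lineage as text: indentation marks the shuffle boundary,
      // and the number in round brackets is the partition count.
      println(counts.toDebugString)
      println(s"partitions: ${counts.getNumPartitions}")
    } finally {
      sc.stop()
    }
  }
}
```

The first line of the printed lineage should start with (2), in line with the value returned by getNumPartitions.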
When the spark.logLineage property is enabled, the output of toDebugString is also written to the log whenever an action is executed.
$ ./bin/spark-shell --conf spark.logLineage=true
scala> sc.textFile("README.md", 4).count
15/10/17 14:46:42 INFO SparkContext: Starting job: count at <console>:25
15/10/17 14:46:42 INFO SparkContext: RDD's recursive dependencies:
(4) MapPartitionsRDD at textFile at <console>:25
 |  README.md HadoopRDD at textFile at <console>:25
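The same property can presumably also be set in code before the SparkContext is created. This is a minimal sketch under that assumption. The object name and sample data are hypothetical, and whether the lineage lines are visible depends on the logging configuration, so the log level is raised to INFO here.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LogLineageSketch {
  // Builds a local configuration with lineage logging switched on.
  def lineageConf(): SparkConf =
    new SparkConf()
      .setAppName("log-lineage-sketch") // illustrative app name
      .setMaster("local[2]")            // illustrative master for a local run
      .set("spark.logLineage", "true")

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(lineageConf())
    try {
      // The lineage is logged at INFO level, so make sure INFO messages are shown.
      sc.setLogLevel("INFO")
      // Running an action should log the RDD's recursive dependencies.
      val n = sc.parallelize(0 to 9, 4).count()
      println(s"count: $n")
    } finally {
      sc.stop()
    }
  }
}
```

Setting the property in SparkConf is equivalent in intent to passing --conf spark.logLineage=true on the command line. It has to happen before the context is created.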
So, this was all about the Spark RDD lineage tutorial. We hope you like our explanation.
In this blog, we have learned the actual meaning of the Apache Spark RDD lineage graph. We have also had a taste of the logical execution plan in Apache Spark, and we have seen the toDebugString method in detail. With that, we have covered the concept of the lineage graph in Apache Spark RDD.