What is lineage graph in Apache Spark?

This topic has 2 replies, 1 voice, and was last updated 5 years, 7 months ago by DataFlair Team.

Author

Posts
- September 20, 2018 at 2:00 pm #5047
  
  DataFlair Team
  Spectator
  
  Explain Apache Spark Lineage Graph.
  What is RDD Lineage in Spark?
- September 20, 2018 at 2:00 pm #5048
  
  DataFlair Team
  Spectator
  
  When we apply transformations to an RDD, Spark builds an RDD lineage graph. Each transformation produces a new RDD from already existing RDDs, and the lineage graph records the dependencies between the existing RDDs and the newly formed one. Spark needs the lineage graph when it computes a new RDD and when it has to recover lost data, for example partitions of a persisted RDD that were lost.
  
  For more information, study the DAG in Spark.
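
  To make the dependency idea above concrete, here is a minimal sketch that walks one step back through the lineage of a few RDDs. The object name, the helper parentsOf, the sample numbers, and the local[2] master are all assumptions made for illustration; calling unpersist only stands in for losing cached data, it is not how Spark itself detects or handles a failure.

  ```scala
  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.rdd.RDD

  object LineageParents {
    // Hypothetical helper: the direct parents recorded in an RDD's lineage.
    def parentsOf(rdd: RDD[_]): Seq[RDD[_]] = rdd.dependencies.map(_.rdd)

    def main(args: Array[String]): Unit = {
      // Assumption: a local master with two threads, for illustration only.
      val conf = new SparkConf().setAppName("lineage-parents").setMaster("local[2]")
      val sc = new SparkContext(conf)
      try {
        val numbers = sc.parallelize(Seq(1, 2, 3, 4))
        val doubled = numbers.map(_ * 2)
        val large = doubled.filter(_ > 4)

        // Each transformation yields a new RDD that points back at its parent.
        println(parentsOf(large).head eq doubled)
        println(parentsOf(doubled).head eq numbers)
        // An RDD created with parallelize has no parents.
        println(parentsOf(numbers).isEmpty)

        // Cache the result, materialize it, then drop the cached blocks.
        large.persist()
        println(large.collect().toList)
        large.unpersist()
        // With the cache gone, Spark recomputes the data by replaying the lineage.
        println(large.collect().toList)
      } finally {
        sc.stop()
      }
    }
  }
  ```

  Both collect calls should print the same elements, because the second one rebuilds them from parallelize, map, and filter rather than from the cache.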
- September 20, 2018 at 2:00 pm #5049
  
  DataFlair Team
  Spectator
  
  Adding a few more points on the lineage graph:
  You can inspect an RDD's lineage with rdd0.toDebugString. It returns the lineage graph from the current RDD back through all the RDDs it depends on. See below. Whenever you see the "+-" symbol in the toDebugString output, it marks a shuffle boundary: the RDDs listed beneath it are computed in a separate, earlier stage. Counting these symbols tells you how many stages are created.
  
  scala> val rdd0 = sc.parallelize(List("Ashok Vengala","Ashok Vengala","DataFlair"))
  rdd0: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[10] at parallelize at <console>:31
  
  scala> val count = rdd0.flatMap(rec => rec.split(" ")).map(word => (word,1)).reduceByKey(_+_)
  count: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[13] at reduceByKey at <console>:33
  
  scala> count.toDebugString
  res24: String =
  (2) ShuffledRDD[13] at reduceByKey at <console>:33 []
   +-(2) MapPartitionsRDD[12] at map at <console>:33 []
      |  MapPartitionsRDD[11] at flatMap at <console>:33 []
      |  ParallelCollectionRDD[10] at parallelize at <console>:31 []
  
  Reading from bottom to top, the last three rows (parallelize, flatMap, and map) are performed in stage 0. The first row (ShuffledRDD) is performed in stage 1.
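
  The stage reading above can also be checked in code rather than by eye. The sketch below is an illustration, not Spark's own scheduler logic: it walks the lineage and counts shuffle dependencies, on the assumption that a simple linear lineage like this one runs as one stage more than it has shuffles. The object name, the helper countShuffles, and the local[2] master are hypothetical, and a lineage that reuses the same parent along several branches would be counted more than once.

  ```scala
  import org.apache.spark.{ShuffleDependency, SparkConf, SparkContext}
  import org.apache.spark.rdd.RDD

  object LineageStages {
    // Hypothetical helper: counts the shuffle dependencies reachable from rdd.
    // Assumes a linear lineage; shared parents would be counted repeatedly.
    def countShuffles(rdd: RDD[_]): Int =
      rdd.dependencies.map { dep =>
        val here = dep match {
          case _: ShuffleDependency[_, _, _] => 1
          case _ => 0
        }
        here + countShuffles(dep.rdd)
      }.sum

    def main(args: Array[String]): Unit = {
      // Assumption: a local master with two threads, for illustration only.
      val conf = new SparkConf().setAppName("lineage-stages").setMaster("local[2]")
      val sc = new SparkContext(conf)
      try {
        val rdd0 = sc.parallelize(List("Ashok Vengala", "Ashok Vengala", "DataFlair"))
        val count = rdd0.flatMap(rec => rec.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

        val shuffles = countShuffles(count)
        // One shuffle (from reduceByKey) splits this lineage into two stages.
        println(s"shuffles = $shuffles, stages = ${shuffles + 1}")
      } finally {
        sc.stop()
      }
    }
  }
  ```

  For the word count above, the only shuffle dependency comes from reduceByKey, which lines up with the single "+-" in the toDebugString output.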
  
  The toDebugString output shows names such as ParallelCollectionRDD, MapPartitionsRDD, and ShuffledRDD. These are all implementations of the abstract class RDD.