What is lineage in Apache Spark?
September 20, 2018 at 3:54 pm #5664 DataFlair Team (Spectator)
What is lineage in Spark? How does it make Spark fault tolerant?
September 20, 2018 at 3:54 pm #5668 DataFlair Team (Spectator)
Whenever a series of transformations is performed on an RDD, the transformations are not evaluated immediately, but lazily.
When a new RDD is created from an existing RDD, the new RDD contains a pointer to its parent RDD. In this way, the dependencies between the RDDs are recorded in a graph, rather than the actual data. This graph is called the lineage graph.
For example, consider the operations below:
1. Create a new RDD from a text file – first RDD
2. Apply a map operation on the first RDD to get the second RDD
3. Apply a filter operation on the second RDD to get the third RDD
4. Apply a count operation on the third RDD. Note that count is an action, not a transformation: it returns a number to the driver rather than a fourth RDD, and it is what triggers the actual evaluation.
The lineage graph of these operations looks like:
First RDD -> Second RDD (applying map) -> Third RDD (applying filter) -> result of count (action, not an RDD)
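The chain above can be sketched with a small toy model in plain Python. This is only an illustration of the idea (parent pointers plus lazy evaluation), not how Spark is implemented; `ToyRDD` and all of its methods are hypothetical names made up for this sketch.

```python
# Toy model of RDD lineage. ToyRDD is a hypothetical class, NOT part of Spark.
class ToyRDD:
    def __init__(self, name, parent=None, op=None, source=None):
        self.name = name
        self.parent = parent  # pointer to the parent RDD (None for the root)
        self.op = op          # function applied to the parent's data
        self.source = source  # input data, only set on the root

    def map(self, f):
        # Transformation: records the step, computes nothing yet.
        return ToyRDD(f"map({self.name})", parent=self,
                      op=lambda data: [f(x) for x in data])

    def filter(self, pred):
        # Transformation: records the step, computes nothing yet.
        return ToyRDD(f"filter({self.name})", parent=self,
                      op=lambda data: [x for x in data if pred(x)])

    def lineage(self):
        # Walk the parent pointers back to the root, then reverse.
        node, chain = self, []
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return list(reversed(chain))

    def _compute(self):
        # Evaluate by replaying the recorded steps from the root.
        if self.parent is None:
            return list(self.source)
        return self.op(self.parent._compute())

    def count(self):
        # Action: triggers evaluation and returns a number, not a ToyRDD.
        return len(self._compute())


lines = ["spark", "", "lineage", "rdd"]        # stands in for a text file
first = ToyRDD("textFile", source=lines)       # first RDD
second = first.map(len)                        # second RDD
third = second.filter(lambda n: n > 0)         # third RDD
print(third.lineage())
print(third.count())
```

In real Spark, the lineage of an RDD can be inspected with the `toDebugString` method on the RDD.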
This lineage graph is useful if any of the partitions are lost. Spark can replay the transformations for the lost partition, using the lineage graph held in the DAG (Directed Acyclic Graph), to recompute the same data. This is how Spark achieves fault tolerance without replicating the data across different nodes, as HDFS does.
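The recovery idea can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not Spark's recovery code: the partition contents, the `lineage` list, and the helper functions are all hypothetical, and it only covers per-partition steps such as map and filter, where a lost partition depends on a single parent partition.

```python
# Toy illustration of lineage-based recovery; all names here are hypothetical.

# Stands in for durable input (e.g. a file), split into two partitions.
source_partitions = {0: [1, 2, 3], 1: [4, 5, 6]}

# The lineage: an ordered list of per-partition transformations.
lineage = [
    lambda part: [x * 10 for x in part],        # map step
    lambda part: [x for x in part if x > 20],   # filter step
]

def compute_partition(pid):
    # Replay every recorded step on one partition of the source data.
    data = source_partitions[pid]
    for step in lineage:
        data = step(data)
    return data

# Computed results held in memory, one entry per partition.
cache = {pid: compute_partition(pid) for pid in source_partitions}

# Simulate losing the node that held partition 1.
del cache[1]

def get_partition(pid):
    # Only the missing partition is recomputed; partition 0 is untouched.
    if pid not in cache:
        cache[pid] = compute_partition(pid)
    return cache[pid]

print(get_partition(1))
```

Only the lost partition is rebuilt from the source data and the recorded steps, so no second copy of the computed data has to be stored.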
For more information on the Directed Acyclic Graph, refer to: DAG in Apache Spark