What is lineage in Apache Spark?

This topic has 1 reply, 1 voice, and was last updated 5 years, 7 months ago by DataFlair Team.

Viewing 1 reply thread

Author

Posts
- September 20, 2018 at 3:54 pm #5664
  
  DataFlair Team
  Spectator
  
  What is lineage in Spark? How does it make Spark fault tolerant?
- September 20, 2018 at 3:54 pm #5668
  
  DataFlair Team
  Spectator
  
  When a series of transformations is applied to an RDD, they are not evaluated immediately but lazily: Spark runs them only when an action needs the result.
  
  When a new RDD is created from an existing RDD, the new RDD holds a pointer to its parent RDD. In the same way, all the dependencies between the RDDs are recorded in a graph, rather than the actual data. This graph is called the lineage graph.
  
  For example, consider the following operations:
  
  1. Create a new RDD from a text file (first RDD)
  2. Apply a map operation on the first RDD to get the second RDD
  3. Apply a filter operation on the second RDD to get the third RDD
  4. Apply a count operation on the third RDD. Since count is an action, not a transformation, it returns a number rather than a fourth RDD, and it is what triggers evaluation of the earlier steps
  
  The lineage graph of these operations looks like:
  
  First RDD -> Second RDD (applying map) -> Third RDD (applying filter) -> count (action, returns a number)
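The steps above can be sketched in plain Python to show how lazy evaluation and parent pointers fit together. This is only a toy model, not Spark's implementation: `ToyRDD`, `to_upper`, and the sample lines are hypothetical names made up for illustration, and the data lives in a single list instead of distributed partitions.

```python
# Toy model of lazy RDDs. ToyRDD is a hypothetical class, not part of Spark.
class ToyRDD:
    def __init__(self, name, parent=None, compute=None, source=None):
        self.name = name          # label used when printing the lineage
        self.parent = parent      # pointer to the parent RDD (None for the source)
        self._compute = compute   # function applied to the parent's records
        self._source = source     # records, only set on the source RDD

    def map(self, fn):
        # Transformation: only records a new node, computes nothing yet.
        return ToyRDD("map", parent=self,
                      compute=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        # Transformation: only records a new node, computes nothing yet.
        return ToyRDD("filter", parent=self,
                      compute=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        # Walk back through the parents and apply each recorded step.
        if self.parent is None:
            return list(self._source)
        return self._compute(self.parent.collect())

    def count(self):
        # Action: this is the point where evaluation actually happens.
        return len(self.collect())

    def lineage(self):
        # Follow the parent pointers from this RDD back to the source.
        chain = []
        node = self
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return " <- ".join(chain)


calls = []  # records every line that to_upper actually processes

def to_upper(line):
    calls.append(line)
    return line.upper()

first = ToyRDD("textFile", source=["spark", "", "lineage"])  # stands in for a text file
second = first.map(to_upper)
third = second.filter(lambda line: line != "")

evaluated_before_action = list(calls)  # still empty: nothing has run yet
print(third.lineage())                 # filter <- map <- textFile
total = third.count()                  # the action triggers the whole chain
print(total)                           # 2
```

Until `count()` is called, `calls` stays empty, which is the lazy behaviour described above. The chain printed by `lineage()` is the toy counterpart of the lineage graph.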
  
  This lineage graph is useful when any of the partitions is lost. Spark can replay the transformations for that partition, using the lineage graph held in the DAG (Directed Acyclic Graph), to arrive at the same result, rather than replicating the data across different nodes as HDFS does.
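The recovery idea can be illustrated with a small plain-Python sketch. It is a simplified model under stated assumptions, not how Spark is implemented: the names `source_partitions`, `lineage_steps`, and `compute_partition` are hypothetical, the input is assumed to still be readable, and every step depends on only one parent partition.

```python
# Toy illustration of lineage-based recovery. All names are hypothetical.

# Stands in for durable input data, split into two partitions.
source_partitions = {0: [1, 2, 3], 1: [4, 5, 6]}

# The recorded lineage: an ordered list of transformations, not the data itself.
lineage_steps = [
    ("map", lambda x: x * 10),
    ("filter", lambda x: x > 20),
]

def compute_partition(index):
    # Rebuild one partition by replaying every recorded step on its input.
    rows = source_partitions[index]
    for kind, fn in lineage_steps:
        if kind == "map":
            rows = [fn(r) for r in rows]
        else:  # "filter"
            rows = [r for r in rows if fn(r)]
    return rows

# Compute every partition once, as a running job would.
computed = {i: compute_partition(i) for i in source_partitions}

# Simulate a failure that loses partition 1 only.
del computed[1]

# Recovery: replay the lineage for the lost partition only.
computed[1] = compute_partition(1)
print(computed)  # {0: [30], 1: [40, 50, 60]}
```

Only the lost partition is recomputed; partition 0 is left untouched, and no second copy of the computed data had to be kept anywhere.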
  
  For more information on the Directed Acyclic Graph, refer to: DAG in Apache Spark