What is lineage in Apache Spark?

    • #5664
      DataFlair Team
      Spectator

      What is lineage in Spark? How does it make Spark fault tolerant?

    • #5668
      DataFlair Team
      Spectator

      Whenever a series of transformations is performed on an RDD, they are evaluated lazily rather than immediately: nothing is computed until an action is called.

      When a new RDD is created from an existing RDD, the new RDD holds a pointer to its parent RDD. In this way the dependencies between the RDDs, rather than the actual data, are recorded in a graph. This graph is called the lineage graph.

      For example, consider the following operations:

      1. Create a new RDD from a text file: first RDD
      2. Apply a map operation on the first RDD to get the second RDD
      3. Apply a filter operation on the second RDD to get the third RDD
      4. Apply the count action on the third RDD to get a result (count is an action, so it returns a number rather than a fourth RDD)

      The lineage graph of these operations looks like:

      First RDD --> Second RDD (applying map) --> Third RDD (applying filter) --> count (action that triggers the computation)
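To make the parent-pointer idea concrete, here is a minimal pure-Python sketch of the steps above. It is a toy model, not Spark's implementation: `ToyRDD`, `from_lines`, `lineage`, and `_compute` are hypothetical names invented for this illustration, and the input lines are made up. It only shows that transformations record a dependency, while the action walks the chain and computes.

```python
# Toy model of RDD lineage (illustrative only; NOT Spark's implementation).
# ToyRDD and all of its methods are hypothetical names for this sketch.

class ToyRDD:
    def __init__(self, name, parent=None, op=None):
        self.name = name      # label used when printing the lineage
        self.parent = parent  # pointer to the parent RDD (None for the source)
        self.op = op          # function applied to the parent's records
        self.source = None    # raw records; only set on the first RDD

    @classmethod
    def from_lines(cls, lines):
        rdd = cls("textFile")
        rdd.source = list(lines)
        return rdd

    def map(self, f):
        # Transformation: record the dependency, compute nothing yet.
        return ToyRDD("map", self, lambda records: (f(r) for r in records))

    def filter(self, pred):
        # Transformation: again only the dependency is recorded.
        return ToyRDD("filter", self, lambda records: (r for r in records if pred(r)))

    def lineage(self):
        # Follow the parent pointers back to the source.
        chain, node = [], self
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return " <- ".join(chain)

    def _compute(self):
        # Evaluate this RDD by first evaluating its parent.
        if self.parent is None:
            return iter(self.source)
        return self.op(self.parent._compute())

    def count(self):
        # Action: triggers evaluation of the whole chain and returns a number.
        return sum(1 for _ in self._compute())


seen = []  # records every line that the map function actually processes

def length(line):
    seen.append(line)
    return len(line)

first = ToyRDD.from_lines(["spark", "", "lineage", "rdd"])  # step 1
second = first.map(length)                                  # step 2
third = second.filter(lambda n: n > 0)                      # step 3

lineage_text = third.lineage()
evaluated_before_action = len(seen)  # still 0: nothing has run yet
result = third.count()               # step 4: the action runs the chain

print(lineage_text)
print(evaluated_before_action, len(seen), result)
```

In real PySpark, the recorded lineage of an RDD can be inspected with `rdd.toDebugString()`.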

      This lineage graph is useful if any partitions are lost. Spark can replay the transformations for the lost partition using the lineage graph, which is held as part of the DAG (Directed Acyclic Graph), to arrive at the same result, rather than replicating the data across different nodes as HDFS does.
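The recovery idea can be sketched in plain Python as well. This is a simplified illustration under stated assumptions, not how Spark is implemented: `source_partitions`, `lineage`, `compute_partition`, and `cache` are hypothetical names, the data is made up, and only per-partition (narrow) transformations are modelled. With shuffle (wide) dependencies, rebuilding one partition may require data from several parent partitions.

```python
# Toy sketch of lineage-based recovery (illustrative; NOT Spark's implementation).
# All names below are hypothetical and exist only for this example.

# Stands in for durable input, e.g. the blocks of a file.
source_partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# The lineage: an ordered list of transformations, each applied per partition.
lineage = [
    ("map", lambda part: [x * 10 for x in part]),
    ("filter", lambda part: [x for x in part if x % 20 == 0]),
]

def compute_partition(index):
    """Rebuild one partition by replaying the lineage on its source partition."""
    data = source_partitions[index]
    for _name, step in lineage:
        data = step(data)
    return data

# Materialise every partition once, as if each lived on a different node.
cache = {i: compute_partition(i) for i in range(len(source_partitions))}
expected = dict(cache)

# Simulate a node failure: the computed data for partition 1 is gone.
del cache[1]

# Recovery: recompute only the missing partition from the lineage.
recomputed = []
for i in range(len(source_partitions)):
    if i not in cache:
        cache[i] = compute_partition(i)
        recomputed.append(i)

print(recomputed)
print(cache[1])
```

Only the lost partition is recomputed; the surviving partitions are left untouched, and no second copy of the computed data had to be stored anywhere.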

      For information on the Directed Acyclic Graph, refer to: DAG in Apache Spark
