How to attain fault tolerance in Spark?

  • #4797
    DataFlair Team

      How is fault tolerance achieved in Apache Spark?
      Is Apache Spark fault tolerant? If yes, how?

  • #4798
    DataFlair Team

      Yes, Apache Spark is fault tolerant because of its core abstraction, the RDD (Resilient Distributed Dataset). Spark is designed to handle the failure of any worker node in the cluster, and in this way it makes sure that the loss of data is reduced to zero.

      Spark usually operates on data stored in fault-tolerant file systems such as HDFS or S3, so the input data itself is safe against failures. But for streaming/live data received over the network, this guarantee does not hold, which is why Spark needs its own fault-tolerance mechanisms. The basic fault-tolerance semantics of Spark are:

      Every Spark RDD remembers the lineage of deterministic operations that were applied to a fault-tolerant input dataset to create it. This is possible because RDDs are immutable (a short sketch follows this list).

      If any partition of an RDD is lost due to a worker node failure, that partition can be recomputed from the original fault-tolerant dataset using the lineage of operations.

      The data in the final transformed RDD will always be the same irrespective of failures in the Spark cluster, assuming that all of the RDD transformations are deterministic.
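
      Below is a minimal Scala sketch of the lineage idea (the HDFS path and the word-count pipeline are placeholders, not part of the original answer). It builds an RDD through a chain of transformations and prints the recorded lineage with toDebugString; this lineage is what Spark replays to recompute a lost partition:

      import org.apache.spark.{SparkConf, SparkContext}

      object LineageDemo {
        def main(args: Array[String]): Unit = {
          val conf = new SparkConf().setAppName("LineageDemo").setMaster("local[*]")
          val sc = new SparkContext(conf)

          // Placeholder input path; any fault-tolerant store (HDFS, S3) works.
          val lines  = sc.textFile("hdfs:///data/events.txt")
          val words  = lines.flatMap(_.split(" "))   // transformation 1
          val pairs  = words.map(w => (w, 1))        // transformation 2
          val counts = pairs.reduceByKey(_ + _)      // transformation 3

          // Prints the lineage graph. If a partition of `counts` is lost,
          // Spark re-runs exactly these steps for that partition only.
          println(counts.toDebugString)

          sc.stop()
        }
      }

      Because every step here is deterministic, replaying the lineage yields exactly the same data as before the failure, which is the third semantic above.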

      For Spark Streaming, the received data is replicated among multiple Spark executors on worker nodes in the cluster, to achieve fault tolerance for all the generated RDDs. This leaves two kinds of data that need to be recovered in the event of failure:

      – Data received and replicated
      – Data received but buffered for replication
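
      As an illustration of the replication side (host, port, and checkpoint directory below are placeholder values, not part of the original answer), the following Spark Streaming sketch stores each received block on two executors via a replicated storage level, so data that has already been received and replicated survives the loss of a single worker node:

      import org.apache.spark.SparkConf
      import org.apache.spark.storage.StorageLevel
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      object ReplicatedReceiverDemo {
        def main(args: Array[String]): Unit = {
          val conf = new SparkConf()
            .setAppName("ReplicatedReceiverDemo")
            .setMaster("local[2]")   // one core for the receiver, one for processing
          val ssc = new StreamingContext(conf, Seconds(10))

          // Placeholder checkpoint directory; in production this should sit
          // on a fault-tolerant store such as HDFS or S3.
          ssc.checkpoint("hdfs:///checkpoints/demo")

          // MEMORY_AND_DISK_SER_2 keeps two copies of every received block
          // on different executors, covering the "received and replicated" case.
          val lines = ssc.socketTextStream("localhost", 9999,
            StorageLevel.MEMORY_AND_DISK_SER_2)

          lines.count().print()

          ssc.start()
          ssc.awaitTermination()
        }
      }

      Data that was received but only buffered for replication can additionally be protected by enabling the receiver write-ahead log (spark.streaming.receiver.writeAheadLog.enable), which persists incoming data to the checkpoint directory before it is acknowledged.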

      To learn about fault tolerance in Apache Spark in detail, follow the link: Fault Tolerance in Apache Spark – Reliable Spark Streaming
