How is fault tolerance achieved in Apache Spark?

Viewing 1 reply thread
  • Author
    Posts
    • #5848
      DataFlair Team
      Moderator

      How to attain fault tolerance in Spark?
      Is Apache Spark fault tolerant? if yes, how?

    • #5850
      DataFlair Team
      Moderator

      The basic semantics of fault tolerance in Apache Spark is, all the Spark RDDs are immutable. It remembers the dependencies between every RDD involved in the operations, through the lineage graph created in the DAG, and in the event of any failure, Spark refers to the lineage graph to apply the same operations to perform the tasks.

      There are two types of failures – Worker or driver failure. In case if the worker fails, the executors in that worker node will be killed, along with the data in their memory. Using the lineage graph, those tasks will be accomplished in any other worker nodes. The data is also replicated to other worker nodes to achieve fault tolerance. There are two cases:

      1.Data received and replicated – Data is received from the source, and replicated across worker nodes. In the case of any failure, the data replication will help achieve fault tolerance.

      2.Data received but not yet replicated – Data is received from the source but buffered for replication. In the case of any failure, the data needs to be retrieved from the source.

      For stream inputs based on receivers, the fault tolerance is based on the type of receiver:

        <li style=”list-style-type: none”>
      • Reliable receiver – Once the data is received and replicated, an acknowledgment is sent to the source. In case if the receiver fails, the source will not receive acknowledgment for the received data. When the receiver is restarted, the source will resend the data to achieve fault tolerance.
      • Unreliable receiver – The received data will not be acknowledged to the source. In this case of any failure, the source will not know if the data has been received or not, and it will nor resend the data, so there is data loss.

      To overcome this data loss scenario, Write Ahead Logging (WAL) has been introduced in Apache Spark 1.2. With WAL enabled, the intention of the operation is first noted down in a log file, such that if the driver fails and is restarted, the noted operations in that log file can be applied to the data. For sources that read streaming data, like Kafka or Flume, receivers will be receiving the data, and those will be stored in the executor’s memory. With WAL enabled, these received data will also be stored in the log files.

      WAL can be enabled by performing the below:

      • Setting the checkpoint directory, by using streamingContext.checkpoint(path)
      • Enabling the WAL logging, by setting spark.stream.receiver.WriteAheadLog.enable to True.

      to know more about fault tolerance refer: Fault Tolerance in Spark

Viewing 1 reply thread
  • You must be logged in to reply to this topic.