Site icon DataFlair

Fault tolerance in Apache Spark – Reliable Spark Streaming

1. Objective

In this Spark fault tolerance tutorial, we will learn what do you mean by fault tolerance and how Apache Spark handles fault tolerance. We will see fault-tolerant stream processing with Spark Streaming and Spark RDD fault tolerance. We will also learn what is Spark Streaming write ahead log, Spark streaming driver failure, Spark streaming worker failure to understand how to achieve fault tolerance in Apache Spark.

Fault tolerance in Apache Spark – Reliable Spark Streaming

2. Introduction to Fault Tolerance in Apache Spark

Before we start with learning what is fault tolerance in Apache Spark, let us revise concepts of Apache Spark for beginners.
Now let’s understand what is fault and how Spark handles fault tolerance.
Fault refers to failure, thus fault tolerance in Apache Spark is the capability to operate and to recover loss after a failure occurs. If we want our system to be fault tolerant, it should be redundant because we require a redundant component to obtain the lost data. The faulty data recovers by redundant data.

3. Spark RDD Fault Tolerance

Let us firstly see how to create RDDs in Spark? Spark operates on data in fault-tolerant file systems like HDFS or S3. So all the RDDs generated from fault tolerant data is fault tolerant. But this does not set true for streaming/live data (data over the network). So the key need of fault tolerance in Spark is for this kind of data. The basic fault-tolerant semantic of Spark are:

To achieve fault tolerance for all the generated RDDs, the achieved data replicates among multiple Spark executors in worker nodes in the cluster. This results in two types of data that needs to recover in the event of failure:

Fault tolerance in Apache Spark – Reliable Spark Streaming

Failure also occurs in worker as well as driver nodes.

Apache Mesos helps in making the Spark master fault tolerant by maintaining the backup masters. It is open source software residing between the application layer and the operating system. It makes easier to deploy and manage applications in large-scale clustered environment.  Executors are relaunched if they fail. Post failure, executors are relaunched automatically and spark streaming does parallel recovery by recomputing Spark RDD’s on input data. Receivers are restarted by the workers when they fail.

4. Fault Tolerance with Receiver-based sources

For input sources based on receivers, the fault tolerance depends on both- the failure scenario and the type of receiver. There are two types of receiver:

If the worker node fails, and the receiver is reliable there will be no data loss. But in the case of unreliable receiver data loss will occur. With the unreliable receiver, data received but not replicated can be lost.

5. Spark Streaming write ahead logs

If the driver node fails, all the data that was received and replicated in memory will be lost. This will affect the result of the stateful transformation. To avoid the loss of data, Spark 1.2 introduced write ahead logs, which save received data to fault-tolerant storage. All the data received is written to write ahead logs before it can be processed to Spark Streaming.
Write ahead logs are used in database and file system. It ensure the durability of any data operations. It works in the way that first the intention of the operation is written down in the durable log. After this, the operation is applied to the data. This is done because if the system fails in the middle of applying the operation, the lost data can be recovered. It is done by reading the log and reapplying the data it has intended to do.

Deployment Scenario Worker Failure Driver Failure
Spark 1.1 or earlier Buffered data lost with unreliable receivers Buffered data lost with the unreliable receiver.
Spark 1.2 or later without write ahead logs Zero data loss with reliable receivers, At-least-once semantics Past data lost with all receivers, Undefined semantics
Spark 1.2 or later with write ahead logs Zero data loss with reliable receivers, At-least-once semantics Zero data loss with reliable receivers and files, At-least-once semantics

6. High Availability

We can say any system highly available if its downtime is tolerable. This time depends on how critical the application is. Zero down time is an imaginary term for any system. Consider any machine has an uptime of 97.7%, so its probability to go down will be 0.023. If we have similar two machines, then the probability of both of them going down will be (0.023*0.023). in most high availability environment we have three machines in use, in that case, the probability of going down is (0.023*0.023*0.023) i.e. 0.000012167, which guarantees an uptime of system to be  99.9987833%  which is highly acceptable uptime guarantee. 6. Apache Spark High Availability

7. Conclusion

Hence we have studied Fault Tolerance in Apache Spark. I hope this blog helps you a lot to understand How Apache Spark is Fault Tolerant Framework. if Still you have doubt related to Fault Tolerance in Apache Spark, so leave a comment in a section given below.
See Also-

Exit mobile version