How to attain fault tolerance in Spark?
This topic has 1 reply, 1 voice, and was last updated 5 years, 6 months ago by DataFlair Team.
September 20, 2018 at 12:17 pm #4797 by DataFlair Team (Spectator)
How is fault tolerance achieved in Apache Spark?
Is Apache Spark fault tolerant? If yes, how?
September 20, 2018 at 12:17 pm #4798 by DataFlair Team (Spectator)
Yes, Apache Spark offers fault tolerance through its core abstraction, the RDD (Resilient Distributed Dataset). Spark transparently handles the failure of any worker node in the cluster, so that lost partitions can be recomputed and data loss is avoided.
Spark normally operates on data stored in fault-tolerant file systems such as HDFS or S3, so the input itself can always be re-read. For streaming/live data received over the network, however, this does not hold, and Spark must provide the fault tolerance itself. The basic fault-tolerance semantics of Spark are:
– Every Spark RDD remembers the lineage of deterministic operations that produced it from the input dataset; this is possible because RDDs are immutable.
– If any partition of an RDD is lost due to a worker node failure, that partition can be recomputed from the original fault-tolerant dataset using the lineage of operations.
– Assuming all RDD transformations are deterministic, the data in the final transformed RDD will always be the same irrespective of failures in the Spark cluster.
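The lineage-based recovery described above can be sketched with a toy example. This is plain Python, not the real Spark API: the `MiniRDD` class and its methods are invented for illustration only, to show how storing the parent dataset plus a deterministic transformation is enough to recompute a lost partition.

```python
# Toy illustration (NOT the real Spark API) of lineage-based recovery:
# an "RDD" stores only its parent and a deterministic transformation,
# so any lost partition can be recomputed from the fault-tolerant source.

class MiniRDD:
    def __init__(self, partitions, parent=None, fn=None):
        self.partitions = partitions  # list of lists, one per partition
        self.parent = parent          # lineage: the RDD this one came from
        self.fn = fn                  # deterministic transformation applied

    def map(self, fn):
        new_parts = [[fn(x) for x in p] for p in self.partitions]
        return MiniRDD(new_parts, parent=self, fn=fn)

    def recompute_partition(self, i):
        # Walk the lineage back to the source, then re-apply transformations.
        if self.parent is None:
            # The source is assumed fault tolerant (e.g. a block in HDFS).
            return self.partitions[i]
        return [self.fn(x) for x in self.parent.recompute_partition(i)]

source = MiniRDD([[1, 2], [3, 4]])   # e.g. blocks read from HDFS
doubled = source.map(lambda x: x * 2)

doubled.partitions[1] = None         # simulate losing a partition on a worker
recovered = doubled.recompute_partition(1)
print(recovered)                     # [6, 8]
```

Because the transformation is deterministic and the source is immutable, the recomputed partition is guaranteed to equal the one that was lost.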
For streaming workloads, the received data is replicated among multiple Spark executors on worker nodes in the cluster, so that all generated RDDs are fault tolerant. This leaves two kinds of data that need to be recovered in the event of a failure:
– Data received and replicated
– Data received but buffered for replication
To learn about fault tolerance in Apache Spark in detail, follow the link: Fault tolerance in Apache Spark – Reliable Spark Streaming
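The distinction between the two cases can be illustrated with a toy simulation (plain Python, not real Spark; `surviving_blocks` and the executor layout are invented for illustration): a block that has already been replicated to a second executor survives a single executor failure, while a block still buffered for replication on the failed executor is lost unless the source can resend it.

```python
# Toy illustration (NOT real Spark) of the two recovery cases for received
# streaming data after one executor fails:
#   - blocks already replicated to another executor remain recoverable;
#   - blocks only buffered on the failed executor are lost.

def surviving_blocks(executors, failed):
    """Return the set of blocks still recoverable after `failed` crashes."""
    recoverable = set()
    for name, blocks in executors.items():
        if name != failed:
            recoverable |= set(blocks)
    return recoverable

executors = {
    "exec-1": {"block-A", "block-B", "block-C"},  # block-C not yet replicated
    "exec-2": {"block-A", "block-B"},             # replicas of A and B
}

print(surviving_blocks(executors, "exec-1"))  # {'block-A', 'block-B'}
```

In real Spark Streaming the first case is handled by the replication itself, while the second is addressed by reliable receivers that acknowledge data only after replication, or by a write-ahead log.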