What is write ahead log(journaling) in Spark?

Viewing 1 reply thread
  • Author
    Posts
    • #5840
      DataFlair Team
      Moderator

      Define journaling in Apache Spark.
      How is fault-tolerance achieved through the write-ahead log in Spark?

    • #5842
      DataFlair Team
      Moderator

      There are two types of failures in any Apache Spark job – Either the driver failure or the worker failure.

      When any worker node fails, the executor processes running in that worker node will be killed, and the tasks which were scheduled on that worker node will be automatically moved to any of the other running worker nodes, and the tasks will be accomplished.

      When the driver or master node fails, all of the associated worker nodes running the executors will be killed, along with the data in each of the executors’ memory. In the case of files being read from reliable and fault tolerant file systems like HDFS, zero data loss is always guaranteed, as the data is ready to be read anytime from the file system. Checkpointing also ensures fault tolerance in Spark by periodically saving the application data in specific intervals.

      In the case of Spark Streaming application, zero data loss is not always guaranteed, as the data will be buffered in the executors’ memory until they get processed. If the driver fails, all of the executors will be killed, with the data in their memory, and the data cannot be recovered.

      To overcome this data loss scenario, Write Ahead Logging (WAL) has been introduced in Apache Spark 1.2. With WAL enabled, the intention of the operation is first noted down in a log file, such that if the driver fails and is restarted, the noted operations in that log file can be applied to the data. For sources that read streaming data, like Kafka or Flume, receivers will be receiving the data, and those will be stored in the executor’s memory. With WAL enabled, these received data will also be stored in the log files.

      WAL can be enabled by performing the below:

      1. Setting the checkpoint directory, by using streamingContext.checkpoint(path)

      2. Enabling the WAL logging, by setting spark.stream.receiver.WriteAheadLog.enable to True.

Viewing 1 reply thread
  • You must be logged in to reply to this topic.