Why are transformations lazy in Spark?

  • Author
    • #5624
      DataFlair Team

      What is the need for lazy evaluation of transformations in Spark?
      Why are transformations lazily evaluated while actions are eager?

    • #5626
      DataFlair Team

      Whenever a transformation is applied in Apache Spark, it is lazily evaluated: it is not executed until an action is called. Spark only records the transformation in the DAG (Directed Acyclic Graph) of the computation, which is a finite directed graph with no cycles. The operations in this DAG are grouped into stages, and no data is shuffled within a single stage.
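
      The idea of recording transformations and running them only when an action is called can be sketched in plain Python. This is a toy model, not Spark's API: the `LazyDataset` class, its `plan` attribute, and the `double` helper are hypothetical names invented for illustration, and the plan here is a simple list of steps rather than a real DAG with stages.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, not executed."""

    def __init__(self, source, plan=()):
        self.source = source     # any iterable of records
        self.plan = tuple(plan)  # recorded (name, function) steps

    def map(self, fn):
        # Transformation: return a new dataset with one more plan entry.
        return LazyDataset(self.source, self.plan + (("map", fn),))

    def filter(self, pred):
        # Transformation: again only the plan grows, no data is touched.
        return LazyDataset(self.source, self.plan + (("filter", pred),))

    def collect(self):
        # Action: only now is the recorded plan applied to the data.
        records = iter(self.source)
        for name, fn in self.plan:
            records = map(fn, records) if name == "map" else filter(fn, records)
        return list(records)


calls = []  # every value the map function actually sees


def double(x):
    calls.append(x)
    return x * 2


ds = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
steps_recorded = [name for name, _ in ds.plan]
calls_before_action = list(calls)  # still empty: nothing has run yet
result = ds.collect()              # the action triggers execution
```

      In this sketch, `double` is never called while the pipeline is being built; it runs only once `collect()` is invoked.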

      This way, Spark can optimize execution by looking at the DAG in its entirety and return the appropriate result to the driver program.

      For example, consider a 1 TB log file in HDFS containing errors, warnings, and other information. The driver program performs the following operations:

      1. Create an RDD from this log file.
      2. Perform a flatMap() operation on this RDD to split the data in the log file on the tab delimiter.
      3. Perform a filter() operation to keep only the data containing error messages.
      4. Perform a first() operation to fetch only the first error message.

      If all the transformations in the above driver program were evaluated eagerly, the whole log file would be loaded into memory and every line would be split on the tab delimiter. Spark would then have to either write the output of flatMap() somewhere or keep it in memory, holding those resources until the next operation runs. In addition, each operation would need its own scan over all the records: flatMap() would process every record, and then filter() would process them all again.
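
      The cost of the eager approach can be illustrated with a small plain-Python sketch. It is only an analogy running in one process, not Spark code: `LOG_LINES` is a made-up five-line stand-in for the 1 TB file, and `eager_first_error` and the `work` counters are hypothetical names.

```python
# Hypothetical tab-delimited log; a tiny stand-in for the 1 TB file.
LOG_LINES = [
    "host-1\tINFO service started",
    "host-2\tWARN disk usage high",
    "host-1\tERROR connection refused",
    "host-3\tINFO retrying",
    "host-2\tERROR request timed out",
]


def eager_first_error(lines):
    work = {"lines_read": 0, "fields_held": 0, "fields_checked": 0}

    # Steps 1-2: read every line and split it, keeping the full result in memory.
    fields = []
    for line in lines:
        work["lines_read"] += 1
        fields.extend(line.split("\t"))
    work["fields_held"] = len(fields)

    # Step 3: a second full pass over the materialized intermediate result.
    errors = []
    for field in fields:
        work["fields_checked"] += 1
        if field.startswith("ERROR"):
            errors.append(field)

    # Step 4: only now is the single record that was wanted picked out.
    return errors[0], work


first_error, work = eager_first_error(LOG_LINES)
```

      Here every line is read and every field is held and checked, although only one error message is ever returned.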

      On the other hand, when the transformations are lazily evaluated, Spark looks at the DAG as a whole and prepares an execution plan for the application. The plan is optimized, and the operations are combined into stages before execution starts. In this example, flatMap() and filter() can be applied to each record in a single pass, and first() needs only enough input to produce one matching record. The optimized plan created by Spark improves the job’s efficiency and overall throughput.
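
      A rough single-process analogy for this behaviour uses Python generators, which also do no work until a result is requested. This is a sketch rather than Spark itself: `LOG_LINES`, `read_lines`, and the `lines_read` counter are hypothetical, and real Spark plans its work per partition and stage instead of per record in one process.

```python
from itertools import chain

# Hypothetical tab-delimited log; a tiny stand-in for the 1 TB file.
LOG_LINES = [
    "host-1\tINFO service started",
    "host-2\tWARN disk usage high",
    "host-1\tERROR connection refused",
    "host-3\tINFO retrying",
    "host-2\tERROR request timed out",
]

lines_read = 0


def read_lines(lines):
    # Step 1: a lazy source that counts how many lines are actually pulled.
    global lines_read
    for line in lines:
        lines_read += 1
        yield line


source = read_lines(LOG_LINES)
# Step 2: split each line on tabs, flattening the pieces into one stream.
fields = chain.from_iterable(line.split("\t") for line in source)
# Step 3: keep only the fields that hold an error message.
errors = (field for field in fields if field.startswith("ERROR"))

lines_read_before_action = lines_read  # nothing has been consumed yet
# Step 4: the "action" pulls records through all steps until one matches.
first_error = next(errors)
```

      Building the pipeline reads no lines at all, and fetching the first error stops at the third line, with no intermediate list of fields ever being stored.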

      Lazy evaluation also reduces the number of round trips between the driver program and the cluster, which saves time and memory and speeds up the computation.

      For more details, see
      Lazy Evaluation in Spark
