In this Apache Spark lazy evaluation tutorial, we will look at what lazy evaluation in Apache Spark is, how Spark manages the lazy evaluation of RDD transformations, why Spark evaluates lazily, and what advantages lazy evaluation brings to Spark transformations.
2. What is Lazy Evaluation in Apache Spark?
Before starting with lazy evaluation in Spark, it helps to be familiar with the basic Apache Spark concepts.
As the name suggests, lazy evaluation in Spark means that execution does not start until an action is triggered. Lazy evaluation comes into play whenever Spark transformations are applied.
Transformations are lazy in nature: when we call an operation on an RDD, it does not execute immediately. Instead, Spark keeps a record of which operations have been called (through a DAG). We can think of a Spark RDD as data that we build up through transformations. Because transformations are lazy, the operations run only when we call an action on the data. Hence, with lazy evaluation, data is not loaded until it is necessary.
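The idea of recording transformations and running them only when an action is called can be sketched with a toy model in plain Python. This is not Spark's API: the class name LazyDataset and its methods are hypothetical, chosen only to mirror the transformation/action split described above, and the recorded steps form a simple linear chain rather than a full DAG.

```python
# Toy model of lazy evaluation in plain Python. This is NOT Spark's API:
# LazyDataset and its methods are hypothetical names used for illustration.

class LazyDataset:
    def __init__(self, source, ops=()):
        self.source = source      # the underlying data
        self.ops = tuple(ops)     # recorded transformations, in call order

    def map(self, fn):
        # Transformation: only record the step, do not touch the data.
        return LazyDataset(self.source, self.ops + (("map", fn),))

    def filter(self, pred):
        # Transformation: only record the step, do not touch the data.
        return LazyDataset(self.source, self.ops + (("filter", pred),))

    def collect(self):
        # Action: replay every recorded step over the data.
        result = []
        for item in self.source:
            keep = True
            for kind, fn in self.ops:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                result.append(item)
        return result


calls = []  # records every element the map function actually processes

def double(x):
    calls.append(x)
    return x * 2

ds = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)   # still 0: nothing has run yet
result = ds.collect()              # the action triggers execution
calls_after_action = len(calls)    # now every element has been processed
```

Here `map` and `filter` play the role of transformations and `collect` plays the role of an action. The `double` function is not called at all until `collect` runs.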
In MapReduce, developers spend a lot of time working out how to group operations together in order to minimize the number of MapReduce passes. In Spark, we do not need to hand-craft a single complex execution graph; instead, we chain many simple operations and Spark groups them together. This is one of the key differences between Hadoop MapReduce and Apache Spark.
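To make the benefit of grouping operations concrete, here is a minimal plain-Python sketch (not Spark code) that contrasts running each step as its own pass with composing the steps into a single pass. The function names eager_pipeline and fused_pipeline are hypothetical; the sketch only illustrates the general idea of pipelining simple operations.

```python
# Plain-Python sketch, not Spark code: all names here are hypothetical.

def eager_pipeline(data, steps):
    """Run each step as its own pass, materializing a full list every time."""
    passes = 0
    for step in steps:
        data = [step(x) for x in data]   # one full pass per step
        passes += 1
    return data, passes

def fused_pipeline(data, steps):
    """Compose all steps and apply them in a single pass over the data."""
    def fused(x):
        for step in steps:
            x = step(x)
        return x
    return [fused(x) for x in data], 1   # one pass, no intermediate lists

steps = [lambda x: x + 1, lambda x: x * 10, lambda x: x - 3]
eager_result, eager_passes = eager_pipeline([1, 2, 3], steps)
fused_result, fused_passes = fused_pipeline([1, 2, 3], steps)
```

Both versions produce the same values, but the fused version touches the data once instead of once per step. Because a lazy system sees the whole chain before running anything, it is in a position to do this kind of grouping on the user's behalf.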
In Spark, the driver program sends the code to the cluster. If the code were executed after every single operation, the job would consume a lot of time and memory, since the data would have to go to the cluster for evaluation each time.
3. Advantages of Lazy Evaluation in Spark Transformation
Lazy evaluation in Apache Spark has several benefits:
a. Increases Manageability
With lazy evaluation, users can organize their Apache Spark program into smaller operations. Spark then reduces the number of passes over the data by grouping those operations.
b. Saves Computation and Increases Speed
Lazy evaluation plays a key role in saving computation overhead, since only the values that are actually needed get computed. It also saves round trips between the driver and the cluster, which speeds up the process.
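The point that only the necessary values get computed can be illustrated with Python's own lazy iterators. This is an analogy rather than Spark code: `islice` stands in for a "take the first n elements" action, and the helper expensive_square is a hypothetical function. A real Spark job works partition by partition, so the exact amount of work it saves will differ.

```python
from itertools import islice

computed = []  # records which inputs were actually processed

def expensive_square(x):
    # Hypothetical stand-in for a costly per-element computation.
    computed.append(x)
    return x * x

numbers = range(1, 1_000_001)               # one million candidate inputs
squares = map(expensive_square, numbers)    # lazy in Python 3: nothing runs yet
first_three = list(islice(squares, 3))      # only now are values computed
```

Although the pipeline is defined over a million inputs, `expensive_square` is called only for the three elements that the final step actually asks for.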
c. Reduces Complexities
The two main complexities of any operation are time complexity and space complexity, and lazy evaluation helps with both. Because not every operation is executed, time is saved. Lazy evaluation also lets us work with conceptually infinite data structures. An action is triggered only when the data is required, which reduces overhead.
Lazy evaluation also enables optimization by reducing the number of queries.
Hence, lazy evaluation enhances the power of Apache Spark by reducing the execution time of RDD operations. Spark maintains the lineage graph to remember the operations applied to an RDD. As a result, it optimizes performance and achieves fault tolerance.
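The link between a recorded lineage and fault tolerance can be sketched in plain Python as follows. This is a toy model under stated assumptions, not how Spark is implemented: the class TrackedDataset and its methods are hypothetical, and "losing" a result is simulated by simply discarding a cached list.

```python
# Toy model of lineage-based recovery. NOT Spark's API: TrackedDataset
# and its methods are hypothetical names used for illustration.

class TrackedDataset:
    def __init__(self, source, lineage=()):
        self.source = source
        self.lineage = tuple(lineage)   # ordered (name, function) steps
        self._cache = None              # computed result, if still available

    def map(self, name, fn):
        # Record the step in the lineage without running it.
        return TrackedDataset(self.source, self.lineage + ((name, fn),))

    def compute(self):
        # Rebuild the result from the source by replaying the lineage
        # whenever no computed result is available.
        if self._cache is None:
            data = list(self.source)
            for _, fn in self.lineage:
                data = [fn(x) for x in data]
            self._cache = data
        return self._cache

    def lose_result(self):
        # Simulate losing the computed data, for example on a failed node.
        self._cache = None


ds = (TrackedDataset([1, 2, 3])
      .map("add_one", lambda x: x + 1)
      .map("square", lambda x: x * x))

first = ds.compute()
ds.lose_result()
recovered = ds.compute()            # replayed from the recorded lineage
lineage_names = [name for name, _ in ds.lineage]
```

Because the steps needed to produce the data are remembered, a lost result can be rebuilt from the source instead of being stored redundantly.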
If you liked this blog or have any questions, please leave a comment.