Lazy evaluation means that Spark does not evaluate each transformation as it arrives. Instead, it queues the transformations together and evaluates them all at once, when an action is called.
The benefit of this approach is that Spark can make optimization decisions after it has had a chance to look at the DAG in its entirety. This would not be possible if it executed each operation as soon as it received it. As a result, a large volume of network I/O can be avoided that could otherwise have caused a serious bottleneck.
Suppose we have a file words.txt whose first line is "line1 word1", and we want the first line that contains "word1":
scala> val lines = sc.textFile("words.txt")
scala> val filtered = lines.filter(line => line.contains("word1"))
scala> filtered.first()
res0: String = line1 word1
If Spark evaluated each line immediately, it would read the whole file, apply the filter transformation to every line, and only then return the first line of the filtered result. That would mean a lot of extra work and unnecessary memory use.
With lazy evaluation, on the other hand, Spark first builds the entire DAG and then, applying its optimizations, recognizes that reading the entire file is unnecessary: the same result can be obtained by reading only until the first matching line is found.
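The same early-termination benefit can be sketched in plain Scala, without a Spark cluster, using a lazy `Iterator` as a stand-in for an RDD. The `firstMatch` helper and its sample input are hypothetical, introduced only for illustration; the point is that `map` and `filter` on an iterator build a plan without doing work, and the terminal `next()` call (the "action") pulls only as many lines as needed.

```scala
object LazyDemo {
  // Returns (first matching line, number of lines actually "read").
  // A counter on the lazy map step lets us observe how much work was done.
  def firstMatch(file: Seq[String], word: String): (String, Int) = {
    var linesRead = 0
    val lines = file.iterator.map { l => linesRead += 1; l } // lazy "read"
    val filtered = lines.filter(_.contains(word))            // lazy transformation
    // At this point linesRead is still 0: nothing has been evaluated yet.
    val first = filtered.next()                              // the "action"
    (first, linesRead)
  }
}
```

If the first line already matches, only one line is ever pulled through the pipeline, just as Spark stops early when `first()` finds its result:

```scala
LazyDemo.firstMatch(Seq("line1 word1", "line2 word2", "line3 word1"), "word1")
// yields ("line1 word1", 1): one line read, not three
```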