What are the benefits of lazy evaluation in RDD in Apache Spark?

This topic contains 1 reply, has 1 voice, and was last updated by  dfbdteam5 9 months, 4 weeks ago.



    Why did lazy evaluation come into the picture in Apache Spark?
    Discuss the benefits of lazy evaluation in Apache Spark.



    Lazy evaluation means that Spark does not evaluate each transformation as it arrives. Instead, it queues the transformations up and evaluates them all at once, when an action is called.

    The benefit of this approach is that Spark can make optimization decisions after it has had a chance to look at the DAG in its entirety. This would not be possible if it executed everything as soon as it received it. As a result, a large volume of network I/O can be avoided, which could otherwise cause a serious bottleneck.
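    The same queue-then-evaluate pattern can be seen in plain Scala, without Spark, using a lazy view. This is only an analogy (a view is not an RDD), but it shows how transformations are recorded rather than executed, and how a terminal operation triggers the work:

    ```scala
    // A lazy view queues transformations and evaluates them only on demand,
    // much like an RDD queues transformations until an action is called.
    val data = (1 to 1000000).view          // like an RDD: nothing computed yet
    val doubled = data.map(_ * 2)           // transformation: still nothing computed
    val evens = doubled.filter(_ % 4 == 0)  // another queued transformation
    val result = evens.take(3).toList       // "action": forces evaluation of only what is needed
    // result == List(4, 8, 12); the million-element range was never fully traversed
    ```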

    Suppose we have a file words.txt containing the following lines:

    line1 word1
    line2 word2 word1
    line3 word3 word4
    line4 word1

    Next, we apply the following operations.

    scala> val lines = sc.textFile("words.txt")
    scala> val filtered = lines.filter(line => line.contains("word1"))
    scala> filtered.first()
    res0: String = line1 word1

    If Spark evaluated each line immediately, it would read the whole file, apply the filter transformation to every line, and only then return the first line of the filtered result. That would mean a lot of extra work and unnecessary memory utilization.

    With lazy evaluation, on the other hand, Spark first builds the entire DAG and then, when first() is called, recognizes that reading the entire file is not necessary: it can stop as soon as it finds the first line that passes the filter.
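    A rough plain-Scala sketch of why first() is cheap under laziness: with an Iterator standing in for the lines of words.txt (this is an analogy, not Spark's implementation), a counter shows how many lines are actually pulled before the first match is found:

    ```scala
    // Iterator stands in for the lazily-read lines of words.txt.
    val lines = Iterator("line1 word1", "line2 word2 word1",
                         "line3 word3 word4", "line4 word1")
    var linesRead = 0
    val counted = lines.map { l => linesRead += 1; l }  // tracks how many lines are pulled
    val filtered = counted.filter(_.contains("word1"))  // queued, not evaluated yet
    val first = filtered.next()                         // evaluates only until the first match
    // first == "line1 word1"; linesRead == 1, so the rest of the "file" was never read
    ```

    Here only one line is ever touched, because the first line already matches; an eager evaluation would have read and filtered all four.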

    Please read Lazy evaluation in Spark for more detail.

