What is the difference between Caching and Persistence in Apache Spark?

This topic contains 1 reply, has 1 voice, and was last updated by  dfbdteam5 10 months ago.

Viewing 2 posts - 1 through 2 (of 2 total)


    Compare cache() vs persist() in Spark.
    Where should we use cache() and where should we use persist()?
    Why do we need to call cache() or persist() on an RDD?



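    Both cache() and persist() are optimization techniques for Spark computations: they keep an RDD's computed results so that later actions can reuse them instead of recomputing them.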

    cache() is shorthand for persist() with the MEMORY_ONLY storage level; that is, for an RDD, cache() keeps the intermediate result in memory only, so it can be reused when needed.

    persist() marks an RDD for persistence with a storage level you choose: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, or MEMORY_AND_DISK_2.

    Just because you can cache an RDD in memory doesn't mean you should blindly do so. Depending on how many times the dataset is accessed and how much work it takes to compute, recomputation can be cheaper than the cost of the increased memory pressure.

    It should go without saying that if you read a dataset only once, there is no point in caching it; doing so will actually make your job slower.
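The trade-off described above can be illustrated with a small plain-Python model. This is only a sketch, not Spark code: `ToyDataset`, its `computations` counter, and the `squares` function are hypothetical names made up for the illustration, and the storage level is just a label here. It mimics two behaviours from the answer: `cache()` delegates to `persist()` with the memory-only level, and a persisted dataset is computed once and then reused, while an unpersisted one is recomputed on every read.

```python
from typing import Callable, List, Optional


class ToyDataset:
    """Plain-Python stand-in for a lazily evaluated dataset (NOT a Spark RDD)."""

    def __init__(self, compute: Callable[[], List[int]]) -> None:
        self._compute = compute                    # how to rebuild the data from scratch
        self._level: Optional[str] = None          # requested storage level, if any
        self._stored: Optional[List[int]] = None   # materialized copy, if kept
        self.computations = 0                      # how many times the data was rebuilt

    def persist(self, level: str = "MEMORY_ONLY") -> "ToyDataset":
        # Only marks the dataset; nothing is computed until the first read.
        self._level = level
        return self

    def cache(self) -> "ToyDataset":
        # Shorthand for persist() with the memory-only level.
        return self.persist("MEMORY_ONLY")

    def collect(self) -> List[int]:
        # A read ("action"): reuse the stored copy when there is one.
        if self._stored is not None:
            return self._stored
        self.computations += 1
        data = self._compute()
        if self._level is not None:
            self._stored = data  # keep a copy, which costs memory
        return data


def squares() -> List[int]:
    # Stand-in for an expensive chain of transformations.
    return [x * x for x in range(5)]


# Read three times without caching: the work is redone on every read.
uncached = ToyDataset(squares)
for _ in range(3):
    uncached.collect()

# Read three times with caching: the work is done once, then reused.
cached = ToyDataset(squares).cache()
for _ in range(3):
    cached.collect()

# Read once with caching: no computation is saved, but a copy is still held.
read_once = ToyDataset(squares).cache()
read_once.collect()

print(uncached.computations, cached.computations, read_once.computations)  # prints: 3 1 1
```

In this model the dataset that is read only once gains nothing from `cache()` and still pays for the stored copy, which is the situation the answer warns about. The model does not capture memory pressure or eviction, which are what make unnecessary caching actively harmful in a real job.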

    For detailed information on this topic, read Persistence and Caching in Apache Spark.
