    Compare Cache vs Persist in Spark
    Where should we use cache() and where persist()?
    Why do we need to call cache or persist on an RDD?



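    Cache and persist are both optimization techniques for Spark computations: they keep an intermediate result around so that later actions can reuse it instead of recomputing it from scratch.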

    On an RDD, cache() is a synonym for persist() with the MEMORY_ONLY storage level, i.e. with cache() intermediate results are kept in memory only, ready to be reused when needed.

    persist() marks an RDD for persistence using a storage level that you choose, which can be MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2 or MEMORY_AND_DISK_2.

    Just because you can cache an RDD in memory doesn’t mean you should blindly do so. Depending on how many times the dataset is accessed and how much work is involved in computing it, recomputation can be faster than paying the price of the increased memory pressure.

    It should go without saying that if you only read a dataset once there is no point in caching it; doing so will actually make your job slower.

    For detailed information on this topic, read Persistence and Caching in Apache Spark.

