What is the difference between Caching and Persistence in Apache Spark?


    • #5951
      DataFlair Team
      Spectator

      Compare Cache vs Persist in Spark
      Where should we use cache() and where persist()?
      Why do we need to call cache or persist on an RDD?

    • #5953
      DataFlair Team
      Spectator

      Both cache() and persist() are optimization techniques for Spark computations: they keep an intermediate result so that later actions can reuse it instead of recomputing it.

      For an RDD, cache() is a synonym for persist() with the MEMORY_ONLY storage level; that is, cache() keeps intermediate results in memory only, so that later computations can reuse them.

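      persist() marks an RDD for persistence with a storage level that you choose: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, or MEMORY_AND_DISK_2. The RDD is only marked at this point; it is actually stored the first time an action computes it.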
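
The level names encode three choices: memory and/or disk, deserialized objects or serialized bytes, and the number of replicas. As a rough illustration, here is a plain-Python model of that naming scheme. It is not Spark code: `Level`, `LEVELS`, and `ToyRDD` are hypothetical names invented for this sketch, and it only mirrors how cache() delegates to persist() with MEMORY_ONLY.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    """Toy stand-in for a Spark storage level (hypothetical, not the real class)."""
    use_memory: bool
    use_disk: bool
    serialized: bool  # True: kept as serialized bytes; False: kept as objects
    replicas: int = 1


# What each level name from the post stands for.
LEVELS = {
    "MEMORY_ONLY":         Level(True,  False, False),
    "MEMORY_AND_DISK":     Level(True,  True,  False),
    "MEMORY_ONLY_SER":     Level(True,  False, True),
    "MEMORY_AND_DISK_SER": Level(True,  True,  True),
    "DISK_ONLY":           Level(False, True,  True),
    "MEMORY_ONLY_2":       Level(True,  False, False, replicas=2),
    "MEMORY_AND_DISK_2":   Level(True,  True,  False, replicas=2),
}


class ToyRDD:
    """Toy dataset that only records which level was requested."""

    def __init__(self):
        self.level = None  # no persistence requested yet

    def persist(self, name="MEMORY_ONLY"):
        # Only marks the dataset; nothing is computed or stored here.
        self.level = LEVELS[name]
        return self

    def cache(self):
        # cache() is shorthand for persist() with MEMORY_ONLY.
        return self.persist("MEMORY_ONLY")


print(ToyRDD().cache().level == ToyRDD().persist("MEMORY_ONLY").level)  # True
```

Under this reading, cache() is the convenient default, and persist() is the call to reach for when the data may not fit in memory or should be replicated.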

      Just because you can cache an RDD in memory doesn’t mean you should blindly do so. Depending on how many times the dataset is accessed and how much work it takes to compute, recomputation can be faster than paying the price of the increased memory pressure.

      It should go without saying that if you only read a dataset once, there is no point in caching it; doing so will actually make your job slower.
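
The trade-off can be seen in a small plain-Python sketch. This is a toy model under stated assumptions, not Spark itself: `LazyDataset` and `run` are hypothetical names, and "work" is simply counted as the number of times the transformation function is applied.

```python
class LazyDataset:
    """Toy lazily evaluated dataset (hypothetical; not Spark's RDD)."""

    def __init__(self, source, fn):
        self.source = source
        self.fn = fn
        self.evaluations = 0   # how many times fn has been applied
        self._keep = False     # set by cache()
        self._stored = None    # filled by the first action after cache()

    def cache(self):
        # Only a mark; nothing is computed until an action runs.
        self._keep = True
        return self

    def _compute(self):
        if self._stored is not None:
            return self._stored  # reuse the stored result
        result = []
        for x in self.source:
            self.evaluations += 1
            result.append(self.fn(x))
        if self._keep:
            self._stored = result  # costs memory for as long as it is held
        return result

    def total(self):
        # An "action": forces the dataset to be computed.
        return sum(self._compute())


def run(actions, cached):
    """Run the same action several times and report the work done."""
    ds = LazyDataset(range(1000), lambda x: x * x)
    if cached:
        ds.cache()
    for _ in range(actions):
        ds.total()
    return ds.evaluations


print(run(actions=3, cached=False))  # 3000: recomputed for every action
print(run(actions=3, cached=True))   # 1000: computed once, then reused
print(run(actions=1, cached=True))   # 1000: no work saved, memory still used
```

With a single action the cached run does the same amount of computation as the uncached one while also holding every result in memory, which is the situation the post warns about.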

      For detailed information on this topic, read Persistence and Caching in Apache Spark.
