What is the difference between Caching and Persistence in Apache Spark?

This topic has 1 reply, 1 voice, and was last updated 5 years, 7 months ago by DataFlair Team.

Viewing 1 reply thread

Author

Posts
- September 20, 2018 at 4:49 pm #5951
  
  DataFlair Team
  Spectator
  
  Compare cache vs. persist in Spark.
  Where should we use cache() and where persist()?
  Why do we need to call cache() or persist() on an RDD?
- September 20, 2018 at 4:49 pm #5953
  
  DataFlair Team
  Spectator
  
  Cache and persist are both optimization techniques for Spark computations: they store an intermediate result so that later actions can reuse it instead of recomputing it.
  
  For an RDD, cache() is a synonym for persist() with the MEMORY_ONLY storage level; that is, cache() keeps intermediate results in memory only, so that they can be reused when needed.
  
  persist() marks an RDD for persistence with a storage level that you choose: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, or a replicated variant such as MEMORY_ONLY_2 or MEMORY_AND_DISK_2.
  
  Just because you can cache an RDD in memory doesn’t mean you should blindly do so. Depending on how many times the dataset is accessed and how much work it takes to compute, recomputation can be faster than paying the price of the increased memory pressure.
  
  It should go without saying that if you read a dataset only once, there is no point in caching it; doing so will actually make your job slower.
  
  For detailed information on this topic, read Persistence and Caching in Apache Spark.