What is Persistence in Apache Spark?

    • #6378
      DataFlair Team

      What do you mean by persistence?
      Explain RDD Persistence in Spark.

    • #6379
      DataFlair Team

      RDD persistence
      RDD persistence in Spark is an optimization technique that saves the intermediate result of an RDD so it can be reused in later evaluations if required, thereby reducing computation time. It is especially helpful in iterative jobs, where the same RDDs are computed over and over.
      An RDD can be persisted through two methods: cache() and persist().
      The cache() method uses the default storage level, MEMORY_ONLY: when we persist an RDD this way, each node stores the partitions it computes in memory and reuses them in subsequent actions, which speeds up the computation.
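      A minimal sketch in Scala illustrating cache(); the SparkContext setup, input path, and variable names here are illustrative assumptions, not part of the original answer:

      import org.apache.spark.{SparkConf, SparkContext}

      val conf = new SparkConf().setAppName("CacheExample").setMaster("local[*]")
      val sc = new SparkContext(conf)

      // The file path is a placeholder; cache() marks the RDD for storage
      // at the default MEMORY_ONLY level the first time an action computes it
      val lines = sc.textFile("input.txt")
      val errors = lines.filter(_.contains("ERROR"))
      errors.cache()

      val errorCount = errors.count()   // first action computes and caches the partitions
      val sample = errors.take(10)      // reuses the cached partitions instead of re-reading the file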

      The persist() method accepts various storage levels:
      MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER (RDD stored as serialized Java objects), MEMORY_AND_DISK_SER, and DISK_ONLY.
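      The sketch below, reusing the hypothetical sc from the previous example, shows persist() with an explicit storage level; unpersist() releases the stored blocks when they are no longer needed:

      import org.apache.spark.storage.StorageLevel

      val pairs = sc.parallelize(1 to 1000000).map(x => (x % 10, x))

      // Keep partitions in memory, spilling to disk if they do not fit
      pairs.persist(StorageLevel.MEMORY_AND_DISK)

      val sums = pairs.reduceByKey(_ + _).collect()   // first action materializes the persisted RDD
      val counts = pairs.countByKey()                  // reuses the persisted partitions

      pairs.unpersist()   // drop the cached blocks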

      The Spark cache is fault tolerant: if any partition of a persisted RDD is lost, it is recomputed using the transformations that originally created it.

      For more on persistence and caching, refer to: RDD Persistence and Caching Mechanism in Apache Spark
