What is Persistence in Apache Spark?
September 20, 2018 at 9:32 pm (#6378) by DataFlair Team
What do you mean by persistence?
Explain RDD persistence in Spark.
September 20, 2018 at 9:32 pm (#6379) by DataFlair Team
RDD persistence
RDD persistence in Spark is an optimization technique that saves the intermediate result of an RDD so it can be reused in later evaluations if required. This reduces computation time, which is especially helpful in iterative jobs where the same RDDs are computed over repeatedly.
An RDD can be persisted by two methods: cache() and persist().
For the cache() method, the default storage level is MEMORY_ONLY: when we persist an RDD, each node stores the partitions it computes in memory and reuses them in future actions on that RDD, hence speeding up the computation.
The persist() method accepts various storage levels: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER (RDD stored as serialized Java objects), MEMORY_AND_DISK_SER, and DISK_ONLY.
Spark's cache is fault tolerant: if any partition of an RDD is lost, it is recomputed using the transformations that originally created it.
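A minimal Scala sketch of both methods (the app name and sample data are illustrative): cache() uses MEMORY_ONLY implicitly, while persist() takes an explicit StorageLevel, here MEMORY_AND_DISK so that partitions which do not fit in memory spill to disk rather than being recomputed.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object RDDPersistenceExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RDDPersistenceExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // An RDD whose result we want to reuse across several actions.
    val squares = sc.parallelize(1 to 1000000).map(x => x.toLong * x)

    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY):
    // each node keeps the partitions it computes as deserialized
    // objects in executor memory.
    squares.cache()

    println(squares.count()) // first action: computes and stores the partitions
    println(squares.sum())   // second action: reuses the cached partitions

    // persist() takes an explicit storage level, e.g. spill partitions
    // that do not fit in memory to disk instead of recomputing them.
    val cubes = sc.parallelize(1 to 1000000).map(x => x.toLong * x * x)
    cubes.persist(StorageLevel.MEMORY_AND_DISK)
    println(cubes.count())

    // Release the stored partitions once they are no longer needed.
    squares.unpersist()
    cubes.unpersist()
    sc.stop()
  }
}

Note that unpersist() is worth calling once an RDD is no longer needed, since cached partitions otherwise occupy executor memory until evicted.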
For more on persistence and caching, refer to: RDD Persistence and Caching Mechanism in Apache Spark.