What is Persistence in Apache Spark?

    • #6378
      DataFlair Team

      What do you mean by persistence?
      Explain RDD Persistence in Spark.

    • #6379
      DataFlair Team

      RDD persistence
      RDD persistence in Spark is an optimization technique that saves the intermediate result of an RDD so it can be reused in later evaluations if required, thereby reducing computation time. It is especially helpful in iterative jobs, where the same RDDs are computed over and over.
      An RDD can be persisted through two methods: cache() and persist().
      The cache() method uses the default storage level, MEMORY_ONLY: when we persist an RDD this way, each node stores the partitions it computes in memory and reuses them in subsequent actions, which speeds up the computation.
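      A minimal sketch in Scala illustrating cache(); the SparkContext setup, input path, and variable names here are illustrative assumptions, not part of the original answer:

      import org.apache.spark.{SparkConf, SparkContext}

      val conf = new SparkConf().setAppName("CacheExample").setMaster("local[*]")
      val sc = new SparkContext(conf)

      // The file path is a placeholder; cache() marks the RDD for storage
      // at the default MEMORY_ONLY level the first time an action computes it
      val lines = sc.textFile("input.txt")
      val errors = lines.filter(_.contains("ERROR"))
      errors.cache()

      val errorCount = errors.count()   // first action computes and caches the partitions
      val sample = errors.take(10)      // reuses the cached partitions instead of re-reading the file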

      The persist() method accepts various storage levels:
      MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER (RDD stored as serialized Java objects), MEMORY_AND_DISK_SER, and DISK_ONLY.
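      The sketch below, reusing the hypothetical sc from the previous example, shows persist() with an explicit storage level; unpersist() releases the stored blocks when they are no longer needed:

      import org.apache.spark.storage.StorageLevel

      val pairs = sc.parallelize(1 to 1000000).map(x => (x % 10, x))

      // Keep partitions in memory, spilling to disk if they do not fit
      pairs.persist(StorageLevel.MEMORY_AND_DISK)

      val sums = pairs.reduceByKey(_ + _).collect()   // first action materializes the persisted RDD
      val counts = pairs.countByKey()                  // reuses the persisted partitions

      pairs.unpersist()   // drop the cached blocks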

      The Spark cache is fault tolerant: if any partition of a persisted RDD is lost, it is recomputed using the transformations that originally created it.

      For more on persistence and caching, refer to: RDD Persistence and Caching Mechanism in Apache Spark
