What do you mean by persistence in Apache Spark?

    • #5917
      DataFlair Team

      What is Persistence in Apache Spark?
      Explain RDD Persistence in Spark.

    • #5918
      DataFlair Team

      Caching and persistence are optimization techniques for Spark computations. They save intermediate partial results so those results can be reused in subsequent stages and further transformations. These intermediate results are kept as RDDs in memory (by default) or in more durable storage such as disk.

      RDDs can be cached using the cache() operation. They can also be persisted using the persist() operation.

      Spark provides five main storage levels:

      1- MEMORY_ONLY—Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they’re needed. This is the default level.

      2- MEMORY_ONLY_SER—Store RDD as serialized Java objects (one-byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.

      3- MEMORY_AND_DISK—Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don’t fit on the disk, and read them from there when they’re needed.

      4- MEMORY_AND_DISK_SER—Similar to MEMORY_ONLY_SER, but spill partitions that don’t fit in memory to disk instead of recomputing them on the fly each time they’re needed.

      5- DISK_ONLY—Store the RDD partitions only on disk.

      cache() always uses MEMORY_ONLY.
      persist(StorageLevel.<type>) lets you choose any of the levels above.

      By default, persist() stores the data in the JVM heap as unserialized objects, i.e. at the MEMORY_ONLY level, which is also the storage level cache() uses.
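
      A minimal Scala sketch of both calls (the application name and input path are hypothetical, used only for illustration):

      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.storage.StorageLevel

      val conf = new SparkConf().setAppName("PersistenceExample").setMaster("local[*]")
      val sc = new SparkContext(conf)

      // hypothetical input file
      val lines = sc.textFile("hdfs:///data/input.txt")

      // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
      val words = lines.flatMap(_.split(" ")).cache()

      // persist() lets you pick any of the storage levels listed above
      val lengths = lines.map(_.length).persist(StorageLevel.MEMORY_AND_DISK_SER)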

      If a dataset will be accessed more than once, the RDD should be cached so that it does not have to be recomputed each time. If you read a dataset only once, there is no point in caching it; doing so will actually make the job slower.
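
      As a rough sketch of that point (reusing the SparkContext sc from the sketch above; the log path is hypothetical):

      // The filtered RDD feeds two actions; caching it means the
      // textFile + filter lineage is computed only once.
      val errors = sc.textFile("hdfs:///logs/app.log")
        .filter(_.contains("ERROR"))
        .cache()

      val total  = errors.count()   // first action: computes and caches the partitions
      val sample = errors.take(10)  // second action: served from the cached partitions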

      Because the difference between caching and persisting an RDD is very small and purely syntactic, the two terms are often used interchangeably.

      When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset.

      We can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
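
      A small sketch of that lifecycle, again assuming the SparkContext sc from above and a hypothetical input path:

      import org.apache.spark.storage.StorageLevel

      val ratings = sc.textFile("hdfs:///data/ratings.csv")
        .map(_.split(","))
        .persist(StorageLevel.MEMORY_ONLY)  // only marks the RDD; nothing is stored yet

      ratings.count()    // first action: partitions are computed and kept in memory
      ratings.first()    // later actions reuse the cached partitions; a lost partition
                         // is recomputed from its lineage automatically

      ratings.unpersist()  // release the storage once the RDD is no longer needed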

      Also, each persisted RDD can be stored using a different storage level, allowing you, for example, to persist one dataset on disk and another in memory as serialized objects.

      Spark automatically persists some intermediate data in shuffle operations (e.g. reduceByKey), even without users calling persist. This is done to avoid recomputing the entire input if a node fails during the shuffle. However, best practice recommends that users call persist on the resulting RDD if it will be reused.
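
      A sketch of that recommendation, with the same assumed sc and hypothetical paths:

      import org.apache.spark.storage.StorageLevel

      val pairs = sc.textFile("hdfs:///data/words.txt")
        .flatMap(_.split(" "))
        .map(word => (word, 1))

      // reduceByKey triggers a shuffle; Spark keeps some intermediate shuffle data
      // automatically, but explicitly persisting the result is the recommended
      // practice when it will be reused by several actions.
      val counts = pairs.reduceByKey(_ + _).persist(StorageLevel.MEMORY_AND_DISK)

      counts.count()
      counts.saveAsTextFile("hdfs:///data/word-counts")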

      For more on in-memory computation, please refer to In-memory computation in Spark.
