What do you mean by persistence in Apache Spark?

    • #5917
      DataFlair Team

      What is Persistence in Apache Spark?
      Explain RDD Persistence in Spark.

    • #5918
      DataFlair Team

      Caching and persistence are optimization techniques for Spark computations. They save intermediate partial results so those results can be reused in subsequent stages and further transformations. These intermediate results are kept as RDDs in memory (by default) or in more durable storage such as disk.

      RDDs can be cached using the cache() operation. They can also be persisted using the persist() operation.

      Spark provides five main storage levels:

      1- MEMORY_ONLY—Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they’re needed. This is the default level.

      2- MEMORY_ONLY_SER—Store RDD as serialized Java objects (one-byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.

      3- MEMORY_AND_DISK—Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don’t fit on the disk, and read them from there when they’re needed.

      4- MEMORY_AND_DISK_SER—Similar to MEMORY_ONLY_SER, but spill partitions that don’t fit in memory to disk instead of recomputing them on the fly each time they’re needed.

      5- DISK_ONLY—Store the RDD partitions only on disk.

      cache() always uses MEMORY_ONLY.
      persist(StorageLevel.<type>) lets you choose any of the levels above.

      By default, persist() stores the data in the JVM heap as unserialized objects, i.e. at the MEMORY_ONLY level, which is also the storage level cache() uses.
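
      A minimal Scala sketch of both calls (the application name and input path are hypothetical, used only for illustration):

      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.storage.StorageLevel

      val conf = new SparkConf().setAppName("PersistenceExample").setMaster("local[*]")
      val sc = new SparkContext(conf)

      // hypothetical input file
      val lines = sc.textFile("hdfs:///data/input.txt")

      // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
      val words = lines.flatMap(_.split(" ")).cache()

      // persist() lets you pick any of the storage levels listed above
      val lengths = lines.map(_.length).persist(StorageLevel.MEMORY_AND_DISK_SER)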

      If a dataset will be accessed more than once, the RDD should be cached so that it does not have to be recomputed each time. If you read a dataset only once, there is no point in caching it; doing so will actually make the job slower.
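
      As a rough sketch of that point (reusing the SparkContext sc from the sketch above; the log path is hypothetical):

      // The filtered RDD feeds two actions; caching it means the
      // textFile + filter lineage is computed only once.
      val errors = sc.textFile("hdfs:///logs/app.log")
        .filter(_.contains("ERROR"))
        .cache()

      val total  = errors.count()   // first action: computes and caches the partitions
      val sample = errors.take(10)  // second action: served from the cached partitions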

      Because the difference between caching and persisting an RDD is very small and purely syntactic, the two terms are often used interchangeably.

      When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset.

      We can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
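
      A small sketch of that lifecycle, again assuming the SparkContext sc from above and a hypothetical input path:

      import org.apache.spark.storage.StorageLevel

      val ratings = sc.textFile("hdfs:///data/ratings.csv")
        .map(_.split(","))
        .persist(StorageLevel.MEMORY_ONLY)  // only marks the RDD; nothing is stored yet

      ratings.count()    // first action: partitions are computed and kept in memory
      ratings.first()    // later actions reuse the cached partitions; a lost partition
                         // is recomputed from its lineage automatically

      ratings.unpersist()  // release the storage once the RDD is no longer needed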

      Also, each persisted RDD can be stored using a different storage level, allowing you, for example, to persist one dataset on disk and another in memory as serialized objects.

      Spark automatically persists some intermediate data in shuffle operations (e.g. reduceByKey), even without users calling persist. This is done to avoid recomputing the entire input if a node fails during the shuffle. However, best practice recommends that users call persist on the resulting RDD if it will be reused.
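
      A sketch of that recommendation, with the same assumed sc and hypothetical paths:

      import org.apache.spark.storage.StorageLevel

      val pairs = sc.textFile("hdfs:///data/words.txt")
        .flatMap(_.split(" "))
        .map(word => (word, 1))

      // reduceByKey triggers a shuffle; Spark keeps some intermediate shuffle data
      // automatically, but explicitly persisting the result is the recommended
      // practice when it will be reused by several actions.
      val counts = pairs.reduceByKey(_ + _).persist(StorageLevel.MEMORY_AND_DISK)

      counts.count()
      counts.saveAsTextFile("hdfs:///data/word-counts")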

      For more on in-memory computation, please refer to In-memory computation in Spark.
