What is meant by in-memory processing in Spark?


    • #6446
      DataFlair Team
      Spectator

      Define in-memory processing in Spark.
      What is the difference between the cache() and persist() methods in Apache Spark?
      List the various storage levels of the persist() method in Apache Spark.

    • #6447
      DataFlair Team
      Spectator

      In-Memory Processing in Spark
      In Apache Spark, in-memory computation means that instead of storing data on slow disk drives, the data is kept in random access memory (RAM) and processed in parallel. In-memory processing makes it possible to detect patterns and analyze large datasets quickly. Because the cost of RAM has fallen, this approach has become popular and very economical for applications. The main pillars of in-memory computation are:

      1. RAM storage
      2. Parallel distributed processing

      Keeping data in memory improves performance by an order of magnitude. RDDs, the main abstraction in Spark, are cached using the persist() or cache() method. With cache(), the whole RDD is stored in memory; partitions that do not fit are either recomputed when needed or spilled to disk. Once cached, an RDD can be reused without going back to disk, which reduces the space-time complexity of a job as well as the overhead of disk I/O.
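
      To make this concrete, here is a minimal sketch (in Scala, local mode, with made-up data) of caching an RDD so that later actions reuse the in-memory partitions instead of recomputing the lineage:

      import org.apache.spark.{SparkConf, SparkContext}

      object CacheSketch {
        def main(args: Array[String]): Unit = {
          val conf = new SparkConf().setAppName("CacheSketch").setMaster("local[*]")
          val sc = new SparkContext(conf)

          // An RDD with a non-trivial lineage: parallelize -> map
          val squares = sc.parallelize(1 to 1000000).map(n => n.toLong * n)

          squares.cache() // mark the RDD to be kept in memory

          println(squares.count())                // first action: computes and caches the partitions
          println(squares.take(5).mkString(", ")) // served from the cached partitions

          sc.stop()
        }
      }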

      Spark’s in-memory capability is well suited to micro-batch processing and machine learning, and it makes iterative jobs much faster, since each iteration can reuse data already held in RAM.
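
      For example, an iterative job (a sketch that assumes the SparkContext sc from the example above, with made-up data) scans the same cached RDD on every pass instead of rebuilding it from its lineage:

      // Assumes a live SparkContext named sc, as in the previous sketch
      val data = sc.parallelize(1 to 10000).map(_.toDouble).cache()

      var threshold = 0.0
      for (_ <- 1 to 10) {
        // Each pass reads the cached partitions from RAM,
        // not from the original lineage
        threshold = data.filter(_ > threshold).mean()
      }
      println(threshold)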

      RDDs can also be stored in memory with the persist() method and then reused across parallel operations. There is only one difference between cache() and persist(): cache() always uses the default storage level, MEMORY_ONLY, while persist() lets us choose from several storage levels.
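
      A short sketch of the difference (again assuming a live SparkContext sc): cache() takes no arguments, while persist() accepts an explicit StorageLevel:

      import org.apache.spark.storage.StorageLevel

      val nums = sc.parallelize(1 to 100)

      // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
      val a = nums.map(_ * 2).cache()

      // persist() lets us choose the storage level explicitly
      val b = nums.map(_ * 3).persist(StorageLevel.MEMORY_AND_DISK_SER)

      println(a.getStorageLevel) // memory only, deserialized
      println(b.getStorageLevel) // memory and disk, serialized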

      Storage levels of RDD persist() in Spark
      1. MEMORY_ONLY
      The RDD is stored as deserialized Java objects in the JVM. If the RDD does not fit in memory, the partitions that do not fit are recomputed each time they are needed.

      2. MEMORY_AND_DISK
      The RDD is stored as deserialized Java objects in the JVM. If the full RDD does not fit in memory, the remaining partitions are stored on disk instead of being recomputed every time they are needed.

      3. MEMORY_ONLY_SER
      Stores the RDD as serialized Java objects, one byte array per partition. It is like MEMORY_ONLY but more space-efficient, especially when a fast serializer is used.

      4. MEMORY_AND_DISK_SER
      Stores the RDD as serialized Java objects. If the full RDD does not fit in memory, the remaining partitions are stored on disk instead of being recomputed every time they are needed.

      5. DISK_ONLY
      Stores the RDD partitions only on disk.

      6. MEMORY_ONLY_2 and MEMORY_AND_DISK_2
      The same as MEMORY_ONLY and MEMORY_AND_DISK, except that each partition is replicated on two nodes in the cluster.
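
      The levels above map one-to-one to constants on org.apache.spark.storage.StorageLevel. One caveat, shown in this sketch (again assuming a live SparkContext sc): an RDD's storage level can only be assigned once, so it must be unpersisted before a different level can be chosen:

      import org.apache.spark.storage.StorageLevel

      val rdd = sc.parallelize(1 to 100)

      rdd.persist(StorageLevel.MEMORY_AND_DISK)
      println(rdd.getStorageLevel) // prints the currently assigned level

      // Calling persist() with a different level on an already-persisted RDD
      // throws an exception, so release the old level first
      rdd.unpersist()
      rdd.persist(StorageLevel.MEMORY_ONLY_2) // replicated on two nodes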

      Follow this link to learn more: Spark RDD persistence and caching mechanism.

      To learn more about in-memory computing, follow this link: In-Memory Computing – A Beginners Guide
