What is meant by in-memory processing in Spark?


    • #6446
      DataFlair Team
      Spectator

      Define in-memory processing in Spark.
      What is the difference between the cache() and persist() methods in Apache Spark?
      List the various storage levels of the persist() method in Apache Spark.

    • #6447
      DataFlair Team
      Spectator

      In-Memory Processing in Spark
      In Apache Spark, in-memory computation means that instead of storing data on slow disk drives, the data is kept in random access memory (RAM) and processed in parallel. In-memory processing makes it possible to detect patterns and analyze large datasets quickly. Because the cost of RAM has fallen, this approach has become popular and very economical for applications. The main pillars of in-memory computation are:

      1. RAM storage
      2. Parallel distributed processing

      Keeping data in memory improves performance by an order of magnitude. RDDs, the main abstraction in Spark, are cached using the persist() or cache() method. With cache(), the whole RDD is stored in memory; partitions that do not fit are either recomputed when needed or spilled to disk. Once cached, an RDD can be reused without going back to disk, which reduces the space-time complexity of a job as well as the overhead of disk I/O.
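
      To make this concrete, here is a minimal sketch (in Scala, local mode, with made-up data) of caching an RDD so that later actions reuse the in-memory partitions instead of recomputing the lineage:

      import org.apache.spark.{SparkConf, SparkContext}

      object CacheSketch {
        def main(args: Array[String]): Unit = {
          val conf = new SparkConf().setAppName("CacheSketch").setMaster("local[*]")
          val sc = new SparkContext(conf)

          // An RDD with a non-trivial lineage: parallelize -> map
          val squares = sc.parallelize(1 to 1000000).map(n => n.toLong * n)

          squares.cache() // mark the RDD to be kept in memory

          println(squares.count())                // first action: computes and caches the partitions
          println(squares.take(5).mkString(", ")) // served from the cached partitions

          sc.stop()
        }
      }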

      Spark’s in-memory capability is well suited to micro-batch processing and machine learning, and it makes iterative jobs much faster, since each iteration can reuse data already held in RAM.
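
      For example, an iterative job (a sketch that assumes the SparkContext sc from the example above, with made-up data) scans the same cached RDD on every pass instead of rebuilding it from its lineage:

      // Assumes a live SparkContext named sc, as in the previous sketch
      val data = sc.parallelize(1 to 10000).map(_.toDouble).cache()

      var threshold = 0.0
      for (_ <- 1 to 10) {
        // Each pass reads the cached partitions from RAM,
        // not from the original lineage
        threshold = data.filter(_ > threshold).mean()
      }
      println(threshold)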

      RDDs can also be stored in memory with the persist() method and then reused across parallel operations. There is only one difference between cache() and persist(): cache() always uses the default storage level, MEMORY_ONLY, while persist() lets us choose from several storage levels.
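
      A short sketch of the difference (again assuming a live SparkContext sc): cache() takes no arguments, while persist() accepts an explicit StorageLevel:

      import org.apache.spark.storage.StorageLevel

      val nums = sc.parallelize(1 to 100)

      // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
      val a = nums.map(_ * 2).cache()

      // persist() lets us choose the storage level explicitly
      val b = nums.map(_ * 3).persist(StorageLevel.MEMORY_AND_DISK_SER)

      println(a.getStorageLevel) // memory only, deserialized
      println(b.getStorageLevel) // memory and disk, serialized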

      Storage levels of RDD persist() in Spark
      1. MEMORY_ONLY
      The RDD is stored as deserialized Java objects in the JVM. If the RDD does not fit in memory, the partitions that do not fit are recomputed each time they are needed.

      2. MEMORY_AND_DISK
      The RDD is stored as deserialized Java objects in the JVM. If the full RDD does not fit in memory, the remaining partitions are stored on disk instead of being recomputed every time they are needed.

      3. MEMORY_ONLY_SER
      Stores the RDD as serialized Java objects, one byte array per partition. It is like MEMORY_ONLY but more space-efficient, especially when a fast serializer is used.

      4. MEMORY_AND_DISK_SER
      Stores the RDD as serialized Java objects. If the full RDD does not fit in memory, the remaining partitions are stored on disk instead of being recomputed every time they are needed.

      5. DISK_ONLY
      Stores the RDD partitions only on disk.

      6. MEMORY_ONLY_2 and MEMORY_AND_DISK_2
      The same as MEMORY_ONLY and MEMORY_AND_DISK, except that each partition is replicated on two nodes in the cluster.
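
      The levels above map one-to-one to constants on org.apache.spark.storage.StorageLevel. One caveat, shown in this sketch (again assuming a live SparkContext sc): an RDD's storage level can only be assigned once, so it must be unpersisted before a different level can be chosen:

      import org.apache.spark.storage.StorageLevel

      val rdd = sc.parallelize(1 to 100)

      rdd.persist(StorageLevel.MEMORY_AND_DISK)
      println(rdd.getStorageLevel) // prints the currently assigned level

      // Calling persist() with a different level on an already-persisted RDD
      // throws an exception, so release the old level first
      rdd.unpersist()
      rdd.persist(StorageLevel.MEMORY_ONLY_2) // replicated on two nodes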

      Follow this link to learn more: Spark RDD persistence and caching mechanism.

      To learn more about in-memory computing, follow this link: In-Memory Computing – A Beginners Guide
