Define in-memory processing in Spark

Viewing 1 reply thread
  • Author
    Posts
    • #6429
      DataFlair Team
      Spectator

      What is meant by in-memory processing in Spark?

    • #6430
      DataFlair Team
      Spectator

      First, let's understand in-memory computing in general.
      In-memory computing means that data is kept in random access memory (RAM) instead of on slow disk drives, and is processed in parallel. This makes it possible to detect patterns and analyze large datasets quickly. As the cost of memory has fallen, the approach has become economical for many applications, which is why it has grown popular. In-memory computing rests on two main pillars:

      1. RAM storage
      2. Parallel distributed processing

      Now let's discuss in-memory computing in Apache Spark.

      Storing data in memory improves performance by an order of magnitude. Spark's main abstraction is the RDD (Resilient Distributed Dataset), and an RDD is cached by calling its cache() or persist() method.
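      As a minimal sketch of what this looks like (assuming a running SparkSession named `spark` and a hypothetical input path):

      ```scala
      // Minimal sketch: caching an RDD so repeated actions reuse the
      // in-memory copy instead of re-reading from disk.
      // Assumes a SparkSession called `spark` is already available;
      // the input path is hypothetical.
      val lines = spark.sparkContext.textFile("hdfs:///data/logs.txt")
      val errors = lines.filter(_.contains("ERROR"))

      errors.cache() // mark the RDD for in-memory storage (MEMORY_ONLY)

      // The first action computes the partitions and caches them...
      val total = errors.count()
      // ...subsequent actions read them straight from RAM.
      val fatal = errors.filter(_.contains("FATAL")).count()
      ```

      Note that cache() is lazy: nothing is stored until the first action (here, count()) actually materializes the RDD.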

      When we use the cache() method, the RDD's partitions are stored in memory. Data that does not fit in memory is either recomputed from its lineage when needed or, depending on the storage level, spilled to disk. Cached RDDs can then be reused whenever we want without going back to disk, which reduces disk I/O overhead and execution time.

      Spark's in-memory capability is especially valuable for machine learning and micro-batch processing, since iterative jobs that reuse the same data execute much faster.

      RDDs can also be stored in memory with the persist() method, and the persisted data can be reused across parallel operations. There is only one difference between cache() and persist(): with persist() we can choose among various storage levels, while cache() always uses the default storage level, MEMORY_ONLY.
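      The difference can be sketched as follows (again assuming a SparkSession named `spark` is available):

      ```scala
      import org.apache.spark.storage.StorageLevel

      val rdd = spark.sparkContext.parallelize(1 to 1000000)

      // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY):
      // partitions that don't fit in RAM are recomputed when needed.
      rdd.cache()
      rdd.unpersist() // drop the cached copy before re-persisting

      // persist() lets us pick a different storage level, e.g. spill
      // excess partitions to disk instead of recomputing them.
      rdd.persist(StorageLevel.MEMORY_AND_DISK)
      ```

      Choosing MEMORY_AND_DISK trades some disk I/O for avoiding recomputation, which pays off when the RDD is expensive to rebuild.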

      To understand more, check the link: Spark In-Memory Computing – A Beginners Guide
