Learn PySpark StorageLevel With Example

1. Objective

Today, in this PySpark article, we will learn the whole concept of PySpark StorageLevel in depth. When it comes to storing an RDD, the StorageLevel in Spark decides how and where it should be stored. So, let's learn about storage levels using PySpark, and walk through an example of StorageLevel in PySpark to understand it well.
So, let’s start PySpark StorageLevel.

2. What is PySpark StorageLevel?

PySpark StorageLevel decides how an RDD should be stored in Apache Spark: whether the RDD should be kept in memory, stored on disk, or both. It also decides whether to serialize the RDD and whether to replicate the RDD partitions across multiple nodes. In addition, it contains static constants for some commonly used storage levels, such as MEMORY_ONLY.
So, here is the class definition of PySpark StorageLevel:

class pyspark.StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication = 1)
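To get a feel for what the five constructor flags mean, here is a rough, pure-Python sketch (not the actual PySpark source) of how they compose into the description string Spark prints for a storage level; the `describe` helper is a hypothetical illustration:

```python
def describe(use_disk, use_memory, use_off_heap, deserialized, replication=1):
    """Roughly mimic how PySpark renders a StorageLevel as text."""
    parts = []
    if use_disk:
        parts.append("Disk")
    if use_memory:
        parts.append("Memory")
    if use_off_heap:
        parts.append("OffHeap")
    parts.append("Deserialized" if deserialized else "Serialized")
    parts.append("%dx Replicated" % replication)
    return " ".join(parts)

# The flags of MEMORY_AND_DISK_2: disk + memory, serialized, 2 replicas.
print(describe(True, True, False, False, 2))
```

Running this prints "Disk Memory Serialized 2x Replicated", which is exactly the shape of output we will see from `getStorageLevel()` in the example at the end of this article.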

3. Class Variables

Hence, there are several different PySpark StorageLevels to choose from when deciding how an RDD should be stored, such as:

  • DISK_ONLY

StorageLevel(True, False, False, False, 1)

  • DISK_ONLY_2

StorageLevel(True, False, False, False, 2)

  • MEMORY_AND_DISK

StorageLevel(True, True, False, False, 1)

  • MEMORY_AND_DISK_2

StorageLevel(True, True, False, False, 2)

  • MEMORY_AND_DISK_SER

StorageLevel(True, True, False, False, 1)


  • MEMORY_AND_DISK_SER_2

StorageLevel(True, True, False, False, 2)

  • MEMORY_ONLY

StorageLevel(False, True, False, False, 1)

  • MEMORY_ONLY_2

StorageLevel(False, True, False, False, 2)

  • MEMORY_ONLY_SER

StorageLevel(False, True, False, False, 1)

  • MEMORY_ONLY_SER_2

StorageLevel(False, True, False, False, 2)

  • OFF_HEAP

StorageLevel(True, True, True, False, 1)
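To see the pattern in the list above more clearly, here is a small pure-Python sketch (it does not import PySpark) that transcribes each level as a (useDisk, useMemory, useOffHeap, deserialized, replication) tuple and checks that every `_2` variant differs from its base level only in the replication count:

```python
# Each level as (useDisk, useMemory, useOffHeap, deserialized, replication),
# transcribed from the list above.
LEVELS = {
    "DISK_ONLY":             (True,  False, False, False, 1),
    "DISK_ONLY_2":           (True,  False, False, False, 2),
    "MEMORY_ONLY":           (False, True,  False, False, 1),
    "MEMORY_ONLY_2":         (False, True,  False, False, 2),
    "MEMORY_ONLY_SER":       (False, True,  False, False, 1),
    "MEMORY_ONLY_SER_2":     (False, True,  False, False, 2),
    "MEMORY_AND_DISK":       (True,  True,  False, False, 1),
    "MEMORY_AND_DISK_2":     (True,  True,  False, False, 2),
    "MEMORY_AND_DISK_SER":   (True,  True,  False, False, 1),
    "MEMORY_AND_DISK_SER_2": (True,  True,  False, False, 2),
    "OFF_HEAP":              (True,  True,  True,  False, 1),
}

for name, flags in LEVELS.items():
    if name.endswith("_2"):
        base = LEVELS[name[:-2]]
        # A _2 variant keeps the same storage flags but replicates twice.
        assert flags[:4] == base[:4] and flags[4] == 2
print("all _2 variants differ only in replication")
```

Note that in PySpark, data is always serialized on the Python side, so the `deserialized` flag is False in every level and the `_SER` variants share the same flags as their non-`_SER` counterparts.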

4. Instance Methods

__init__(self, useDisk, useMemory, useOffHeap, deserialized, replication=1)

i. Source Code for Module pyspark.storagelevel

__all__ = ["StorageLevel"]

class StorageLevel:
    """
    Flags for controlling the storage of an RDD. Each StorageLevel records
    whether to use memory, whether to drop the RDD to disk if it falls out
    of memory, whether to keep the data in memory in a serialized format,
    and whether to replicate the RDD partitions on multiple nodes.
    Also contains static constants for some commonly used storage levels,
    such as MEMORY_ONLY.
    """
    def __init__(self, useDisk, useMemory, useOffHeap, deserialized, replication=1):
        self.useDisk = useDisk
        self.useMemory = useMemory
        self.useOffHeap = useOffHeap
        self.deserialized = deserialized
        self.replication = replication

StorageLevel.DISK_ONLY = StorageLevel(True, False, False, False)
StorageLevel.DISK_ONLY_2 = StorageLevel(True, False, False, False, 2)
StorageLevel.MEMORY_ONLY = StorageLevel(False, True, False, False)
StorageLevel.MEMORY_ONLY_2 = StorageLevel(False, True, False, False, 2)
StorageLevel.MEMORY_ONLY_SER = StorageLevel(False, True, False, False)
StorageLevel.MEMORY_ONLY_SER_2 = StorageLevel(False, True, False, False, 2)
StorageLevel.MEMORY_AND_DISK = StorageLevel(True, True, False, False)
StorageLevel.MEMORY_AND_DISK_2 = StorageLevel(True, True, False, False, 2)
StorageLevel.MEMORY_AND_DISK_SER = StorageLevel(True, True, False, False)
StorageLevel.MEMORY_AND_DISK_SER_2 = StorageLevel(True, True, False, False, 2)
StorageLevel.OFF_HEAP = StorageLevel(True, True, True, False, 1)
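The trick in the source above, creating the commonly used configurations as plain instances and attaching them to the class after its definition, works for any Python class. A minimal, hypothetical sketch of the same pattern (the `Level` class is a toy stand-in, not part of PySpark):

```python
class Level:
    """Toy stand-in for StorageLevel, just to show the constant pattern."""
    def __init__(self, use_disk, use_memory, replication=1):
        self.use_disk = use_disk
        self.use_memory = use_memory
        self.replication = replication

# Attach commonly used configurations as class attributes after the class
# body, exactly as pyspark.storagelevel does for StorageLevel.
Level.DISK_ONLY = Level(True, False)
Level.MEMORY_ONLY_2 = Level(False, True, 2)

print(Level.MEMORY_ONLY_2.replication)  # prints 2
```

Because the constants are ordinary instances, `Level.DISK_ONLY` and a freshly constructed `Level(True, False)` describe the same configuration; the class attributes simply give the common cases memorable names.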

5. Example of PySpark StorageLevel

So, let's see an example of StorageLevel in PySpark: MEMORY_AND_DISK_2, which stores RDD partitions on both memory and disk with a replication factor of 2.

------------------------------------storagelevel.py-------------------------------------
from pyspark import SparkContext
import pyspark

sc = SparkContext("local", "storagelevel app")
rdd1 = sc.parallelize([1, 2])
# Persist on both memory and disk, replicated on two nodes.
rdd1.persist(pyspark.StorageLevel.MEMORY_AND_DISK_2)
print(rdd1.getStorageLevel())
------------------------------------storagelevel.py-------------------------------------

Command:

$SPARK_HOME/bin/spark-submit storagelevel.py

Output:
Disk Memory Serialized 2x Replicated
So, this was all about PySpark StorageLevel. Hope you like our explanation.

6. Conclusion

Hence, we have seen PySpark StorageLevel in detail. Moreover, we discussed a PySpark StorageLevel example, as well as the class variables and instance methods of StorageLevel in PySpark. Still, if any doubt occurs, please ask through the comment tab.
