Learn PySpark StorageLevel With Example
Today, in this PySpark article, we will cover the whole concept of PySpark StorageLevel in depth. Basically, when it comes to storing an RDD, StorageLevel in Spark decides how and where it should be stored.
So, let’s learn about storage levels using PySpark. Also, we will walk through an example of StorageLevel in PySpark to understand it well.
So, let’s start with PySpark StorageLevel.
What is PySpark StorageLevel?
Well, PySpark StorageLevel decides how an RDD should be stored in Apache Spark: whether the RDD should be kept in memory, stored on disk, or both.
In addition, it decides whether to serialize the RDD and whether to replicate the RDD partitions across multiple nodes. Moreover, it contains static constants for some commonly used PySpark StorageLevels, such as MEMORY_ONLY.
So, here is the class definition of a PySpark StorageLevel:
class pyspark.StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication=1)
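Besides the predefined constants listed below, this constructor can also be called directly to build a custom level. Here is a minimal sketch; the flag combination chosen (disk-only storage with 3 replicas) is just an illustration:

import pyspark

# Hypothetical custom level: store on disk only, keep 3 replicas of each partition.
custom_level = pyspark.StorageLevel(True, False, False, False, 3)
print(custom_level.useDisk)      # True
print(custom_level.replication)  # 3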
Class Variables
Hence, there are different PySpark StorageLevels to decide how an RDD is stored, listed below (a short snippet inspecting these constants follows the list):
- DISK_ONLY
StorageLevel(True, False, False, False, 1)
- DISK_ONLY_2
StorageLevel(True, False, False, False, 2)
- MEMORY_AND_DISK
StorageLevel(True, True, False, False, 1)
- MEMORY_AND_DISK_2
StorageLevel(True, True, False, False, 2)
- MEMORY_AND_DISK_SER
StorageLevel(True, True, False, False, 1)
- MEMORY_AND_DISK_SER_2
StorageLevel(True, True, False, False, 2)
- MEMORY_ONLY
StorageLevel(False, True, False, False, 1)
- MEMORY_ONLY_2
StorageLevel(False, True, False, False, 2)
- MEMORY_ONLY_SER
StorageLevel(False, True, False, False, 1)
- MEMORY_ONLY_SER_2
StorageLevel(False, True, False, False, 2)
- OFF_HEAP
StorageLevel(True, True, True, False, 1)
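Since each of these constants is just an ordinary StorageLevel instance, its flags can be read back as plain attributes. A small snippet, assuming a standard PySpark installation:

from pyspark import StorageLevel

# The five flags map one-to-one onto the constructor arguments shown above.
level = StorageLevel.MEMORY_AND_DISK_2
print(level.useDisk, level.useMemory, level.useOffHeap, level.deserialized, level.replication)
# True True False False 2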
Instance Methods
__init__(self, useDisk, useMemory, useOffHeap, deserialized, replication=1)
Source Code for Module pyspark.storagelevel
__all__ = ["StorageLevel"]

class StorageLevel:
    """
    Basically, flags for controlling the storage of an RDD. Each StorageLevel
    records whether to use memory, or to drop the RDD to disk if it falls out
    of memory. Also, it records whether to keep the data in memory in a
    serialized format, and whether to replicate the RDD partitions on multiple
    nodes. Also contains static constants for some commonly used storage
    levels, such as MEMORY_ONLY.
    """

    def __init__(self, useDisk, useMemory, useOffHeap, deserialized, replication=1):
        self.useDisk = useDisk
        self.useMemory = useMemory
        self.useOffHeap = useOffHeap
        self.deserialized = deserialized
        self.replication = replication

StorageLevel.DISK_ONLY = StorageLevel(True, False, False, False)
StorageLevel.DISK_ONLY_2 = StorageLevel(True, False, False, False, 2)
StorageLevel.MEMORY_ONLY = StorageLevel(False, True, False, False)
StorageLevel.MEMORY_ONLY_2 = StorageLevel(False, True, False, False, 2)
StorageLevel.MEMORY_ONLY_SER = StorageLevel(False, True, False, False)
StorageLevel.MEMORY_ONLY_SER_2 = StorageLevel(False, True, False, False, 2)
StorageLevel.MEMORY_AND_DISK = StorageLevel(True, True, False, False)
StorageLevel.MEMORY_AND_DISK_2 = StorageLevel(True, True, False, False, 2)
StorageLevel.MEMORY_AND_DISK_SER = StorageLevel(True, True, False, False)
StorageLevel.MEMORY_AND_DISK_SER_2 = StorageLevel(True, True, False, False, 2)
StorageLevel.OFF_HEAP = StorageLevel(True, True, True, False, 1)
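One practical consequence of these levels is that Spark refuses to change the storage level of an RDD that has already been persisted, so switching levels requires an unpersist() first. A minimal sketch (the app name and data here are arbitrary):

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local", "persist demo")
rdd = sc.parallelize(range(10))
rdd.persist(StorageLevel.MEMORY_ONLY)
# Spark will not change the level of an already-persisted RDD;
# unpersist first, then persist again at the new level.
rdd.unpersist()
rdd.persist(StorageLevel.DISK_ONLY)
print(rdd.getStorageLevel())  # Disk Serialized 1x Replicated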
Example of PySpark StorageLevel
So, let’s see an example of StorageLevel in PySpark. Here we persist an RDD with MEMORY_AND_DISK_2, which means the RDD partitions are stored in memory and on disk, with a replication of 2.
------------------------------------storagelevel.py-------------------------------------
from pyspark import SparkContext
import pyspark

sc = SparkContext("local", "storagelevel app")
rdd1 = sc.parallelize([1, 2])
rdd1.persist(pyspark.StorageLevel.MEMORY_AND_DISK_2)
print(rdd1.getStorageLevel())
------------------------------------storagelevel.py-------------------------------------
Command:
$SPARK_HOME/bin/spark-submit storagelevel.py
Output:
Disk Memory Serialized 2x Replicated
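Note that the output reports "Serialized" even for MEMORY_AND_DISK_2. In PySpark, RDD data is always stored in serialized (pickled) form on the Python side, which is why the deserialized flag of all these constants is False.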
So, this was all about PySpark StorageLevel. Hope you like our explanation.
Conclusion
Hence, we have seen the whole concept of PySpark StorageLevel in detail. Moreover, we discussed a PySpark StorageLevel example, as well as the class variables and instance methods in PySpark's StorageLevel. Still, if any doubt occurs, please ask through the comment tab.