PySpark Serializers and Its Types – Marshal & Pickle
Today, in this PySpark article, "PySpark Serializers and Its Types," we will discuss the whole concept of PySpark serializers. PySpark supports two serializers, MarshalSerializer and PickleSerializer, and we will learn about both in detail.
So, let's begin with PySpark serializers.
What Are PySpark Serializers?
Basically, serialization is used for performance tuning on Apache Spark. All data that is sent over the network, written to disk, or persisted in memory must be serialized, so serialization plays an important role in these costly operations.
Types of PySpark Serializers
For performance tuning, PySpark supports custom serializers. The two serializers supported by PySpark are:
- MarshalSerializer
- PickleSerializer
So, let's understand the types of PySpark serializers in detail.
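Before looking at each serializer, here is a minimal sketch of how a custom serializer is supplied to PySpark: the `SparkContext` constructor accepts a `serializer` argument. This assumes a working local Spark installation; "serialization app" is just an illustrative application name.

```python
# A minimal sketch of choosing a custom serializer in PySpark,
# assuming a local Spark installation is available.
from pyspark import SparkContext
from pyspark.serializers import MarshalSerializer

# Pass the serializer when creating the SparkContext.
sc = SparkContext("local", "serialization app", serializer=MarshalSerializer())

# RDD elements are now serialized with marshal under the hood.
print(sc.parallelize(range(10)).map(lambda x: 2 * x).collect())
sc.stop()
```

The same pattern works with `PickleSerializer`; the choice only affects how records are turned into bytes, not the RDD API itself.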
MarshalSerializer
The PySpark MarshalSerializer serializes objects using Python's marshal module. This serializer is faster than PickleSerializer, but it supports fewer datatypes.
```python
class MarshalSerializer(FramedSerializer):
    """
    http://docs.python.org/2/library/marshal.html
    """
    dumps = marshal.dumps
    loads = marshal.loads
```
i. Instance Methods
Inherited from FramedSerializer: __init__, dump_stream, load_stream
Inherited from Serializer: __eq__, __ne__
Inherited from object: __delattr__, __format__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __sizeof__, __str__, __subclasshook__
ii. Class Variables
dumps = marshal.dumps
Serializes an object into a byte array.
loads = marshal.loads
Deserializes an object from a byte array.
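As a quick illustration of the class variables above, marshal's `dumps` and `loads` can be exercised directly. This sketch shows both the byte-array roundtrip and the "fewer datatypes" limitation:

```python
import marshal

# Roundtrip a structure built from the primitive types marshal supports.
data = {"counts": [1, 2, 3], "name": "spark"}
blob = marshal.dumps(data)          # serialize to a byte array
assert isinstance(blob, bytes)
assert marshal.loads(blob) == data  # deserialize back

# marshal supports fewer datatypes than pickle: instances of
# user-defined classes cannot be marshalled.
class Point:
    pass

try:
    marshal.dumps(Point())
except ValueError:
    print("marshal rejects arbitrary objects")
```

This is exactly why MarshalSerializer is the faster but more restrictive choice: it only handles marshal's primitive types.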
PickleSerializer
The PySpark PickleSerializer serializes objects using Python's pickle module. Its best feature is that it supports nearly any Python object, although it is not as fast as more specialized serializers.
```python
class PickleSerializer(FramedSerializer):
    """
    http://docs.python.org/2/library/pickle.html
    """
    def dumps(self, obj):
        return cPickle.dumps(obj, 2)

    loads = cPickle.loads
```
i. Instance Methods
dumps(self, obj)
Serializes an object into a byte array. When batching is used, this will be called with an array of objects.
Inherited from FramedSerializer: __init__, dump_stream, load_stream
Inherited from Serializer: __eq__, __ne__
Inherited from object: __delattr__, __format__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __sizeof__, __str__, __subclasshook__
ii. Class Variables
loads = cPickle.loads
Deserializes an object from a byte array.
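To show the contrast with MarshalSerializer, here is a small sketch using Python 3's `pickle` module (the successor to the Python 2 `cPickle` named in the snippet above): unlike marshal, it handles instances of user-defined classes.

```python
import pickle

# A user-defined class, which marshal cannot serialize but pickle can.
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

p = Point(3, 4)
blob = pickle.dumps(p, protocol=2)  # protocol 2, as in cPickle.dumps(obj, 2)
restored = pickle.loads(blob)       # deserialize from the byte array
assert (restored.x, restored.y) == (3, 4)
```

This generality is what makes PickleSerializer PySpark's safe default, at some cost in speed compared with MarshalSerializer.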
So, this is all about PySpark Serializers. Hope you like our explanation.
Conclusion
Hence, we have covered PySpark serializers in this article. We have also learned about both supported types, MarshalSerializer and PickleSerializer, along with their code.
Still, if any doubt remains regarding PySpark serializers, feel free to ask in the comment section. We will definitely get back to you.
Can I use the Kryo serializer with PySpark?
No, since Kryo is a Java serializer. It can be used with Scala, for example, but with PySpark we need to use Python serializers.