PySpark Serializers and Its Types – Marshal & Pickle

1. Objective

Today, in this PySpark article, “PySpark Serializers and its Types” we will discuss the whole concept of PySpark Serializers. Moreover, there are two types of serializers that PySpark supports – MarshalSerializer and PickleSerializer, we will also learn them in detail.
So, let’s begin PySpark Serializers.

PySpark Serializers

PySpark Serializers and Its Types – Marshal & Pickle

2. What is PySpark Serializers?

Basically, for performance tuning on Apache Spark, Serialization is used. However, all that data which is sent over the network or written to the disk or also which is persisted in the memory must be serialized. In addition, we can say, in costly operations, serialization plays an important role.
Let’s read Best 5 PySpark Books

3. Types of PySpark Serializers

However, for performance tuning, PySpark supports custom serializers. So, here are the two serializers which are supported by PySpark, such as −

  • MarshalSerializer
  • PickleSerializer

So, let’s understand types of PySpark Serializers in detail.

4. MarshalSerializer

By using PySpark Marshal Serializer, it Serializes Objects. On comparing with PickleSerializer, this serializer of PySpark is faster. However, it supports fewer datatypes.
Learn Pyspark Profiler – Methods and Functions

class MarshalSerializer(FramedSerializer):
    dumps = marshal.dumps
    loads = marshal.loads

i. Instance Methods

Inherited from FramedSerializer: __init__, dump_stream, load_stream
Inherited from Serializer: __eq__, __ne__
Inherited from object: __delattr__, __format__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __sizeof__, __str__, __subclasshook__

ii. Class Variables

dumps = marshal.dumps

Generally, it Serializes an object into a byte array.
Learn PySpark RDD with Operations and Commands

loads = marshal.loads

Whereas, it Deserialize an object from a byte array.

5. PickleSerializer

By using PySpark’s Pickle Serializer, it Serializes Objects. However, its best feature is its supports nearly any Python object, although it is not as fast as more specialized serializers.

class PickleSerializer(FramedSerializer):
-    def dumps(self, obj):
         return cPickle.dumps(obj, 2)
     loads = cPickle.loads

i. Instance Methods

dumps(self, obj)

It Serializes an object into a byte array. However, this will be called with an array of objects, when batching is used.

Inherited from FramedSerializer: __init__, dump_stream, load_streamInherited from Serializer: __eq__, __ne__
Inherited from object: __delattr__, __format__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __sizeof__, __str__, __subclasshook__

ii. Class Variables

loads = cPickle.loads

Basically, it deserializes an object from a byte array.
So, this is all about PySpark Serializers. Hope you like our explanation.
Let’s discuss PySpark SparkContext With Examples and Parameters

6. Conclusion

Hence, we have covered all about PySpark Serializers in this article. Also, we have learned about both the types, Marshal and Pickle Serializers which are supported in PySpark, along with their codes. Still, if any doubt occurs regarding PySpark Serializers, feel free to ask in the comment section. We will definitely get back to you.
See also –
PySpark Broadcast and Accumulator With Examples
For reference

1 Response

  1. Dinesh says:

    Can I use Kryo serializer with Pyspark

Leave a Reply

Your email address will not be published. Required fields are marked *

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.