PySpark Serializers and Its Types – Marshal & Pickle

1. Objective

In this PySpark article, "PySpark Serializers and Its Types", we will discuss the concept of PySpark serializers. PySpark supports two serializers, MarshalSerializer and PickleSerializer, and we will look at each of them in detail.
So, let's begin with PySpark Serializers.


2. What are PySpark Serializers?

In Apache Spark, serialization is used for performance tuning. All data that is sent over the network, written to disk, or persisted in memory must be serialized. Serialization therefore plays an important role in costly operations.
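To make the idea concrete, here is a minimal, framework-independent sketch of serialization using Python's standard pickle module; the record is just an illustrative value, not anything from PySpark itself.

import pickle

# Any object that has to leave the Python process (network, disk, an
# external memory store) must first be turned into bytes, then rebuilt
# on the other side.
record = {"user": "alice", "scores": [91, 85, 78]}   # illustrative data

payload = pickle.dumps(record)     # serialize: Python object -> bytes
restored = pickle.loads(payload)   # deserialize: bytes -> Python object

assert restored == record
print(len(payload), "bytes would travel over the wire")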


3. Types of PySpark Serializers

For performance tuning, PySpark supports custom serializers. The two serializers supported by PySpark are:

  • MarshalSerializer
  • PickleSerializer

So, let's understand both types of PySpark serializers in detail, starting with how a serializer is selected in the sketch below.
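Both serializers are passed to the SparkContext constructor through its serializer parameter. The following is a minimal sketch, assuming a local master and an arbitrary application name:

from pyspark import SparkContext
from pyspark.serializers import MarshalSerializer

# The serializer is fixed when the SparkContext is created;
# it cannot be swapped afterwards for that context.
sc = SparkContext("local", "serializer-demo", serializer=MarshalSerializer())

print(sc.parallelize(range(1000)).map(lambda x: 2 * x).take(10))
sc.stop()

The same pattern works with PickleSerializer; only the class passed to the serializer parameter changes.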

4. MarshalSerializer

PySpark's MarshalSerializer serializes objects using Python's marshal module. Compared with PickleSerializer it is faster, but it supports fewer datatypes.

class MarshalSerializer(FramedSerializer):
    """
    Serializes objects using Python's Marshal serializer:
    http://docs.python.org/2/library/marshal.html
    """
    # Excerpt from pyspark/serializers.py; it relies on "import marshal"
    # and on FramedSerializer, both available in that module.
    dumps = marshal.dumps   # serialize an object into a byte array
    loads = marshal.loads   # deserialize an object from a byte array

i. Instance Methods

Inherited from FramedSerializer: __init__, dump_stream, load_stream
Inherited from Serializer: __eq__, __ne__
Inherited from object: __delattr__, __format__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __sizeof__, __str__, __subclasshook__

ii. Class Variables

dumps = marshal.dumps

It serializes an object into a byte array.

loads = marshal.loads

It deserializes an object from a byte array.
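As a rough illustration of the speed-versus-coverage trade-off, the sketch below round-trips a value with the underlying marshal module and then shows it rejecting a user-defined object; the Point class is purely hypothetical.

import marshal

# marshal round-trips core built-in types (ints, strings, lists, dicts, ...)
data = {"ints": [1, 2, 3], "name": "spark"}
blob = marshal.dumps(data)           # serialize into a byte array
print(marshal.loads(blob) == data)   # True: deserialized intact

# ...but it refuses types it does not know, e.g. instances of user classes.
class Point:                          # hypothetical class for illustration
    pass

try:
    marshal.dumps(Point())
except ValueError as exc:
    print("marshal cannot serialize this:", exc)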

5. PickleSerializer

PySpark's PickleSerializer serializes objects using Python's pickle module. Its main advantage is that it supports nearly any Python object, although it is not as fast as more specialized serializers.

class PickleSerializer(FramedSerializer):
    """
    Serializes objects using Python's pickle serializer:
    http://docs.python.org/2/library/pickle.html
    """
    # Excerpt from pyspark/serializers.py; on Python 2, cPickle is the
    # C implementation of pickle imported in that module.
    def dumps(self, obj):
        return cPickle.dumps(obj, 2)   # serialize using pickle protocol 2

    loads = cPickle.loads              # deserialize from a byte array

i. Instance Methods

dumps(self, obj)

It serializes an object into a byte array. When batching is used, this will be called with an array of objects.

Inherited from FramedSerializer: __init__, dump_stream, load_stream
Inherited from Serializer: __eq__, __ne__
Inherited from object: __delattr__, __format__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __sizeof__, __str__, __subclasshook__

ii. Class Variables

loads = cPickle.loads

It deserializes an object from a byte array.
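To see the broader datatype coverage in practice, here is a small sketch that ships datetime.date values through an RDD; dates can be pickled but not marshalled. The master, application name, and sample records are illustrative only, and since PickleSerializer is in fact PySpark's default, passing it explicitly is just for clarity.

from datetime import date

from pyspark import SparkContext
from pyspark.serializers import PickleSerializer

sc = SparkContext("local", "pickle-demo", serializer=PickleSerializer())

# date objects are picklable but not marshal-serializable, so this job
# depends on PickleSerializer's wider datatype support.
events = [("release", date(2018, 5, 1)), ("patch", date(2018, 6, 15))]
print(sc.parallelize(events).map(lambda ev: (ev[0], ev[1].year)).collect())
sc.stop()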
So, this is all about PySpark Serializers. Hope you like our explanation.

6. Conclusion

Hence, we have covered PySpark serializers in this article. We have also learned about both supported types, MarshalSerializer and PickleSerializer, along with their code. Still, if any doubt occurs regarding PySpark serializers, feel free to ask in the comment section; we will definitely get back to you.
See also –
PySpark Broadcast and Accumulator With Examples
