PySpark Serializers and Its Types – Marshal & Pickle


Today, in this PySpark article, “PySpark Serializers and its Types”, we will discuss the whole concept of PySpark serializers. PySpark supports two types of serializers, MarshalSerializer and PickleSerializer, and we will learn about both in detail.

So, let’s begin PySpark Serializers.

What Are PySpark Serializers?

Basically, serialization is used for performance tuning on Apache Spark. All data that is sent over the network, written to disk, or persisted in memory must be serialized. In other words, serialization plays an important role in costly operations.

Types of PySpark Serializers

However, for performance tuning, PySpark supports custom serializers. These are the two serializers that PySpark supports:

  • MarshalSerializer
  • PickleSerializer

So, let’s understand the types of PySpark serializers in detail.

MarshalSerializer

The PySpark MarshalSerializer serializes objects using Python’s marshal module. Compared with PickleSerializer, this serializer is faster, but it supports fewer datatypes.

import marshal

from pyspark.serializers import FramedSerializer

class MarshalSerializer(FramedSerializer):
    """
    Serializes objects using Python's marshal module:
    http://docs.python.org/2/library/marshal.html
    """
    # marshal is fast but supports only built-in datatypes.
    dumps = marshal.dumps
    loads = marshal.loads

i. Instance Methods

Inherited from FramedSerializer: __init__, dump_stream, load_stream
Inherited from Serializer: __eq__, __ne__
Inherited from object: __delattr__, __format__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __sizeof__, __str__, __subclasshook__

ii. Class Variables

dumps = marshal.dumps

Generally, it serializes an object into a byte array.

loads = marshal.loads

Whereas, it deserializes an object from a byte array.
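To see MarshalSerializer in action, here is a minimal sketch that passes it to SparkContext (a local Spark installation is assumed, and the app name “Marshal demo” is our own choice):

from pyspark import SparkContext
from pyspark.serializers import MarshalSerializer

# Ask SparkContext to use MarshalSerializer for Python data.
sc = SparkContext("local", "Marshal demo", serializer=MarshalSerializer())
# marshal handles built-in types such as ints quickly.
print(sc.parallelize(range(1000)).map(lambda x: x * 2).take(10))
sc.stop()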

PickleSerializer

The PySpark PickleSerializer serializes objects using Python’s pickle module. Its best feature is that it supports nearly any Python object, although it is not as fast as more specialized serializers.

import cPickle

from pyspark.serializers import FramedSerializer

class PickleSerializer(FramedSerializer):
    """
    Serializes objects using Python's pickle module:
    http://docs.python.org/2/library/pickle.html
    """
    def dumps(self, obj):
        # Protocol 2 gives a compact binary pickle.
        return cPickle.dumps(obj, 2)

    loads = cPickle.loads

i. Instance Methods

dumps(self, obj)

It serializes an object into a byte array. When batching is used, this will be called with an array of objects.
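For intuition, here is a minimal round-trip sketch of dumps and loads (the sample dictionary is just an illustration):

from pyspark.serializers import PickleSerializer

ser = PickleSerializer()
data = {"key": [1, 2, 3]}
blob = ser.dumps(data)          # serialize to a byte array
assert ser.loads(blob) == data  # deserialize back to the original object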

Inherited from FramedSerializer: __init__, dump_stream, load_stream
Inherited from Serializer: __eq__, __ne__
Inherited from object: __delattr__, __format__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __sizeof__, __str__, __subclasshook__

ii. Class Variables

loads = cPickle.loads

Basically, it deserializes an object from a byte array.
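Note that PickleSerializer is PySpark’s default serializer, so passing it explicitly, as in this minimal sketch, only makes that default visible (the app name “Pickle demo” is our own choice):

from pyspark import SparkContext
from pyspark.serializers import PickleSerializer

sc = SparkContext("local", "Pickle demo", serializer=PickleSerializer())
# Pickle copes with arbitrary Python objects, e.g. mixed-type pairs.
rdd = sc.parallelize([("a", 1), ("b", [2, 3])])
print(rdd.collect())
sc.stop()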
So, this is all about PySpark Serializers. Hope you like our explanation.

Conclusion

Hence, we have covered all about PySpark serializers in this article. We have also learned about both types, the Marshal and Pickle serializers, which PySpark supports, along with their code.

Still, if any doubt occurs regarding PySpark Serializers, feel free to ask in the comment section. We will definitely get back to you.

2 Responses

  1. Dinesh says:

Can I use the Kryo serializer with PySpark?

    • Pablo says:

No, since Kryo is a Java serializer. It can be used with Scala, for example, but for PySpark we need to use Python serializers.
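(For JVM-side tuning, Kryo can still be enabled via the spark.serializer configuration; this is a minimal sketch, and it affects only Spark’s internal Java/Scala objects, not the Python records handled by PySpark’s serializers.)

from pyspark import SparkConf, SparkContext

# Kryo applies to the JVM side of Spark only; Python data still goes
# through PySpark's MarshalSerializer or PickleSerializer.
conf = SparkConf().set("spark.serializer",
                       "org.apache.spark.serializer.KryoSerializer")
sc = SparkContext(conf=conf)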
