PySpark Profiler – Methods and Functions

1. Objective

In our last article, we discussed PySpark MLlib – Algorithms and Parameters. Today, in this article, we will see PySpark Profiler. Moreover, we will discuss PySpark Profiler functions. Basically, to ensure that the applications do not waste any resources, we want to profile their threads to try and spot any problematic code.
So, let’s start PySpark Profiler.


2. What is PySpark Profiler?

PySpark supports custom profilers. A custom profiler lets you plug in different profiling behavior, or output results in formats other than what the BasicProfiler offers.
A custom profiler has to define or inherit the following methods:


i. Profile

It produces a system profile of some sort.
Let’s revise PySpark Pros and Cons

ii. Stats

This method returns the collected stats.

iii. Dump

It dumps the profile to a given path.

iv. Add

This method adds a profile to the existing accumulated profile.
Make sure the profiler class is chosen at the time of creating a SparkContext.
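Taken together, the four methods above form the contract of a custom profiler. As an illustrative sketch only, here is that interface rebuilt on the standard-library cProfile and pstats modules, with no Spark dependency; the class name MiniProfiler and the plain attribute standing in for Spark's Accumulator are our own assumptions, not part of PySpark.

```python
import cProfile
import pstats


class MiniProfiler:
    """Illustrative stand-in for PySpark's Profiler contract (name is ours)."""

    def __init__(self):
        self._accumulated = None  # plays the role of PySpark's Accumulator

    def profile(self, func):
        # Run func under cProfile and fold the result into the accumulated stats
        pr = cProfile.Profile()
        pr.runcall(func)
        st = pstats.Stats(pr)
        st.stream = None  # pstats attaches sys.stdout; drop it so stats stay picklable
        self.add(st)

    def add(self, accum_value):
        # Merge a new profile into the existing accumulated profile
        if self._accumulated is None:
            self._accumulated = accum_value
        else:
            self._accumulated.add(accum_value)

    def stats(self):
        # Return the collected stats (a pstats.Stats object, or None)
        return self._accumulated

    def show(self, id):
        # Print the profile stats to stdout, tagged with the RDD id
        stats = self.stats()
        if stats:
            print("=" * 60)
            print("Profile of RDD<id=%d>" % id)
            print("=" * 60)
            stats.sort_stats("time", "cumulative").print_stats()


p = MiniProfiler()
p.profile(lambda: sum(range(10000)))
p.profile(lambda: sorted(range(100)))
p.show(1)
```

Note how show mirrors the Show(id) method described below; the only Spark-specific piece missing here is shipping the stats between workers and the driver.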

>>> from pyspark import SparkConf, SparkContext
>>> from pyspark import BasicProfiler
>>> class MyCustomProfiler(BasicProfiler):
...     def show(self, id):
...         print("My custom profiles for RDD:%s" % id)
>>> conf = SparkConf().set("spark.python.profile", "true")
>>> sc = SparkContext('local', 'test', conf=conf, profiler_cls=MyCustomProfiler)
>>> sc.parallelize(range(1000)).map(lambda x: 2 * x).take(10)
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
>>> sc.parallelize(range(1000)).count()
1000
>>> sc.show_profiles()
My custom profiles for RDD:1
My custom profiles for RDD:3
>>> sc.stop()

v. Dump(id, path)

This function dumps the profile into the given path; here, id is the RDD id.
Let’s read PySpark SparkContext

def dump(self, id, path):
    """Dump the profile into path; id is the RDD id."""
    if not os.path.exists(path):
        os.makedirs(path)
    stats = self.stats()
    if stats:
        p = os.path.join(path, "rdd_%d.pstats" % id)
        stats.dump_stats(p)
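To see what such a dump produces, the stdlib-only snippet below writes profiling stats to an rdd_&lt;id&gt;.pstats file and reads it back with pstats.Stats; the temporary directory and the rdd_id value are our own choices for illustration, not something Spark supplies here.

```python
import cProfile
import os
import pstats
import tempfile

# Collect some stats to dump
pr = cProfile.Profile()
pr.runcall(lambda: sorted(range(5000), reverse=True))
stats = pstats.Stats(pr)

# Mirror the dump() method above: ensure the directory exists,
# then write the stats to rdd_<id>.pstats inside it
rdd_id = 1
path = tempfile.mkdtemp()  # illustrative output directory
if not os.path.exists(path):
    os.makedirs(path)
target = os.path.join(path, "rdd_%d.pstats" % rdd_id)
stats.dump_stats(target)

# The file can be reloaded later for offline analysis
reloaded = pstats.Stats(target)
```

The .pstats file can also be inspected afterwards with any pstats-compatible tool.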

vi. Profile(func)

It performs profiling on the function func.

def profile(self, func):
    """Do profiling on the function func."""
    raise NotImplementedError

vii. Show(id)

This function prints the profile stats to stdout; here, id is the RDD id.

def show(self, id):
    stats = self.stats()
    if stats:
        print("=" * 60)
        print("Profile of RDD<id=%d>" % id)
        print("=" * 60)
        stats.sort_stats("time", "cumulative").print_stats()

viii. Stats()

This function returns the collected profiling stats (pstats.Stats).

def stats(self):
    """Return the collected profiling stats (pstats.Stats)."""
    raise NotImplementedError

3. class pyspark.BasicProfiler(ctx)

BasicProfiler is the default profiler; it is implemented on the basis of cProfile and Accumulator.
Learn PySpark RDD with Operations and Commands

i. profile(func)

This function runs and profiles the method func passed in; a profile object is returned.

def profile(self, func):
    """Runs and profiles the function func passed in; a profile object is returned."""
    pr = cProfile.Profile()
    pr.runcall(func)
    st = pstats.Stats(pr)
    st.stream = None  # make it picklable
    st.strip_dirs()

    # Adds a new profile to the existing accumulated value
    self._accumulator.add(st)
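BasicProfiler has to ship the collected stats from the workers back to the driver through an Accumulator, which is why it sets the stats object's stream to None before accumulating. The stdlib-only sketch below (not PySpark source itself) shows that a pstats.Stats becomes picklable once the stream is dropped:

```python
import cProfile
import pickle
import pstats

pr = cProfile.Profile()
pr.runcall(lambda: [x * 2 for x in range(1000)])

st = pstats.Stats(pr)
st.stream = None   # the default stream is sys.stdout, which cannot be pickled
st.strip_dirs()    # shorten file paths, as BasicProfiler.profile does

# With the stream removed, the stats survive a pickle round trip,
# so they could travel through a Spark Accumulator
payload = pickle.dumps(st)
restored = pickle.loads(payload)
```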

ii. stats()

This returns the profiling stats accumulated so far:

def stats(self):
    return self._accumulator.value

So, this was all about PySpark Profiler. Hope you like our explanation.

4. Conclusion

Hence, we have covered all about PySpark Profilers, including their methods and functions. This article should clarify your concepts regarding PySpark Profilers. Still, if you have any doubt, ask in the comment tab.
See also –
PySpark Broadcast and Accumulator With Examples

