PySpark StatusTracker(jtracker) | Learn PySpark


In our last PySpark tutorial, we discussed the PySpark Profiler. Today, in this PySpark tutorial, “Introduction to PySpark StatusTracker”, we will learn the concept of PySpark StatusTracker(jtracker).
So, let’s begin with PySpark StatusTracker(jtracker).

What is PySpark StatusTracker(jtracker)?

Basically, StatusTracker exposes low-level status reporting APIs for monitoring job and stage progress.
However, these APIs intentionally offer only weak consistency semantics, so consumers of these APIs must be prepared to handle empty or missing information.

Let’s understand PySpark StatusTracker with an example: a job’s stage ids may be known while the status API does not yet have any details about those stages. Hence, in that case, getStageInfo could potentially return None for a valid stage id.

Keep in mind that, to limit memory usage, these APIs only provide information on recent jobs and stages: specifically, on the last spark.ui.retainedStages stages and the last spark.ui.retainedJobs jobs.

def __init__(self, jtracker):
       self._jtracker = jtracker

i. getActiveJobsIds()

This method returns a sorted list containing the ids of all active jobs.

def getActiveJobsIds(self):
    return sorted(list(self._jtracker.getActiveJobIds()))
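Since the real JVM-side tracker only exists inside a running SparkContext, here is a self-contained sketch with a fake Java tracker (FakeJavaTracker and its job ids are invented for illustration) showing what the wrapper does: the Java side may hand back ids in any order, and getActiveJobsIds() normalizes them into a sorted Python list.

```python
class FakeJavaTracker:
    """Stand-in for the JVM-side tracker; invented for this example."""
    def getActiveJobIds(self):
        return (7, 2, 5)  # hypothetical active job ids, unsorted

class StatusTracker:
    def __init__(self, jtracker):
        self._jtracker = jtracker

    def getActiveJobsIds(self):
        # Convert the Java collection to a list and sort it.
        return sorted(list(self._jtracker.getActiveJobIds()))

tracker = StatusTracker(FakeJavaTracker())
print(tracker.getActiveJobsIds())  # [2, 5, 7]
```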

ii. getActiveStageIds()

Similarly, this method returns a sorted list containing the ids of all active stages.

def getActiveStageIds(self):
    return sorted(list(self._jtracker.getActiveStageIds()))

iii. getJobIdsForGroup(jobGroup=None)

It returns a list of all known jobs in a particular job group. If jobGroup is None, it returns all known jobs that are not associated with any job group.

The returned list may contain running, failed, and completed jobs, and it may vary across invocations of this method. There is no guarantee of the order of the elements in the result.

def getJobIdsForGroup(self, jobGroup=None):
    return list(self._jtracker.getJobIdsForGroup(jobGroup))

iv. getJobInfo(jobId)

It returns a SparkJobInfo object, or None if the job info could not be found or was garbage collected.

def getJobInfo(self, jobId):
       job = self._jtracker.getJobInfo(jobId)
       if job is not None:
           return SparkJobInfo(jobId, job.stageIds(), str(job.status()))
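Here is a self-contained sketch of the two possible outcomes, using an invented fake Java tracker (FakeJavaTracker, FakeJavaJob, and the job id 3 are all illustrative). PySpark's SparkJobInfo is a namedtuple with the fields shown, which we mirror here:

```python
from collections import namedtuple

# Mirrors PySpark's SparkJobInfo namedtuple.
SparkJobInfo = namedtuple("SparkJobInfo", ["jobId", "stageIds", "status"])

class FakeJavaJob:
    """Stand-in for the JVM job-info object; fields are zero-argument methods."""
    def stageIds(self):
        return [0, 1]
    def status(self):
        return "RUNNING"

class FakeJavaTracker:
    def getJobInfo(self, jobId):
        # Only job 3 is "known"; anything else behaves as garbage-collected.
        return FakeJavaJob() if jobId == 3 else None

class StatusTracker:
    def __init__(self, jtracker):
        self._jtracker = jtracker

    def getJobInfo(self, jobId):
        job = self._jtracker.getJobInfo(jobId)
        if job is not None:
            return SparkJobInfo(jobId, job.stageIds(), str(job.status()))

tracker = StatusTracker(FakeJavaTracker())
print(tracker.getJobInfo(3))   # SparkJobInfo(jobId=3, stageIds=[0, 1], status='RUNNING')
print(tracker.getJobInfo(99))  # None: unknown or garbage-collected job
```

Note the implicit return None when the job is missing: callers must check before touching the fields.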

v. getStageInfo(stageId)

Similarly, it returns a SparkStageInfo object, or None if the stage info could not be found or was garbage collected.

def getStageInfo(self, stageId):
    stage = self._jtracker.getStageInfo(stageId)
    if stage is not None:
        # TODO: fetch them in batch for better performance
        attrs = [getattr(stage, f)() for f in SparkStageInfo._fields[1:]]
        return SparkStageInfo(stageId, *attrs)
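The interesting line is the getattr loop: for every namedtuple field after stageId, the wrapper calls the matching zero-argument getter on the JVM object. A self-contained sketch with an invented fake stage (FakeJavaStage and its values are illustrative, and the field list here is a shortened version of PySpark's real SparkStageInfo):

```python
from collections import namedtuple

# Shortened, illustrative version of PySpark's SparkStageInfo field list.
SparkStageInfo = namedtuple("SparkStageInfo",
                            ["stageId", "name", "numTasks", "numActiveTasks"])

class FakeJavaStage:
    """Stand-in for the JVM stage-info object; fields are zero-argument methods."""
    def name(self):
        return "count at demo.py:1"
    def numTasks(self):
        return 8
    def numActiveTasks(self):
        return 2

class FakeJavaTracker:
    def getStageInfo(self, stageId):
        # Only stage 0 is "known" to this fake tracker.
        return FakeJavaStage() if stageId == 0 else None

class StatusTracker:
    def __init__(self, jtracker):
        self._jtracker = jtracker

    def getStageInfo(self, stageId):
        stage = self._jtracker.getStageInfo(stageId)
        if stage is not None:
            # Call the matching JVM getter for every field after stageId.
            attrs = [getattr(stage, f)() for f in SparkStageInfo._fields[1:]]
            return SparkStageInfo(stageId, *attrs)

tracker = StatusTracker(FakeJavaTracker())
info = tracker.getStageInfo(0)
print(info.numTasks)            # 8
print(tracker.getStageInfo(5))  # None: stage the tracker no longer knows about
```

Driving the field list from SparkStageInfo._fields means adding a field to the namedtuple automatically fetches it from the JVM side too.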

So, this is all about PySpark StatusTracker(jtracker). Hope you like our explanation.

Conclusion

Hence, we have seen the whole concept of PySpark StatusTracker(jtracker) along with its code. Still, if you have any doubt regarding PySpark StatusTracker(jtracker), ask in the comment tab.
