PySpark StatusTracker(jtracker) | Learn PySpark


1. Objective

In our last PySpark tutorial, we discussed the PySpark Profiler. Today, in this PySpark tutorial, "Introduction to PySpark StatusTracker", we will learn the concept of PySpark StatusTracker(jtracker).
So, let's begin with PySpark StatusTracker(jtracker).



2. What is PySpark StatusTracker(jtracker)?

Basically, StatusTracker exposes low-level status reporting APIs for monitoring job and stage progress.
However, these APIs intentionally offer very weak consistency semantics, so consumers of these APIs must be prepared to handle empty or missing information.
Let's understand PySpark StatusTracker with an example: a job's stage ids may be known, yet the status API may not have details about those stages. Hence, in that case, getStageInfo could potentially return None for a valid stage id.
Make sure to note that, to limit memory usage, these APIs only provide information on recent jobs and stages: they cover the last spark.ui.retainedStages stages and the last spark.ui.retainedJobs jobs.

def __init__(self, jtracker):
    self._jtracker = jtracker
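In practice you rarely build a StatusTracker yourself: SparkContext.statusTracker() hands you one already wired to the JVM-side tracker. A minimal sketch of the wrapping step, using a hypothetical FakeJvmTracker class as a stand-in for the Py4J proxy object:

```python
class FakeJvmTracker:
    """Hypothetical stand-in for the JVM-side tracker proxy."""
    def getActiveJobIds(self):
        # An idle application reports no active jobs
        return []

class StatusTracker:
    # Mirrors the constructor shown above: it just stores the JVM-side proxy
    def __init__(self, jtracker):
        self._jtracker = jtracker

tracker = StatusTracker(FakeJvmTracker())
print(tracker._jtracker.getActiveJobIds())  # []
```

With a real SparkContext, the equivalent call would be `tracker = sc.statusTracker()`.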

i. getActiveJobsIds()

This method returns a sorted list containing the ids of all active jobs.

def getActiveJobsIds(self):
    return sorted(list(self._jtracker.getActiveJobIds()))
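Note that the Python wrapper sorts the ids before returning them, even though the JVM side may hand them back in arbitrary order. A quick check using a hypothetical stand-in for the JVM tracker:

```python
class FakeJvmTracker:
    """Hypothetical stand-in: returns active job ids in arbitrary order."""
    def getActiveJobIds(self):
        return [7, 2, 5]

class StatusTracker:
    def __init__(self, jtracker):
        self._jtracker = jtracker

    def getActiveJobsIds(self):
        # Same logic as the snippet above: copy to a list, then sort
        return sorted(list(self._jtracker.getActiveJobIds()))

print(StatusTracker(FakeJvmTracker()).getActiveJobsIds())  # [2, 5, 7]
```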

ii. getActiveStageIds()

Similarly, this method returns a sorted list containing the ids of all active stages.

def getActiveStageIds(self):
    return sorted(list(self._jtracker.getActiveStageIds()))
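As with job ids, callers should be ready for an empty result when nothing is running, given the weak consistency semantics described earlier. A sketch with a hypothetical idle tracker:

```python
class IdleJvmTracker:
    """Hypothetical stand-in: an idle application reports no active stages."""
    def getActiveStageIds(self):
        return []

class StatusTracker:
    def __init__(self, jtracker):
        self._jtracker = jtracker

    def getActiveStageIds(self):
        return sorted(list(self._jtracker.getActiveStageIds()))

# No stages running, so the wrapper returns an empty list
print(StatusTracker(IdleJvmTracker()).getActiveStageIds())  # []
```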

iii. getJobIdsForGroup(jobGroup=None)

This method gets a list of all known jobs in a particular job group. If jobGroup is None, it returns all known jobs that are not associated with any job group.
The returned list may contain several types of jobs, such as running, failed, and completed jobs, and may vary across invocations of this method. Make sure to note that there is no guarantee of the order of the elements in the result.

def getJobIdsForGroup(self, jobGroup=None):
    return list(self._jtracker.getJobIdsForGroup(jobGroup))
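To see the jobGroup=None fallback in action, here is a sketch with a hypothetical tracker that maps group names to job ids (the group names and ids are made up for illustration):

```python
class FakeJvmTracker:
    """Hypothetical stand-in mapping job groups to known job ids."""
    def __init__(self):
        # None collects jobs that were submitted outside any job group
        self._groups = {"etl": [4, 9], None: [1]}

    def getJobIdsForGroup(self, jobGroup):
        return self._groups.get(jobGroup, [])

class StatusTracker:
    def __init__(self, jtracker):
        self._jtracker = jtracker

    def getJobIdsForGroup(self, jobGroup=None):
        return list(self._jtracker.getJobIdsForGroup(jobGroup))

t = StatusTracker(FakeJvmTracker())
print(t.getJobIdsForGroup("etl"))  # [4, 9]
print(t.getJobIdsForGroup())       # [1] -- jobs outside any group
```

In a live application, you would pair this with sc.setJobGroup("etl", "description") before submitting work, so later status queries can single out that group's jobs.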

iv. getJobInfo(jobId)

It returns a SparkJobInfo object, or None if the job info could not be found or was garbage collected.

def getJobInfo(self, jobId):
    job = self._jtracker.getJobInfo(jobId)
    if job is not None:
        return SparkJobInfo(jobId, job.stageIds(), str(job.status()))
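The None path matters: an unknown or garbage-collected jobId simply falls through the if and the method returns None. A sketch using hypothetical stand-ins for the JVM job object and tracker (SparkJobInfo really is a namedtuple in pyspark.status, redefined here so the example is self-contained):

```python
from collections import namedtuple

SparkJobInfo = namedtuple("SparkJobInfo", "jobId stageIds status")

class FakeJvmJob:
    """Hypothetical JVM-side job object: each field is a zero-arg method."""
    def stageIds(self):
        return [0, 1]

    def status(self):
        return "SUCCEEDED"

class FakeJvmTracker:
    """Hypothetical stand-in: knows only job 0; others are unknown or GC'd."""
    def getJobInfo(self, jobId):
        return FakeJvmJob() if jobId == 0 else None

class StatusTracker:
    def __init__(self, jtracker):
        self._jtracker = jtracker

    def getJobInfo(self, jobId):
        job = self._jtracker.getJobInfo(jobId)
        if job is not None:
            return SparkJobInfo(jobId, job.stageIds(), str(job.status()))

t = StatusTracker(FakeJvmTracker())
print(t.getJobInfo(0))   # SparkJobInfo(jobId=0, stageIds=[0, 1], status='SUCCEEDED')
print(t.getJobInfo(99))  # None -- unknown or garbage-collected job
```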

v. getStageInfo(stageId)

Similarly, it returns a SparkStageInfo object, or None if the stage info could not be found or was garbage collected.

def getStageInfo(self, stageId):
    stage = self._jtracker.getStageInfo(stageId)
    if stage is not None:
        # TODO: fetch them in batch for better performance
        attrs = [getattr(stage, f)() for f in SparkStageInfo._fields[1:]]
        return SparkStageInfo(stageId, *attrs)
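The getattr loop is worth unpacking: for every SparkStageInfo field after stageId, it calls the same-named zero-argument getter on the JVM stage object. A sketch with a trimmed, hypothetical field list (the real SparkStageInfo in pyspark.status carries more fields, such as attempt id and task counts):

```python
from collections import namedtuple

# Trimmed field list for illustration only
SparkStageInfo = namedtuple("SparkStageInfo", "stageId name numTasks")

class FakeJvmStage:
    """Hypothetical JVM-side stage object: each field is a zero-arg method."""
    def name(self):
        return "count at <stdin>:1"

    def numTasks(self):
        return 8

class FakeJvmTracker:
    """Hypothetical stand-in: knows only stage 0."""
    def getStageInfo(self, stageId):
        return FakeJvmStage() if stageId == 0 else None

class StatusTracker:
    def __init__(self, jtracker):
        self._jtracker = jtracker

    def getStageInfo(self, stageId):
        stage = self._jtracker.getStageInfo(stageId)
        if stage is not None:
            # Call each remaining field as a getter on the JVM object
            attrs = [getattr(stage, f)() for f in SparkStageInfo._fields[1:]]
            return SparkStageInfo(stageId, *attrs)

t = StatusTracker(FakeJvmTracker())
print(t.getStageInfo(0))  # SparkStageInfo(stageId=0, name='count at <stdin>:1', numTasks=8)
print(t.getStageInfo(5))  # None -- unknown or garbage-collected stage
```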

So, this is all about PySpark StatusTracker(jtracker). Hope you like our explanation.

3. Conclusion

Hence, we have seen the whole concept of PySpark StatusTracker(jtracker) along with its code. Still, if you have any doubts regarding PySpark StatusTracker(jtracker), ask in the comments.
See also –
PySpark MLlib – Algorithms and Parameters
