PySpark StatusTracker(jtracker) | Learn PySpark
In our last PySpark tutorial, we discussed the PySpark Profiler. Today, in this "Introduction to PySpark StatusTracker" tutorial, we will learn the concept of PySpark StatusTracker(jtracker) and walk through the methods it exposes.
So, let's begin with PySpark StatusTracker(jtracker).
2. What is PySpark StatusTracker(jtracker)?
PySpark StatusTracker provides low-level status reporting APIs for monitoring job and stage progress.
These APIs intentionally offer very weak consistency semantics, so consumers must be prepared to handle empty or missing information.
For example, a job's stage ids may be known while the status API does not yet have any details about those stages. In that case, getStageInfo could return None for a valid stage id.
To limit memory usage, these APIs only provide information on recent jobs and stages: the last spark.ui.retainedStages stages and the last spark.ui.retainedJobs jobs.
def __init__(self, jtracker):
    # Keep a reference to the JVM-side status tracker that this class wraps
    self._jtracker = jtracker
The getActiveJobsIds method returns an array containing the ids of all active jobs.
def getActiveJobsIds(self):
    return sorted(list(self._jtracker.getActiveJobIds()))
Similarly, the getActiveStageIds method returns an array containing the ids of all active stages.
def getActiveStageIds(self):
    return sorted(list(self._jtracker.getActiveStageIds()))
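As a rough illustration of how these two methods might be used together, here is a minimal sketch. The helper name snapshot_active is hypothetical, and the tracker argument is assumed to be the object returned by sc.statusTracker() on an existing SparkContext.

```python
def snapshot_active(tracker):
    """Collect the ids of the currently active jobs and stages.

    `tracker` is assumed to be the result of sc.statusTracker().
    The helper name is hypothetical and not part of PySpark.
    """
    return {
        # The PySpark method name really is getActiveJobsIds ("Jobs" plural).
        "jobs": tracker.getActiveJobsIds(),
        "stages": tracker.getActiveStageIds(),
    }
```

Because both methods return sorted lists, two snapshots taken while the same work is running compare equal, which makes simple polling loops easier to write.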
The getJobIdsForGroup method returns a list of all known jobs in a particular job group. If jobGroup is None, it returns all known jobs that are not associated with a job group.
The returned list may contain running, failed, and completed jobs, and it may vary across invocations of this method. There is also no guarantee about the order of the elements in the result.
def getJobIdsForGroup(self, jobGroup=None):
    return list(self._jtracker.getJobIdsForGroup(jobGroup))
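Since the order of the result is not guaranteed, a caller that needs stable output can sort it. The following is only a sketch: the helper name job_ids_for_group is hypothetical, the tracker is assumed to come from sc.statusTracker(), and the group name is assumed to have been set earlier with sc.setJobGroup.

```python
def job_ids_for_group(tracker, job_group=None):
    """Return the known job ids for a job group in ascending order.

    `tracker` is assumed to be the result of sc.statusTracker().
    Passing job_group=None asks for jobs that belong to no group.
    The helper name is hypothetical and not part of PySpark.
    """
    # getJobIdsForGroup makes no promise about ordering, so sort here.
    return sorted(tracker.getJobIdsForGroup(job_group))
```

The list can still differ between calls, because jobs may start, finish, or be dropped from the retained history in between.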
The getJobInfo method returns a SparkJobInfo object, or None if the job info could not be found or was garbage collected.
def getJobInfo(self, jobId):
    job = self._jtracker.getJobInfo(jobId)
    if job is not None:
        return SparkJobInfo(jobId, job.stageIds(), str(job.status()))
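Because getJobInfo can return None, callers should check the result before reading its fields. Here is a minimal sketch of that pattern. The helper name job_status and the "NOT_TRACKED" label are hypothetical, and the tracker is assumed to come from sc.statusTracker().

```python
def job_status(tracker, job_id, missing="NOT_TRACKED"):
    """Return the status string of a job, or a fallback label.

    `tracker` is assumed to be the result of sc.statusTracker().
    The helper name and the default `missing` label are hypothetical.
    """
    info = tracker.getJobInfo(job_id)
    if info is None:
        # The job was never seen or has already been garbage collected.
        return missing
    # SparkJobInfo is a namedtuple with jobId, stageIds and status fields.
    return info.status
```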
Similarly, the getStageInfo method returns a SparkStageInfo object, or None if the stage info could not be found or was garbage collected.
def getStageInfo(self, stageId):
    stage = self._jtracker.getStageInfo(stageId)
    if stage is not None:
        # TODO: fetch them in batch for better performance
        attrs = [getattr(stage, f)() for f in SparkStageInfo._fields[1:]]
        return SparkStageInfo(stageId, *attrs)
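The task counters on SparkStageInfo can be turned into a simple progress figure. The sketch below is one way to do that under stated assumptions: the helper name stage_progress is hypothetical, the tracker is assumed to come from sc.statusTracker(), and None is returned whenever no meaningful fraction can be computed.

```python
def stage_progress(tracker, stage_id):
    """Return the completed fraction of a stage (0.0 to 1.0), or None.

    `tracker` is assumed to be the result of sc.statusTracker().
    The helper name is hypothetical and not part of PySpark.
    """
    stage = tracker.getStageInfo(stage_id)
    if stage is None:
        # Valid stage ids can still yield None (weak consistency).
        return None
    if stage.numTasks == 0:
        # Avoid dividing by zero when no tasks are reported.
        return None
    return stage.numCompletedTasks / stage.numTasks
```

Returning None rather than 0.0 for unknown stages lets the caller tell "no information" apart from "no tasks finished yet".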
So, this was all about PySpark StatusTracker(jtracker). We hope you liked our explanation.
Hence, we have seen the whole concept of PySpark StatusTracker(jtracker) along with its code. Still, if you have any doubt regarding PySpark StatusTracker(jtracker), ask in the comment tab.