Metadata accumulates on the driver as a consequence of shuffle operations. This becomes particularly problematic during long-running jobs.
To deal with the issue of accumulating metadata, there are two options:
First, set the spark.cleaner.ttl parameter to trigger automatic cleanups. Note, however, that this also removes persisted RDDs once they are older than the configured duration.
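A minimal sketch of setting this parameter when constructing the context (the application name is illustrative; the TTL value is a duration in seconds):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Clean up metadata -- and any persisted RDDs -- older than one hour.
val conf = new SparkConf()
  .setAppName("long-running-job")       // illustrative name
  .set("spark.cleaner.ttl", "3600")     // duration in seconds

val sc = new SparkContext(conf)
```

Choose a TTL longer than the lifetime of any RDD you intend to keep cached, or the cleaner will evict it out from under you.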
The other solution is to split long-running jobs into batches and write intermediate results to disk. This provides a fresh environment for each batch, so metadata build-up is no longer a concern.
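The batching approach can be sketched as follows; the batch names, HDFS paths, and the placeholder transformation are all hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical batch boundaries for illustration.
val batches = Seq("2015-01", "2015-02", "2015-03")

for (batch <- batches) {
  // Start a fresh context per batch so driver metadata
  // from earlier batches is discarded.
  val sc = new SparkContext(
    new SparkConf().setAppName(s"batch-$batch"))

  val input  = sc.textFile(s"hdfs:///data/input/$batch")   // hypothetical path
  val result = input.map(_.toUpperCase)                    // placeholder transformation

  // Persist intermediate results to disk before tearing down.
  result.saveAsTextFile(s"hdfs:///data/output/$batch")     // hypothetical path

  sc.stop()
}
```

Each iteration tears down the SparkContext with `sc.stop()`, which releases the driver-side metadata accumulated by that batch's shuffles.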