
Top 30 PySpark Interview Questions and Answers


In this PySpark article, we will go through the most frequently asked PySpark Interview Questions and Answers. These interview questions for PySpark will help both freshers and experienced candidates. Moreover, you will get a guide on how to crack a PySpark interview. Follow each link for a better understanding.

So, let’s start with the PySpark Interview Questions.

PySpark Interview Questions

Below, we discuss the 30 best PySpark Interview Questions:

Que 1. Explain PySpark in brief?

Ans. Spark is written in Scala, so to support Python with Spark, the Spark community released a tool that we call PySpark. With PySpark, we can also work with RDDs in the Python programming language. This is possible thanks to a library called Py4J.

Que 2. What are the main characteristics of (Py)Spark?

Ans. Some of the main characteristics of (Py)Spark are:

Que 3. Pros of PySpark?

Ans. Some of the benefits of using PySpark are:


Que 4. Cons of PySpark?

Ans. Some of the limitations of using PySpark are:

Que 5. Prerequisites to learn PySpark?

Ans. Before proceeding with the various concepts given in this tutorial, it is assumed that readers already know what a programming language and a framework are. Prior knowledge of Spark and Python is also very helpful.

Que 6. What do you mean by PySpark SparkContext?

Ans. In simple words, SparkContext is the entry point to any Spark functionality. When it comes to PySpark, SparkContext uses the Py4J library to launch a JVM and create a JavaSparkContext. By default, PySpark has SparkContext available as ‘sc’.
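
For instance, here is a minimal sketch of creating a SparkContext explicitly in a standalone script (the master URL "local[2]" and the application name "MyFirstApp" are illustrative):

from pyspark import SparkContext

# Create a SparkContext; the master URL and app name are illustrative.
sc = SparkContext("local[2]", "MyFirstApp")

# Use the context to build and act on an RDD.
rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.map(lambda x: x * x).collect())  # [1, 4, 9, 16]

sc.stop()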

Que 7. Explain PySpark SparkConf?

Ans. Mainly, we use SparkConf to set the configurations and parameters needed to run a Spark application locally or on a cluster. In other words, SparkConf offers the configuration needed to run a Spark application.

class pyspark.SparkConf (
  loadDefaults = True,
  _jvm = None,
  _jconf = None
)
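
A minimal sketch of building a SparkConf and passing it to a SparkContext (the app name, master URL, and memory setting below are illustrative):

from pyspark import SparkConf, SparkContext

# Build a configuration object; all values here are illustrative.
conf = (SparkConf()
        .setAppName("ConfigDemo")
        .setMaster("local[2]")
        .set("spark.executor.memory", "1g"))

# Hand the configuration to the SparkContext.
sc = SparkContext(conf=conf)
print(sc.getConf().get("spark.app.name"))  # ConfigDemo
sc.stop()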

Que 8. Tell us something about PySpark SparkFiles?

Ans. We can upload our files to Apache Spark using sc.addFile, where sc is our default SparkContext. We can then get the path to a file on a worker using SparkFiles.get. In other words, SparkFiles resolves the paths to files added through SparkContext.addFile().

It contains the following classmethods: get(filename) and getRootDirectory(), shown in the sketch below and explained in the next two questions.
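
A minimal sketch of the SparkFiles workflow (the file name "data.txt" is illustrative and must exist locally):

from pyspark import SparkContext, SparkFiles

sc = SparkContext("local", "SparkFilesDemo")

# Ship a local file to every node; "data.txt" is an illustrative path.
sc.addFile("data.txt")

def first_line(_):
    # On a worker, resolve the local path of the shipped file.
    with open(SparkFiles.get("data.txt")) as f:
        return f.readline()

print(sc.parallelize([0]).map(first_line).collect())
sc.stop()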

 

Que 9. Explain get(filename).

Ans. It helps to get the absolute path of a file added through SparkContext.addFile().

@classmethod
def get(cls, filename):
    path = os.path.join(SparkFiles.getRootDirectory(), filename)
    return os.path.abspath(path)

Que 10. Explain getrootdirectory().

Ans. It helps to get the root directory that contains the files added through SparkContext.addFile().

@classmethod
def getRootDirectory(cls):
    if cls._is_running_on_worker:
        return cls._root_directory
    else:
        # This will have to change if we support multiple SparkContexts:
        return cls._sc._jvm.org.apache.spark.SparkFiles.getRootDirectory()

PySpark Interview Questions for freshers – Q. 1,2,3,4,5,6,7,8

PySpark Interview Questions for experienced – Q. 9,10

Que 11. Explain PySpark StorageLevel in brief.

Ans. Basically, PySpark StorageLevel controls how an RDD should be stored: in memory, on disk, or both. It also controls whether the RDD should be serialized and whether its partitions should be replicated.

class pyspark.StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication = 1)
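
A minimal sketch of persisting an RDD with a chosen storage level:

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local", "StorageLevelDemo")
rdd = sc.parallelize(range(100))

# Keep the RDD in memory and spill to disk if it does not fit.
rdd.persist(StorageLevel.MEMORY_AND_DISK)

print(rdd.count())            # materializes and caches the RDD
print(rdd.getStorageLevel())  # shows the storage level in use
sc.stop()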

Que 12. Name different storage levels.

Ans. The different storage levels include DISK_ONLY, DISK_ONLY_2, MEMORY_AND_DISK, MEMORY_AND_DISK_2, MEMORY_ONLY, MEMORY_ONLY_2, and OFF_HEAP.

Que 13. What do you mean by Broadcast variables?

Ans. We use broadcast variables to save a copy of data across all nodes.
A broadcast variable is created with SparkContext.broadcast().
For example:

>>> from pyspark.context import SparkContext
>>> sc = SparkContext('local', 'test')
>>> b = sc.broadcast([1, 2, 3, 4, 5])
>>> b.value
[1, 2, 3, 4, 5]
>>> sc.parallelize([0, 0]).flatMap(lambda x: b.value).collect()
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
>>> b.unpersist()
>>> large_broadcast = sc.broadcast(range(10000))

Que 14. What are Accumulator variables?

Ans. We use accumulator variables to aggregate information through associative and commutative operations.

class pyspark.Accumulator(aid, value, accum_param)
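
A minimal sketch of summing values from the workers with an accumulator:

from pyspark import SparkContext

sc = SparkContext("local", "AccumulatorDemo")

# Create an accumulator with an initial value of 0.
total = sc.accumulator(0)

# foreach runs on the workers; each task adds to the shared accumulator.
sc.parallelize([1, 2, 3, 4, 5]).foreach(lambda x: total.add(x))

# Only the driver can read the aggregated value.
print(total.value)  # 15
sc.stop()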

Que 15. Explain AccumulatorParam?

Ans. AccumulatorParam is a helper object that defines how to accumulate values of a given type.

class AccumulatorParam(object):
    def zero(self, value):
        """
        Provide a "zero value" for the type, compatible in dimensions
        with the provided C{value} (e.g., a zero vector).
        """
        raise NotImplementedError

    def addInPlace(self, value1, value2):
        """
        Add two values of the accumulator's data type, returning a new value.
        """
        raise NotImplementedError
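
A minimal sketch of a custom AccumulatorParam for accumulating a two-element vector (the class name VectorAccumulatorParam is illustrative):

from pyspark import SparkContext
from pyspark.accumulators import AccumulatorParam

class VectorAccumulatorParam(AccumulatorParam):
    def zero(self, value):
        # A zero vector with the same dimensions as the initial value.
        return [0.0] * len(value)

    def addInPlace(self, value1, value2):
        # Element-wise addition of the two partial results.
        return [a + b for a, b in zip(value1, value2)]

sc = SparkContext("local", "AccumulatorParamDemo")
vec = sc.accumulator([0.0, 0.0], VectorAccumulatorParam())
sc.parallelize([[1.0, 2.0], [3.0, 4.0]]).foreach(lambda v: vec.add(v))
print(vec.value)  # [4.0, 6.0]
sc.stop()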

Que 16. Why do we need Serializers in PySpark?

Ans. For the purpose of performance tuning, PySpark supports custom serializers, such as MarshalSerializer and PickleSerializer (covered in the next two questions).
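
A minimal sketch of choosing a serializer when creating the SparkContext:

from pyspark import SparkContext
from pyspark.serializers import MarshalSerializer

# MarshalSerializer is faster but supports fewer data types than pickle.
sc = SparkContext("local", "SerializerDemo", serializer=MarshalSerializer())
print(sc.parallelize(range(10)).map(lambda x: 2 * x).take(5))  # [0, 2, 4, 6, 8]
sc.stop()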

 

Que 17. Explain Marshal Serializer?

Ans. MarshalSerializer serializes objects with the help of Python’s marshal module. It is faster than PickleSerializer, but it supports fewer data types.

class MarshalSerializer(FramedSerializer):
    def dumps(self, obj):
        return marshal.dumps(obj)

    def loads(self, obj):
        return marshal.loads(obj)

Que 18. Explain Pickle Serializer?

Ans. PickleSerializer serializes objects using Python’s pickle module. It supports nearly any Python object, but it is slower than more specialized serializers.

class PickleSerializer(FramedSerializer):
    def dumps(self, obj):
        return pickle.dumps(obj, protocol)

    if sys.version >= '3':
        def loads(self, obj, encoding="bytes"):
            return pickle.loads(obj, encoding=encoding)
    else:
        def loads(self, obj, encoding=None):
            return pickle.loads(obj)

Que 19. What do you mean by Status Tracker?

Ans. StatusTracker is a low-level status reporting API that helps monitor job and stage progress.

def __init__(self, jtracker):
    self._jtracker = jtracker
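
A minimal sketch of querying the status tracker from the driver:

from pyspark import SparkContext

sc = SparkContext("local", "StatusTrackerDemo")
tracker = sc.statusTracker()

# Inspect active jobs and stages (both lists are likely empty in this tiny example).
print(tracker.getActiveJobsIds())
print(tracker.getActiveStageIds())
sc.stop()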

Que 20. Explain SparkJobInfo?

Ans. SparkJobInfo exposes information about Spark jobs.

class SparkJobInfo(namedtuple("SparkJobInfo", "jobId stageIds status")):
    """Exposes information about Spark Jobs."""

PySpark Interview Questions for freshers – Q. 11,12,13,14,16,17,18,19

PySpark Interview Questions for experienced – Q. 15,20

Que 21. Explain SparkStageInfo?

Ans. SparkStageInfo exposes information about Spark stages.

class SparkStageInfo(namedtuple("SparkStageInfo",
                                "stageId currentAttemptId name numTasks numActiveTasks "
                                "numCompletedTasks numFailedTasks")):
    """Exposes information about Spark Stages."""

Que 22. Which Profilers do we use in PySpark?

Ans. PySpark supports custom profilers, which allow different profilers to be plugged in and output to be produced in formats other than those offered by the BasicProfiler.
With a custom profiler, we need to define or inherit the following methods:

  1. profile − produces a system profile of some sort.
  2. stats − returns the collected stats.
  3. dump − dumps the profiles to a path.
  4. add − adds a profile to the existing accumulated profile.

Generally, we choose the profiler class when we create a SparkContext, as shown in the sketch below.
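
A minimal sketch of plugging in a custom profiler (the class name MyCustomProfiler is illustrative); profiling must be enabled via spark.python.profile:

from pyspark import SparkConf, SparkContext, BasicProfiler

class MyCustomProfiler(BasicProfiler):
    def show(self, id):
        # Replace the default report with a custom one.
        print("Custom profile for RDD %s" % id)

conf = SparkConf().set("spark.python.profile", "true")
sc = SparkContext("local", "ProfilerDemo", conf=conf, profiler_cls=MyCustomProfiler)

sc.parallelize(range(1000)).map(lambda x: 2 * x).take(10)
sc.show_profiles()  # invokes MyCustomProfiler.show for each profiled RDD
sc.stop()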

Que 23. Explain Basic Profiler.

Ans. It is the default profiler, which is implemented on the basis of cProfile and Accumulator.

Que 24. Do we have a machine learning API in Python?

Ans. Yes. Just as Spark provides a Machine Learning API, MLlib, PySpark exposes this machine learning API in Python as well.

Que 25. Name algorithms supported in PySpark?

Ans. PySpark supports several families of algorithms through its MLlib modules, including mllib.classification, mllib.clustering, mllib.fpm (frequent pattern mining), mllib.linalg, mllib.recommendation, and mllib.regression.
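
For instance, a minimal sketch of training a clustering model with pyspark.mllib (the data points are illustrative):

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext("local", "MLlibDemo")

# Two obvious clusters of 2-D points (illustrative data).
data = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])

model = KMeans.train(data, k=2, maxIterations=10)
print(model.clusterCenters)
print(model.predict([0.5, 0.5]))  # cluster index of a new point
sc.stop()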

Que 26. Name the parameters of SparkContext.

Ans. The parameters of a SparkContext are master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, and profiler_cls.

Que 27. Which of the parameters of SparkContext do we use most?

Ans. master and appName.

Que 28. Name attributes of SparkConf.

Ans. The attributes of SparkConf are:

  1. set(key, value) − This attribute helps to set a configuration property.
  2. setMaster(value) − It helps to set the master URL.
  3. setAppName(value) − This helps to set an application name.
  4. get(key, defaultValue=None) − This attribute helps to get a configuration value of a key.
  5. setSparkHome(value) − It helps to set Spark installation path on worker nodes.

Que 29. Why Profiler?

Ans. Profilers help us ensure that applications do not waste any resources, and they help us spot problematic code.

Que 30. State Key Differences in the Python API.

Ans. The key differences between the Python and Scala APIs are:

PySpark Interview Questions for freshers – Q. 21,22,23,25,26,27,28,29

PySpark Interview Questions for experienced – Q. 24,30

So, this was all about PySpark Interview Questions. We hope you like our explanation.

Conclusion – PySpark Interview Questions

Hence, in this article on PySpark Interview Questions, we went through many questions and answers for the PySpark interview. These frequently asked PySpark interview questions will help both freshers and experienced candidates. Still, if you have any doubts regarding PySpark interview questions, ask in the comment section.
