
Top 30 PySpark Interview Questions and Answers


In this PySpark article, we will go through the most frequently asked PySpark Interview Questions and Answers. These interview questions for PySpark will help both freshers and experienced candidates. Moreover, you will get a guide on how to crack a PySpark interview. Follow each link for a better understanding.

So, let’s start with the PySpark Interview Questions.

PySpark Interview Questions

Below, we discuss the 30 best PySpark Interview Questions:

Que 1. Explain PySpark in brief?

Ans. Spark is written in Scala, so to support Python with Spark, the Spark community released a tool that we call PySpark. With PySpark, we can also work with RDDs in the Python programming language. This is possible thanks to a library called Py4J.

Que 2. What are the main characteristics of (Py)Spark?

Ans. Some of the main characteristics of (Py)Spark are:

Que 3. Pros of PySpark?

Ans. Some of the benefits of using PySpark are:


Que 4. Cons of PySpark?

Ans. Some of the limitations of using PySpark are:

Que 5. Prerequisites to learn PySpark?

Ans. Before proceeding with the various concepts given in this tutorial, it is assumed that readers already know what a programming language and a framework are. Prior knowledge of Spark and Python is also very helpful.

Que 6. What do you mean by PySpark SparkContext?

Ans. In simple words, SparkContext is the entry point to any Spark functionality. When it comes to PySpark, SparkContext uses the Py4J library to launch a JVM and create a JavaSparkContext. By default, PySpark has SparkContext available as ‘sc’.
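
For instance, here is a minimal sketch of creating a SparkContext explicitly in a standalone script (the master URL "local[2]" and the application name "MyFirstApp" are illustrative):

from pyspark import SparkContext

# Create a SparkContext; the master URL and app name are illustrative.
sc = SparkContext("local[2]", "MyFirstApp")

# Use the context to build and act on an RDD.
rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.map(lambda x: x * x).collect())  # [1, 4, 9, 16]

sc.stop()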

Que 7. Explain PySpark SparkConf?

Ans. Mainly, we use SparkConf to set the configurations and parameters needed to run a Spark application locally or on a cluster. In other words, SparkConf offers the configuration needed to run a Spark application.

class pyspark.SparkConf (
  loadDefaults = True,
  _jvm = None,
  _jconf = None
)
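
A minimal sketch of building a SparkConf and passing it to a SparkContext (the app name, master URL, and memory setting below are illustrative):

from pyspark import SparkConf, SparkContext

# Build a configuration object; all values here are illustrative.
conf = (SparkConf()
        .setAppName("ConfigDemo")
        .setMaster("local[2]")
        .set("spark.executor.memory", "1g"))

# Hand the configuration to the SparkContext.
sc = SparkContext(conf=conf)
print(sc.getConf().get("spark.app.name"))  # ConfigDemo
sc.stop()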

Que 8. Tell us something about PySpark SparkFiles?

Ans. We can upload our files to Apache Spark using sc.addFile, where sc is our default SparkContext. We can then get the path to a file on a worker using SparkFiles.get. In other words, SparkFiles resolves the paths to files added through SparkContext.addFile().

It contains the following classmethods: get(filename) and getRootDirectory(), shown in the sketch below and explained in the next two questions.
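
A minimal sketch of the SparkFiles workflow (the file name "data.txt" is illustrative and must exist locally):

from pyspark import SparkContext, SparkFiles

sc = SparkContext("local", "SparkFilesDemo")

# Ship a local file to every node; "data.txt" is an illustrative path.
sc.addFile("data.txt")

def first_line(_):
    # On a worker, resolve the local path of the shipped file.
    with open(SparkFiles.get("data.txt")) as f:
        return f.readline()

print(sc.parallelize([0]).map(first_line).collect())
sc.stop()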

 

Que 9. Explain get(filename).

Ans. It helps to get the absolute path of a file added through SparkContext.addFile().

@classmethod
def get(cls, filename):
    path = os.path.join(SparkFiles.getRootDirectory(), filename)
    return os.path.abspath(path)

Que 10. Explain getrootdirectory().

Ans. It helps to get the root directory that contains the files added through SparkContext.addFile().

@classmethod
def getRootDirectory(cls):
    if cls._is_running_on_worker:
        return cls._root_directory
    else:
        # This will have to change if we support multiple SparkContexts:
        return cls._sc._jvm.org.apache.spark.SparkFiles.getRootDirectory()

PySpark Interview Questions for freshers – Q. 1,2,3,4,5,6,7,8

PySpark Interview Questions for experienced – Q. 9,10

Que 11. Explain PySpark StorageLevel in brief.

Ans. Basically, PySpark StorageLevel controls how an RDD should be stored: in memory, on disk, or both. It also controls whether the RDD should be serialized and whether its partitions should be replicated.

class pyspark.StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication = 1)
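
A minimal sketch of persisting an RDD with a chosen storage level:

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local", "StorageLevelDemo")
rdd = sc.parallelize(range(100))

# Keep the RDD in memory and spill to disk if it does not fit.
rdd.persist(StorageLevel.MEMORY_AND_DISK)

print(rdd.count())            # materializes and caches the RDD
print(rdd.getStorageLevel())  # shows the storage level in use
sc.stop()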

Que 12. Name different storage levels.

Ans. The different storage levels include DISK_ONLY, DISK_ONLY_2, MEMORY_AND_DISK, MEMORY_AND_DISK_2, MEMORY_ONLY, MEMORY_ONLY_2, and OFF_HEAP.

Que 13. What do you mean by Broadcast variables?

Ans. We use broadcast variables to save a copy of data across all nodes.
A broadcast variable is created with SparkContext.broadcast().
For example:

>>> from pyspark.context import SparkContext
>>> sc = SparkContext('local', 'test')
>>> b = sc.broadcast([1, 2, 3, 4, 5])
>>> b.value
[1, 2, 3, 4, 5]
>>> sc.parallelize([0, 0]).flatMap(lambda x: b.value).collect()
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
>>> b.unpersist()
>>> large_broadcast = sc.broadcast(range(10000))

Que 14. What are Accumulator variables?

Ans. We use accumulator variables to aggregate information through associative and commutative operations.

class pyspark.Accumulator(aid, value, accum_param)
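
A minimal sketch of summing values from the workers with an accumulator:

from pyspark import SparkContext

sc = SparkContext("local", "AccumulatorDemo")

# Create an accumulator with an initial value of 0.
total = sc.accumulator(0)

# foreach runs on the workers; each task adds to the shared accumulator.
sc.parallelize([1, 2, 3, 4, 5]).foreach(lambda x: total.add(x))

# Only the driver can read the aggregated value.
print(total.value)  # 15
sc.stop()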

Que 15. Explain AccumulatorParam?

Ans. AccumulatorParam is a helper object that defines how to accumulate values of a given type.

class AccumulatorParam(object):
    def zero(self, value):
        """
        Provide a "zero value" for the type, compatible in dimensions
        with the provided C{value} (e.g., a zero vector).
        """
        raise NotImplementedError

    def addInPlace(self, value1, value2):
        """
        Add two values of the accumulator's data type, returning a new value.
        """
        raise NotImplementedError
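
A minimal sketch of a custom AccumulatorParam for accumulating a two-element vector (the class name VectorAccumulatorParam is illustrative):

from pyspark import SparkContext
from pyspark.accumulators import AccumulatorParam

class VectorAccumulatorParam(AccumulatorParam):
    def zero(self, value):
        # A zero vector with the same dimensions as the initial value.
        return [0.0] * len(value)

    def addInPlace(self, value1, value2):
        # Element-wise addition of the two partial results.
        return [a + b for a, b in zip(value1, value2)]

sc = SparkContext("local", "AccumulatorParamDemo")
vec = sc.accumulator([0.0, 0.0], VectorAccumulatorParam())
sc.parallelize([[1.0, 2.0], [3.0, 4.0]]).foreach(lambda v: vec.add(v))
print(vec.value)  # [4.0, 6.0]
sc.stop()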

Que 16. Why do we need Serializers in PySpark?

Ans. For the purpose of performance tuning, PySpark supports custom serializers, such as MarshalSerializer and PickleSerializer (covered in the next two questions).
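
A minimal sketch of choosing a serializer when creating the SparkContext:

from pyspark import SparkContext
from pyspark.serializers import MarshalSerializer

# MarshalSerializer is faster but supports fewer data types than pickle.
sc = SparkContext("local", "SerializerDemo", serializer=MarshalSerializer())
print(sc.parallelize(range(10)).map(lambda x: 2 * x).take(5))  # [0, 2, 4, 6, 8]
sc.stop()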

 

Que 17. Explain Marshal Serializer?

Ans. MarshalSerializer serializes objects with the help of Python’s marshal module. It is faster than PickleSerializer, but it supports fewer data types.

class MarshalSerializer(FramedSerializer):
    def dumps(self, obj):
        return marshal.dumps(obj)

    def loads(self, obj):
        return marshal.loads(obj)

Que 18. Explain Pickle Serializer?

Ans. PickleSerializer serializes objects using Python’s pickle module. It supports nearly any Python object, but it is slower than more specialized serializers.

class PickleSerializer(FramedSerializer):
    def dumps(self, obj):
        return pickle.dumps(obj, protocol)

    if sys.version >= '3':
        def loads(self, obj, encoding="bytes"):
            return pickle.loads(obj, encoding=encoding)
    else:
        def loads(self, obj, encoding=None):
            return pickle.loads(obj)

Que 19. What do you mean by Status Tracker?

Ans. StatusTracker is a low-level status reporting API that helps monitor job and stage progress.

def __init__(self, jtracker):
    self._jtracker = jtracker
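
A minimal sketch of querying the status tracker from the driver:

from pyspark import SparkContext

sc = SparkContext("local", "StatusTrackerDemo")
tracker = sc.statusTracker()

# Inspect active jobs and stages (both lists are likely empty in this tiny example).
print(tracker.getActiveJobsIds())
print(tracker.getActiveStageIds())
sc.stop()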

Que 20. Explain SparkJobInfo?

Ans. SparkJobInfo exposes information about Spark jobs.

class SparkJobInfo(namedtuple("SparkJobInfo", "jobId stageIds status")):
    """Exposes information about Spark Jobs."""

PySpark Interview Questions for freshers – Q. 11,12,13,14,16,17,18,19

PySpark Interview Questions for experienced – Q. 15,20

Que 21. Explain SparkStageInfo?

Ans. SparkStageInfo exposes information about Spark stages.

class SparkStageInfo(namedtuple("SparkStageInfo",
                                "stageId currentAttemptId name numTasks numActiveTasks "
                                "numCompletedTasks numFailedTasks")):
    """Exposes information about Spark Stages."""

Que 22. Which Profilers do we use in PySpark?

Ans. PySpark supports custom profilers, which allow different profilers to be plugged in and output to be produced in formats other than those offered by the BasicProfiler.
With a custom profiler, we need to define or inherit the following methods:

  1. profile − produces a system profile of some sort.
  2. stats − returns the collected stats.
  3. dump − dumps the profiles to a path.
  4. add − adds a profile to the existing accumulated profile.

Generally, we choose the profiler class when we create a SparkContext, as shown in the sketch below.
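
A minimal sketch of plugging in a custom profiler (the class name MyCustomProfiler is illustrative); profiling must be enabled via spark.python.profile:

from pyspark import SparkConf, SparkContext, BasicProfiler

class MyCustomProfiler(BasicProfiler):
    def show(self, id):
        # Replace the default report with a custom one.
        print("Custom profile for RDD %s" % id)

conf = SparkConf().set("spark.python.profile", "true")
sc = SparkContext("local", "ProfilerDemo", conf=conf, profiler_cls=MyCustomProfiler)

sc.parallelize(range(1000)).map(lambda x: 2 * x).take(10)
sc.show_profiles()  # invokes MyCustomProfiler.show for each profiled RDD
sc.stop()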

Que 23. Explain Basic Profiler.

Ans. It is the default profiler, which is implemented on the basis of cProfile and Accumulator.

Que 24. Do we have a machine learning API in Python?

Ans. Yes. Just as Spark provides a Machine Learning API, MLlib, PySpark exposes this machine learning API in Python as well.

Que 25. Name algorithms supported in PySpark?

Ans. PySpark supports several families of algorithms through its MLlib modules, including mllib.classification, mllib.clustering, mllib.fpm (frequent pattern mining), mllib.linalg, mllib.recommendation, and mllib.regression.
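
For instance, a minimal sketch of training a clustering model with pyspark.mllib (the data points are illustrative):

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext("local", "MLlibDemo")

# Two obvious clusters of 2-D points (illustrative data).
data = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])

model = KMeans.train(data, k=2, maxIterations=10)
print(model.clusterCenters)
print(model.predict([0.5, 0.5]))  # cluster index of a new point
sc.stop()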

Que 26. Name the parameters of SparkContext.

Ans. The parameters of a SparkContext are master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, and profiler_cls.

Que 27. Which of the parameters of SparkContext do we use most?

Ans. master and appName.

Que 28. Name attributes of SparkConf.

Ans. The attributes of SparkConf are:

  1. set(key, value) − This attribute helps to set a configuration property.
  2. setMaster(value) − It helps to set the master URL.
  3. setAppName(value) − This helps to set an application name.
  4. get(key, defaultValue=None) − This attribute helps to get a configuration value of a key.
  5. setSparkHome(value) − It helps to set Spark installation path on worker nodes.

Que 29. Why Profiler?

Ans. Profilers help us ensure that applications do not waste any resources, and they help us spot problematic code.

Que 30. State Key Differences in the Python API.

Ans. The key differences between the Python and Scala APIs are:

PySpark Interview Questions for freshers – Q. 21,22,23,25,26,27,28,29

PySpark Interview Questions for experienced – Q. 24,30

So, this was all about PySpark Interview Questions. We hope you like our explanation.

Conclusion – PySpark Interview Questions

Hence, in this article on PySpark Interview Questions, we went through many questions and answers for the PySpark interview. These frequently asked PySpark interview questions will help both freshers and experienced candidates. Still, if you have any doubts regarding PySpark interview questions, ask in the comment section.
