This Apache Spark Interview Questions and Answers blog lists commonly asked and important Apache Spark interview questions and answers that you should prepare. Each question has a detailed answer, which will make you confident facing Apache Spark interviews, along with tips to crack the interview.
Before moving on to the interview questions, follow this guide to refresh your knowledge of Apache Spark.
List of Apache Spark Interview Questions and Answers
1) What is Apache Spark?
2) What are the features and characteristics of Apache Spark?
3) In which languages does Apache Spark provide APIs?
4) Compare Apache Hadoop and Apache Spark.
5) Can we run Apache Spark without Hadoop?
6) What are the benefits of Spark over MapReduce?
7) Why is Apache Spark faster than Hadoop MapReduce?
8) What are the drawbacks of Apache Spark?
9) Explain the processing speed difference between Hadoop and Apache Spark.
10) Explain various Apache Spark ecosystem components. In which scenarios can we use these components?
11) Explain Spark Core.
12) Define Spark-SQL.
13) How do we represent data in Spark?
14) What is a Resilient Distributed Dataset (RDD) in Apache Spark? How does it make Spark operator-rich?
15) What are the major features/characteristics of RDD (Resilient Distributed Datasets)?
16) How is RDD in Apache Spark different from Distributed Storage Management?
17) Explain transformation and action operations on Apache Spark RDDs.
18) How do you process data using transformation operations in Spark?
19) Briefly explain what an action is in Apache Spark. How is the final result generated using an action?
20) Compare transformations and actions in Apache Spark.
21) How do you identify whether a given operation is a transformation or an action?
22) What are the ways to create RDDs in Apache Spark? Explain.
23) Explain the benefits of lazy evaluation of RDDs in Apache Spark.
24) Why is a transformation a lazy operation on an Apache Spark RDD? How is it useful?
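The lazy-evaluation questions above can be illustrated with a plain-Python analogy (this is not the Spark API): a generator, like a Spark transformation, only records the computation, and nothing runs until a terminal operation, like a Spark action, forces evaluation.

```python
# Plain-Python analogy for Spark's lazy transformations (not the Spark API).
# The generator expression below, like rdd.map(), records the computation
# without executing it; sum(), like a Spark action, triggers the work.

log = []

def track(x):
    log.append(x)          # side effect so we can observe when work happens
    return x * 2

numbers = [1, 2, 3]
doubled = (track(n) for n in numbers)  # "transformation": nothing runs yet
assert log == []                       # no element has been processed so far

total = sum(doubled)                   # "action": evaluation happens here
assert log == [1, 2, 3]
assert total == 12
```

Spark gains the same benefit this analogy shows: because transformations are deferred, the engine can see the whole chain before running it and optimize or pipeline the steps.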
25) What is an RDD lineage graph? How does it enable fault tolerance in Spark?
26) What are the types of transformations on RDDs in Apache Spark?
27) What is the map() operation in Apache Spark?
28) Explain the flatMap() operation on an Apache Spark RDD.
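The map-versus-flatMap distinction asked about above can be sketched in plain Python (not the Spark API): map produces exactly one output element per input, while flatMap lets each input yield zero or more elements and flattens the results one level.

```python
# Plain-Python analogy for map() vs flatMap() semantics (not the Spark API).
from itertools import chain

lines = ["hello world", "hi"]

# map: one output element per input element (here, a list of words per line)
mapped = [line.split() for line in lines]
assert mapped == [["hello", "world"], ["hi"]]

# flatMap: each input may yield several elements, flattened into one level
flat_mapped = list(chain.from_iterable(line.split() for line in lines))
assert flat_mapped == ["hello", "world", "hi"]
```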
29) Describe the distinct(), union(), intersection() and subtract() transformations in Apache Spark RDD.
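The semantics of these four set-like transformations can be mimicked with plain Python collections (not the Spark API); note in particular that union keeps duplicates while intersection deduplicates.

```python
# Plain-Python analogy for distinct/union/intersection/subtract semantics
# (not the Spark API).
a = [1, 2, 2, 3]
b = [3, 4]

distinct = sorted(set(a))                     # like a.distinct()
union = a + b                                 # like a.union(b): keeps duplicates
intersection = sorted(set(a) & set(b))        # like a.intersection(b): deduplicated
subtract = [x for x in a if x not in set(b)]  # like a.subtract(b)

assert distinct == [1, 2, 3]
assert union == [1, 2, 2, 3, 3, 4]
assert intersection == [3]
assert subtract == [1, 2, 2]
```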
30) Explain the join() operation in Apache Spark.
31) Explain the leftOuterJoin() and rightOuterJoin() operations in Apache Spark.
32) Define the fold() operation in Apache Spark.
33) What are the exact differences between the reduce() and fold() operations in Spark?
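One difference between reduce() and fold() worth rehearsing is that fold() takes a zero value applied once per partition and once more when partition results are merged, which is why the zero value must be the identity element. A plain-Python sketch of that behavior (not the Spark API, partitions simulated as nested lists):

```python
# Plain-Python sketch of Spark-style fold() vs reduce() semantics
# (not the Spark API); partitions are simulated as nested lists.
from functools import reduce

partitions = [[1, 2], [3, 4, 5]]   # one RDD's data, split across 2 partitions

def spark_style_fold(parts, zero, op):
    per_partition = [reduce(op, part, zero) for part in parts]
    return reduce(op, per_partition, zero)   # zero applied again when merging

def spark_style_reduce(parts, op):
    per_partition = [reduce(op, part) for part in parts]  # no zero value
    return reduce(op, per_partition)

add = lambda x, y: x + y
assert spark_style_fold(partitions, 0, add) == 15
assert spark_style_reduce(partitions, add) == 15

# With a non-identity zero value, fold's result depends on partition count:
# zero is added once per partition (2x) plus once at the merge step (1x).
assert spark_style_fold(partitions, 1, add) == 18
```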
34) Explain the first() operation in Apache Spark.
35) Explain the coalesce() operation in Apache Spark.
36) How does the pipe() operation write its result to standard output in Apache Spark?
37) List the differences between textFile and wholeTextFiles in Apache Spark.
38) Define Partition and Partitioner in Apache Spark.
39) How many partitions are created by default in an Apache Spark RDD?
40) How is a single HDFS block split into RDD partitions?
41) Define a paired RDD in Apache Spark.
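A paired RDD is simply a dataset of (key, value) tuples, which unlocks by-key operations such as reduceByKey(). A plain-Python analogy of that aggregation (not the Spark API):

```python
# Plain-Python analogy for a paired-RDD operation like reduceByKey()
# (not the Spark API): the dataset is a list of (key, value) tuples.
pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]

def reduce_by_key(kv_pairs, op):
    acc = {}
    for key, value in kv_pairs:
        # combine with the running value for this key, or start a new one
        acc[key] = op(acc[key], value) if key in acc else value
    return acc

assert reduce_by_key(pairs, lambda x, y: x + y) == {"a": 4, "b": 6}
```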
42) What are the differences between the cache() and persist() methods in Apache Spark?
43) Define the run-time architecture of Spark.
44) What are the roles and responsibilities of worker nodes in an Apache Spark cluster? Is a worker node in Spark the same as a slave node?
45) Define the various running modes of Apache Spark.
46) What is Standalone mode in a Spark cluster?
47) Write the commands to start and stop Spark in an interactive shell.
48) Define SparkContext in Apache Spark.
49) Define SparkSession in Apache Spark. Why is it needed?
50) In what ways is SparkSession different from SparkContext?
51) List the advantages of DataFrame over RDD in Apache Spark.
52) Explain the createOrReplaceTempView() API.
53) What is the Catalyst query optimizer in Apache Spark?
54) What is a Dataset? What are its advantages over DataFrame and RDD?
55) What are the ways to run Spark over Hadoop?
56) Explain Apache Spark Streaming. How is the processing of streaming data achieved in Apache Spark?
57) What is a DStream?
58) Describe the different transformations on DStreams in Apache Spark Streaming.
59) Explain the write-ahead log (journaling) in Spark.
60) Define the level of parallelism and its need in Spark Streaming.
61) Define the Parquet file format. How do you convert data to Parquet format?
62) What are the common mistakes developers make when using Apache Spark?
63) What is speculative execution in Spark?
64) What are the various types of shared variables in Apache Spark?
65) What are broadcast variables?
66) Describe accumulators in detail in Apache Spark.
67) In what ways does Apache Spark handle accumulated metadata?
68) Define the role of the file system in any framework.
69) How do you parse XML data? Which class do you use in Java to parse XML?
70) List some commonly used Machine Learning algorithms in Apache Spark.
71) What is PageRank?
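PageRank is a classic Spark example: ranks and links are iterated over as distributed datasets. A minimal pure-Python sketch of the iteration itself (not the Spark implementation; the three-page graph is hypothetical, and every page is assumed to have at least one outgoing link):

```python
# Minimal pure-Python PageRank sketch (not the Spark implementation).
# Assumes every page has at least one outgoing link.

links = {                      # page -> pages it links to (hypothetical graph)
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}

ranks = {page: 1.0 for page in links}
damping = 0.85

for _ in range(20):            # iterate until ranks roughly converge
    contribs = {page: 0.0 for page in links}
    for page, neighbors in links.items():
        share = ranks[page] / len(neighbors)   # split rank among out-links
        for n in neighbors:
            contribs[n] += share
    ranks = {page: (1 - damping) + damping * c for page, c in contribs.items()}

# "c" receives links from both "a" and "b", so it should rank highest
assert ranks["c"] > ranks["a"] and ranks["c"] > ranks["b"]
```

In Spark, the same loop is expressed with a paired RDD of links joined against a paired RDD of ranks on each iteration.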