
Top 50+ Apache Spark Interview Questions and Answers

1. Best Spark Interview Questions and Answers

Big Data is an umbrella term for technologies that keep evolving over time, and Apache Hadoop and Apache Spark are the leading frameworks for dealing with it. Big Data revenue is growing rapidly, so to become a part of the Big Data industry, I hope these Top 50+ Apache Spark Interview Questions and Answers will help you get an edge in the Big Data market.


As a Big Data professional, it is necessary to know the right buzzwords and the right answers to frequently asked Spark interview questions. The Apache Spark interview questions and answers below revolve around the concepts of Spark Core, Spark Streaming, Spark SQL, GraphX, and MLlib.
Hope this blog acts as a gateway to your Spark job.

So, let’s explore top Spark Interview Questions and Answers.

2. Top Apache Spark Interview Questions and Answers 

Here we are going to discuss the list of Spark interview questions and answers. We have classified these Apache Spark Interview Questions according to Spark ecosystem components-

a. Apache Spark Basic Interview Questions and Answers

Here are some Frequently asked Spark Interview Questions and Answers for freshers and experienced.

Q.1 What is Apache Spark?

Apache Spark is an open-source, general-purpose data processing engine with high-level APIs. It allows data workers to execute streaming, machine learning, or SQL workloads that need fast, iterative access to datasets. Spark provides APIs in several languages, including Python, R, Scala, and Java. We can run Spark on its own or on an existing cluster manager; the deployment options include Standalone mode, Apache Mesos, and Hadoop YARN.
Apache Spark is designed to integrate with the rest of the Big Data ecosystem. For example, Spark can access data from any Hadoop data source and can run in a Hadoop cluster. Spark does not have its own storage system; it relies on HDFS or other storage systems for storing data.

Read about Apache Spark in detail.

Q.2 Why did Spark come into existence?

To overcome the drawbacks of Apache Hadoop, Spark came into the picture. Some of the drawbacks of Hadoop that Apache Spark overcomes are:

Read about Hadoop Limitations in detail.

Q.3 What are the features of Spark?

Some of the features of Apache Spark are:

Read more Apache Spark Features in detail.

Q.4 What are the limitations of Spark?

Read more Apache Spark Limitations in detail.

Q.5 List the languages supported by Apache Spark.

Apache Spark supports the following languages: Scala, Java, R, and Python.

Q.6 What are the cases where Apache Spark surpasses Hadoop?

Apache Spark processes data faster because it supports in-memory computation, which can make workloads up to about 100x faster in memory (and around 10x faster on disk) compared to Hadoop MapReduce. Apache Spark also supports several languages for distributed application development.
On top of Spark Core, various libraries are available that enable workloads using streaming, SQL, graph processing, and machine learning. Some of these workloads are also supported by Hadoop, but Spark eases development by letting them be combined in the same application. For near real-time processing, Apache Spark adopts micro-batching.

Q.7 Compare Hadoop and Spark.

Spark Interview Questions and Answers – Spark Ecosystem

Read Hadoop vs Spark in detail.

Q.8 What are the components of Spark Ecosystem?

The various components of the Apache Spark ecosystem are Spark Core, Spark SQL, Spark Streaming, MLlib (machine learning), GraphX (graph processing), and SparkR.

Read Apache Spark Ecosystem Components in detail.

Q.9 What is Spark Core?

Spark Core is the common execution engine of the Spark platform. It provides parallel and distributed processing of large datasets, and all other Spark components are built on top of it. Spark Core delivers speed through in-memory computation and, for ease of development, offers Java, Scala, and Python APIs.
RDD is the basic data structure of Spark Core. RDDs are immutable, partitioned collections of records that can be operated on in parallel. We can create RDDs by applying transformations on existing RDDs or by loading an external dataset from stable storage such as HDFS or HBase.

Q.10 How is data represented in Spark?
The data can be represented in three ways in Apache Spark: RDD, DataFrame, DataSet.

RDD: RDD stands for Resilient Distributed Dataset. It is a read-only, partitioned collection of records and the fundamental data structure of Spark. RDDs can only be created through deterministic operations on either data in stable storage or other RDDs.

DataFrame: Unlike an RDD, the data is organized into named columns, like a table in a relational database. It is also an immutable distributed collection of data. A DataFrame allows developers to impose a structure onto a distributed collection of data, allowing a higher-level abstraction.

DataSet: A Dataset is an extension of the DataFrame API that provides a type-safe, object-oriented programming interface. Datasets also take advantage of Spark's Catalyst optimizer by exposing expressions and data fields to the query planner.

Q.11 What are the abstractions of Apache Spark?

The main abstraction provided by Apache Spark is the Resilient Distributed Dataset (RDD). RDDs are fault tolerant and immutable: once created, they cannot be modified. RDD creation starts with a file in a file system such as HDFS, which is then transformed. Shared variables are the second abstraction provided by Apache Spark; we can use them in parallel operations.
Read about Apache Spark RDD in detail.

Q.12 Explain the operations of Apache Spark RDD.

Apache Spark RDD supports two types of operations: transformations and actions. Transformations are lazy operations that produce a new RDD from an existing one, while actions trigger the computation and return a value to the driver program or write data to external storage.

Read about RDD transformations and Actions in detail.
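As a minimal illustration, here is a Scala sketch (assuming a live SparkContext named sc, for example in spark-shell):

val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))
val doubled = nums.map(_ * 2)       // transformation: lazy, only records the lineage
val sum = doubled.reduce(_ + _)     // action: triggers the actual computation and returns 30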
Q.13 How many types of Transformation are there?

There are two types of transformations: narrow transformations, where each output partition depends on a single input partition (for example map and filter), and wide transformations, where an output partition may depend on many input partitions and a shuffle is required (for example reduceByKey and groupByKey).

Apache Spark Narrow and Wide Transformations
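A small sketch of the difference, assuming a SparkContext named sc:

val words = sc.parallelize(Seq("a", "b", "a", "c"))
val pairs = words.map(w => (w, 1))        // narrow: each output partition depends on one input partition
val counts = pairs.reduceByKey(_ + _)     // wide: records with the same key are shuffled to the same partition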

Q.14 In how many ways can RDDs be created? Explain.

There are three ways to create an RDD: by parallelizing an existing collection in the driver program, by loading an external dataset from stable storage (for example HDFS or HBase), or by applying a transformation on an existing RDD. A short sketch of each approach follows the link below.

Read the ways to create RDD in Spark in detail.
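A Scala sketch of these approaches, assuming a SparkContext named sc; the HDFS path is purely illustrative:

val fromCollection = sc.parallelize(1 to 100)              // 1. parallelize an existing collection
val fromFile = sc.textFile("hdfs:///data/input.txt")       // 2. load an external dataset (hypothetical path)
val fromExisting = fromCollection.filter(_ % 2 == 0)       // 3. transform an existing RDD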
Q.15 What are paired RDDs?

Paired RDDs are RDDs that contain key-value pairs. A key-value pair (KVP) holds two linked data items: the key is the identifier, and the value is the data corresponding to that key.
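For example, a key-based join is only available on paired RDDs; a small sketch assuming a SparkContext named sc:

val prices = sc.parallelize(Seq(("apple", 1.0), ("banana", 0.5)))
val stock  = sc.parallelize(Seq(("apple", 10), ("banana", 20)))
val joined = prices.join(stock)   // joined has type RDD[(String, (Double, Int))]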

Q.16 What is meant by in-memory processing in Spark?

In in-memory computation, we keep data in random access memory instead of on slow disk drives and process it in parallel. Spark's in-memory capabilities let us analyze large datasets and identify patterns faster, because retrieving data from memory instead of disk increases processing speed and reduces execution time; keeping data in memory can improve performance by an order of magnitude or more.
The main abstraction of Spark is the RDD, and we can cache an RDD using the cache() or persist() method. With cache() the RDD is kept in memory only. The difference between cache() and persist() is the default storage level: for cache() it is MEMORY_ONLY, while persist() lets you choose among the storage levels listed under Q.23.

Read Spark in-memory processing in detail.
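A minimal sketch of caching an RDD for reuse (assuming a SparkContext named sc; the path is hypothetical):

val logs = sc.textFile("hdfs:///logs/app.log")
val errors = logs.filter(_.contains("ERROR"))
errors.cache()                                  // MEMORY_ONLY by default
errors.count()                                  // first action computes and caches the partitions
errors.filter(_.contains("timeout")).count()    // reuses the in-memory partitions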

Q.17 How is fault tolerance achieved in Apache Spark?

Apache Spark is fault tolerant by design: every RDD records the lineage of deterministic transformations used to build it, so a lost partition can always be recomputed from the original data.

In Spark Streaming, the received data is additionally replicated among multiple Spark executors on worker nodes in the cluster. This results in two types of data that may need to be recovered in the event of failure: data that was received and already replicated, and data that was received but only buffered for replication.

Failures can also occur in worker and driver nodes.

Spark Interview Questions and Answers – Lazy Evaluation Feature

Read about Spark Fault tolerance in detail.
Q.18 What is Directed Acyclic Graph(DAG)?

RDDs are formed after every transformation. At a high level, when we apply an action on these RDDs, Spark creates a DAG, i.e. a finite directed graph with no directed cycles.

A DAG has vertices and edges, where each edge is directed from one vertex to another, and the vertices can be ordered so that every edge points from an earlier vertex to a later one in the sequence. The DAG execution model is a strict generalization of the MapReduce model. The DAG visualization lets you drill down into any stage and expand its details.

In the stage view, the details of all RDDs that belong to that stage are expanded.
Read DAG in Apache Spark in detail.
Q.19 What is lineage graph?

A lineage graph is the graph of all the parent RDDs of an RDD. It is built as the result of applying transformations to RDDs and forms a logical execution plan.

A logical execution plan starts with the earliest RDDs, which do not depend on any other RDD, and ends at the RDD that produces the result of the action that has been called.

Q.20 What is lazy evaluation in Spark?

Lazy evaluation, also known as call-by-need, is a strategy that delays execution until a value is required. Transformations in Spark are lazy: when we call an operation on an RDD, it does not execute immediately. Spark only maintains a graph of the requested operations, and the computation is triggered when an action is called on the data. The data is not loaded until it is necessary.

Read about Spark Lazy Evaluation in detail.
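A small sketch showing that nothing runs until an action is called (assuming a SparkContext named sc; the path is hypothetical):

val events = sc.textFile("hdfs:///data/events.log")      // nothing is read yet
val errors = events.filter(_.startsWith("ERROR"))        // still lazy: only the lineage is recorded
println(errors.count())                                  // action: the file is read and the filter runs now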

Q.21 What are the benefits of lazy evaluation?

Using lazy evaluation, Spark can optimize the whole chain of transformations before executing anything, avoid unnecessary computation, reduce the amount of data that is loaded and shuffled, and make programs easier to manage.

Q.22 What do you mean by Persistence?

RDD persistence is an optimization technique that saves the result of an RDD evaluation so the intermediate result can be reused, reducing computation overhead. We can persist an RDD through the cache() and persist() methods. Persistence is a key tool for iterative algorithms: when an RDD is persisted, each node keeps the partitions it computes in memory and reuses them in later actions on that RDD, which can speed up subsequent computations by ten times or more.
Read about RDD Persistence and Caching Mechanism in detail.

Q.23 Explain the various levels of persistence in Apache Spark.

The persist() method allows seven storage levels:
1. MEMORY_ONLY (the default, also used by cache())
2. MEMORY_AND_DISK
3. MEMORY_ONLY_SER (objects stored in serialized form in memory)
4. MEMORY_AND_DISK_SER
5. DISK_ONLY
6. MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. (same as the levels above, but with each partition replicated on two nodes)
7. OFF_HEAP
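For illustration, a short sketch of choosing an explicit storage level (assuming a SparkContext named sc; the path is hypothetical):

import org.apache.spark.storage.StorageLevel

val data = sc.textFile("hdfs:///data/input.txt")
data.persist(StorageLevel.MEMORY_AND_DISK_SER)       // serialized in memory, spilling to disk when needed
data.count()                                         // materializes and persists the partitions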

Q.24 Explain the run-time architecture of Spark?

The components of the run-time architecture of Spark are as follows:
a. The driver
b. Cluster manager
c. Executors

The Driver – The main() method of the program runs in the driver. The driver is the process that runs the user code, creates RDDs, performs transformations and actions, and creates the SparkContext. Launching the Spark shell also creates a driver program, and the application finishes when the driver terminates. Finally, the driver splits the Spark application into tasks and schedules them to run on the executors.

Cluster Manager – Spark depends on the cluster manager to launch executors, and in some cases even the driver is launched by it. The cluster manager is a pluggable component of Spark. On the cluster manager, the Spark scheduler schedules the jobs and actions within a Spark application in FIFO fashion; alternatively, scheduling can be done in round-robin fashion. The resources used by a Spark application can also be adjusted dynamically based on the workload, so the application can free unused resources and request them again when there is demand. This is available on all coarse-grained cluster managers, i.e. standalone mode, YARN mode, and Mesos coarse-grained mode.

The Executors – Each task in a Spark job runs in an executor. Executors are launched once at the beginning of the Spark application and run for its entire lifetime; the application can continue even if an executor fails.
The executors have two main roles: they run the tasks that make up the application and return the results to the driver, and they provide in-memory storage for the RDDs that user programs cache.

Q.25 Explain the various cluster managers in Apache Spark.

The cluster managers supported by Apache Spark are Standalone, Hadoop YARN, and Apache Mesos.

Spark Interview Questions and Answers – Hadoop Compatibility

Read about Spark Cluster Managers in detail.

Q.26 In how many ways can we use Spark over Hadoop?

We can run Spark over Hadoop in three ways: in Standalone mode, on YARN, and through SIMR (Spark in MapReduce).

Read about Spark Hadoop Compatibility in detail.

Q.27 What is YARN?

YARN became a sub-project of Hadoop in 2012 and is also known as MapReduce 2.0. The key idea behind YARN is to split the functionality of resource management and job scheduling into separate daemons. The plan is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs.
The data-computation framework is formed by the ResourceManager and the NodeManagers.

The ResourceManager arbitrates resources among all the applications in the system. It has two main components: the Scheduler and the ApplicationsManager. The Scheduler allocates resources to the running applications; it is a pure scheduler in the sense that it performs no monitoring or tracking of application status. The ApplicationsManager manages applications across all the nodes. The NodeManager hosts containers, and a container is the place where a unit of work happens; each MapReduce task, for example, runs in one container. The per-application ApplicationMaster is a framework-specific library that negotiates resources from the ResourceManager and works with the NodeManager(s) to execute and monitor the tasks. An application or job requires one or more containers. The NodeManager looks after the containers, monitors their resource usage (CPU, memory, disk, and network), and reports it to the ResourceManager.

Read about YARN in detail.

Q.28 How can we launch Spark application on YARN?

There are two deployment modes for launching a Spark application on YARN: cluster mode, where the driver runs inside an ApplicationMaster process managed by YARN on the cluster, and client mode, where the driver runs in the client process that submits the application.
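For illustration, typical spark-submit invocations for the two modes (the class and jar names are hypothetical):

# driver runs inside an ApplicationMaster on the cluster
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar
# driver runs in the submitting client process
spark-submit --master yarn --deploy-mode client --class com.example.MyApp my-app.jar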

Q.29 Define Partition in Apache Spark.

A partition is a logical chunk of a large distributed dataset. Logically partitioning the data and distributing it over the cluster provides parallelism and minimizes the network traffic needed to send data between executors. Partitioning determines how the work is spread across the hardware resources during job execution. RDDs are automatically partitioned in Spark, and we can change the number and size of partitions.

Read Spark Catalyst Optimizer in detail.

Q.30 What are shared variables?
Shared variables are one of the abstractions of Apache Spark. Shared variables can be used in parallel operations.

Normally, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a separate copy of each variable used in the function to each task. Sometimes, however, a variable needs to be shared across tasks, or between the tasks and the driver program.

Apache Spark supports two types of shared variables: broadcast variables and accumulators.
Broadcast variables cache a value in memory on all nodes, while accumulators are variables that are only added to, such as counters and sums.
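A minimal sketch of a broadcast variable (assuming a SparkContext named sc; the lookup table is made up):

val countryNames = sc.broadcast(Map("US" -> "United States", "IN" -> "India"))   // cached once per node
val codes = sc.parallelize(Seq("US", "IN", "US"))
val names = codes.map(c => countryNames.value.getOrElse(c, "unknown")).collect()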

Q.31 What is Accumulator?

An accumulator is a type of shared variable that is only added to through associative and commutative operations. Using accumulators we can update the value of a variable while tasks are executing, for example to implement counters (as in MapReduce) or sums. Users can create named or unnamed accumulators, and numeric accumulators can be created by calling SparkContext.longAccumulator() or SparkContext.doubleAccumulator() for Long or Double values.
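A small sketch of a named long accumulator (assuming a SparkContext named sc; the data is made up):

val badRecords = sc.longAccumulator("badRecords")
sc.parallelize(Seq("1", "two", "3")).foreach { s =>
  if (scala.util.Try(s.toInt).isFailure) badRecords.add(1)   // updated on the executors
}
println(badRecords.value)                                    // read on the driver: 1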

Q.32 What is the difference between DSM and RDD?

a) Read: In an RDD, reads can be coarse-grained or fine-grained; in distributed shared memory (DSM), reads are fine-grained.

b) Write: Writes to RDDs are coarse-grained, applied through transformations, whereas DSM allows fine-grained writes to individual memory locations.

c) Consistency: RDDs are immutable, so consistency is trivial; in DSM, consistency is the responsibility of the application or programmer.

d) Fault-recovery mechanism: Lost RDD partitions are recomputed from lineage without the cost of full replication; DSM typically requires checkpointing and rollback.

e) Straggler mitigation: Stragglers, in general, are tasks that take more time to complete than their peers. With RDDs, stragglers can be mitigated by running speculative backup copies of slow tasks; this is difficult to do with DSM.

f) Behavior if not enough RAM: RDDs degrade gracefully by spilling partitions that do not fit in memory to disk; with DSM, performance degrades substantially because of swapping.

Q.33 How can data transfer be minimized when working with Apache Spark?

We can increase performance by minimizing data transfer and avoiding the shuffling of data. In Apache Spark, data transfer can be minimized in three ways: by using broadcast variables, so large read-only data is shipped to each node only once instead of with every task; by using accumulators, so values are aggregated on the executors instead of being collected back to the driver; and by avoiding operations that trigger shuffles, such as repartition and the ByKey operations.

Q.34 How does Apache Spark handle accumulated metadata?

Spark handles accumulated metadata by triggering automatic cleanup. The cleanup is controlled by the parameter spark.cleaner.ttl, whose default value is infinite. It defines how long Spark remembers metadata; the periodic cleaner ensures that metadata older than this duration is removed. With its help, we can run Spark jobs for many hours or days.

Q.35 What are the common faults of the developer while using Apache Spark?

Some common mistakes made by developers while using Apache Spark are:

Q.36 Which among the two is preferable for the project- Hadoop MapReduce or Apache Spark?

The answer to this question depends on the type of project. As we all know, Spark makes use of a large amount of RAM and needs dedicated machines to produce effective results, so the choice depends on the project and the budget of the organization.

Q.37 List the popular use cases of Apache Spark.

The most popular use cases of Apache Spark are:
1. Streaming
2. Machine learning
3. Interactive analysis
4. Fog computing
5. Using Spark in the real world

Q.38 What is spark.executor.memory in a Spark application?

It sets the amount of memory used per executor process. The default value is 1 GB.
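For example, the value can be overridden when building the configuration; a small sketch (the application name and memory size are arbitrary):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("memory-demo")
  .set("spark.executor.memory", "2g")   // per-executor memory instead of the 1 GB default
val sc = new SparkContext(conf)

The same value can also be passed to spark-submit with --executor-memory 2g.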
We have categorized the above Spark Interview Questions and Answers for Freshers and Experienced-

Follow this link to read more Spark Basic interview Questions with Answers.

b. Spark SQL Interview Questions and Answers

In this section, we will discuss some basic Spark SQL Interview Questions and Answers.

Q.39 What are DataFrames?

A DataFrame is a collection of data organized into named columns. It is conceptually equivalent to a table in a relational database, but more optimized. Like RDDs, DataFrames are evaluated lazily, which lets Spark optimize their execution using techniques such as bytecode generation and predicate push-down.

Read about Spark DataFrame in detail.
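A minimal DataFrame sketch (Spark 2.x; the JSON path is hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("df-demo").getOrCreate()
val people = spark.read.json("hdfs:///data/people.json")
people.filter(people("age") > 21).select("name").show()   // built lazily, optimized by Catalyst, executed by show()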

Q.40 What are the advantages of DataFrame?

  1. It makes processing of large datasets even easier. A DataFrame also allows developers to impose a structure onto a distributed collection of data, allowing a higher-level abstraction.
  2. A DataFrame is both space and performance efficient.
  3. It can deal with structured and semi-structured data formats, for example Avro, CSV, etc., and storage systems such as HDFS, Hive tables, MySQL, etc.
  4. The DataFrame APIs are available in various programming languages, for example Java, Scala, Python, and R.
  5. It provides Hive compatibility, so we can run unmodified Hive queries on an existing Hive warehouse.
  6. The Catalyst optimizer processes DataFrame queries through tree transformations in four phases: a) analyzing the logical plan to resolve references, b) logical plan optimization, c) physical planning, and d) code generation to compile parts of the query to Java bytecode.
  7. It can scale from kilobytes of data on a single laptop to petabytes of data on a large cluster.

Q.41 What is DataSet?

Spark Datasets are an extension of the DataFrame API that provides a type-safe, object-oriented programming interface. Datasets were introduced in the Spark 1.6 release. They take advantage of Spark's Catalyst optimizer by exposing expressions and data fields to the query planner, and they benefit from fast in-memory encoding via encoders. Because Datasets provide compile-time type safety, many errors can be caught when the application is compiled rather than when it runs.
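A short sketch of the typed Dataset API (assuming a SparkSession named spark, e.g. in spark-shell):

import spark.implicits._

case class Person(name: String, age: Long)
val ds = Seq(Person("Ann", 30), Person("Bob", 25)).toDS()
ds.filter(_.age > 26).show()   // the lambda is type-checked against Person at compile time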

Q.42 What are the advantages of DataSets?

Q.43 Explain Catalyst framework.

Catalyst is the framework that represents and manipulates a DataFrame's query plan as a dataflow graph, i.e. a tree of relational operators and expressions. The three main features of Catalyst are:

Catalyst builds query plans as trees of TreeNode objects and applies optimizations to them. The Catalyst optimizer supports both rule-based and cost-based optimization. In rule-based optimization, the optimizer uses a set of rules to determine how to execute the query, while in cost-based optimization many candidate plans are generated using rules and their costs are computed to find the most suitable way to carry out the SQL statement. Catalyst makes use of standard features of the Scala programming language such as pattern matching.
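To see Catalyst at work, you can ask any DataFrame or Dataset to print its plans; a small sketch, assuming a DataFrame named people (hypothetical, as in the DataFrame example above):

people.filter(people("age") > 21).explain(true)   // prints the parsed, analyzed, and optimized logical plans plus the physical plan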

Q.44 List the advantages of Parquet files.

Parquet is a columnar storage format, so a query reads only the columns it needs, which reduces disk I/O. It offers efficient compression and encoding schemes, supports predicate push-down, and works well for analytical workloads on large datasets.

We have categorized the above frequently asked Spark Interview Questions and Answers for Freshers and Experienced-

Follow this link to read more basic Spark interview Questions with Answers.

c. Spark Streaming Interview Questions and Answers

In this section, we will discuss some basic Spark Interview Questions and Answers based on Spark Streaming.

Q.45 What is Spark Streaming?

Spark Streaming provides fault-tolerant processing of live data streams. The input data can come from sources such as Kafka, Flume, Kinesis, Twitter, or HDFS/S3, and after processing the results can be pushed to filesystems, databases, and live dashboards. Spark Streaming works by dividing the live input stream into small batches (micro-batches); the Spark engine processes these batches and produces the final stream of results, also in batches.
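A minimal Spark Streaming word count over a socket source (the app name, host, and port are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("network-word-count")
val ssc = new StreamingContext(sparkConf, Seconds(5))          // 5-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()
ssc.start()
ssc.awaitTermination()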

Q.46 What is DStream?

A DStream (discretized stream) is the high-level abstraction provided by Spark Streaming. It represents a continuous stream of data and is internally a sequence of RDDs. There are two ways to create a DStream: from an input data stream received from a source such as Kafka, Flume, or Kinesis, or by applying high-level operations to other DStreams.

Q.47 Explain different transformation on DStream.

A DStream is the basic abstraction of Spark Streaming: a continuous sequence of RDDs representing a continuous stream of data. Like RDDs, DStreams support many of the transformations available on normal Spark RDDs, for example map(func), flatMap(func), filter(func), etc.

Q.48 Does Apache Spark provide checkpointing?

Yes, Apache Spark provides checkpointing. Spark Streaming supports two types of checkpointing: metadata checkpointing, which saves the information defining the streaming computation (configuration, DStream operations, and incomplete batches) to fault-tolerant storage so that the driver can recover from failures, and data checkpointing, which saves generated RDDs to reliable storage to cut the lineage chains of stateful transformations.
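Assuming the StreamingContext ssc from the streaming sketch above, checkpointing is enabled with a single call (the directory is hypothetical):

ssc.checkpoint("hdfs:///checkpoints/streaming-app")   // metadata and generated RDDs are saved to this fault-tolerant directory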

Q.49 What is write ahead log(journaling)?

A write-ahead log (journal) is a technique used to provide durability in database systems: every operation applied to the data is first written to a durable log, so that after a failure the data can be recovered by replaying the log. When the write-ahead log is enabled in Spark Streaming, all data received by the receivers is also saved to such a log in a fault-tolerant file system.
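The receiver write-ahead log is switched on through a configuration flag; a minimal sketch (checkpointing must also be enabled for it to take effect; the app name is arbitrary):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("wal-demo")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")   // received data is also written to the write-ahead log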

Q.50 What is a reliable and unreliable receiver in Spark?

A reliable receiver acknowledges the source only after the received data has been stored and replicated inside Spark, so the source can re-send anything that was not acknowledged. An unreliable receiver does not send acknowledgments to the source; it is simpler to implement but can lose data if the receiver fails.

We have categorized the above Spark Interview Questions and Answers for Freshers and Experienced-

d. Spark MLlib Interview Questions and Answers

Here are some Spark Interview Questions and Answers for freshers and experienced based on MLlib.

Q.51 What is Spark MLlib?

MLlib is Spark's machine learning library. The tools it provides include: common ML algorithms such as classification, regression, clustering, and collaborative filtering; featurization (feature extraction, transformation, dimensionality reduction, and selection); pipelines for constructing, evaluating, and tuning ML workflows; persistence for saving and loading algorithms, models, and pipelines; and utilities for linear algebra, statistics, and data handling.

Q.52 What is Sparse Vector?

A local vector has integer-typed, 0-based indices and double-typed values, and is stored on a single machine. MLlib supports two types of local vectors: dense and sparse. A sparse vector is one in which most of the entries are zero; it stores only the non-zero entries, backed by parallel arrays of indices and values.

Q.53 How to create a Sparse vector from a dense vector?

// Java: a 4-element sparse vector with non-zero values 3.0 and 4.0 at indices 1 and 3
Vector sparseVector = Vectors.sparse(4, new int[] {1, 3}, new double[] {3.0, 4.0});
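To actually convert an existing dense vector into a sparse one, a Scala sketch (assuming Spark MLlib is on the classpath):

import org.apache.spark.mllib.linalg.{DenseVector, SparseVector}

val dense = new DenseVector(Array(0.0, 3.0, 0.0, 4.0))
val sparse: SparseVector = dense.toSparse   // keeps only indices 1 and 3 with values 3.0 and 4.0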
Read More Apache Spark Interview Questions and Answers
