Here we are going to learn how Apache Spark Works? In Apache Spark, the central coordinator is called the driver. When you enter your code in spark, SparkContext in the driver program creates the job when an Action is called. This job is submitted to DAG Scheduler which creates the operator graph and then submits it to task Scheduler. Task Scheduler launches the task via cluster manager. Thus, with the help of cluster manager, a Spark Application is launched on a set of machines.
This Apache Spark tutorial will explain the run-time architecture of Apache Spark along with key Spark terminologies like Apache SparkContext, Spark shell, Apache Spark application, task, job and stages in Spark. We will also learn about the components of Spark run time architecture like the Spark driver, cluster manager & Spark executors. At last, we will see how Apache spark works using these components.
2. Internals of How Apache Spark works?
Apache Spark is an open source, general-purpose distributed computing engine used for processing and analyzing a large amount of data. Just like Hadoop MapReduce, it also works with the system to distribute data across the cluster and process the data in parallel. Spark uses master/slave architecture i.e. one central coordinator and many distributed workers. Here, the central coordinator is called the driver.
The driver runs in its own Java process. These drivers communicate with a potentially large number of distributed workers called executors. Each executor is a separate java process. A Spark Application is a combination of driver and its own executors. With the help of cluster manager, a Spark Application is launched on a set of machines. Standalone Cluster Manager is the default built in cluster manager of Spark. Apart from its built-in cluster manager, Spark also works with some open source cluster manager like Hadoop Yarn, Apache Mesos etc.
3. Terminologies of Spark
3.1. Apache SparkContext
SparkContext is the heart of Spark Application. It establishes a connection to the Spark Execution environment. It is used to create Spark RDDs, accumulators, and broadcast variables, access Spark services and run jobs. SparkContext is a client of Spark execution environment and acts as the master of Spark application. The main works of Spark Context are:
- Getting the current status of spark application
- Canceling the job
- Canceling the Stage
- Running job synchronously
- Running job asynchronously
- Accessing persistent RDD
- Unpersisting RDD
- Programmable dynamic allocation
3.2. Apache Spark Shell
Spark Shell is a Spark Application written in Scala. It offers command line environment with auto-completion. It helps us to get familiar with the features of Spark, which help in developing our own Standalone Spark Application. Thus, this tool helps in exploring Spark and is also the reason why Spark is so helpful in processing data set of all size.
3.3. Spark Application
The Spark application is a self-contained computation that runs user-supplied code to compute a result. A Spark application can have processes running on its behalf even when it’s not running a job.
A task is a unit of work that is sent to the executor. Each stage has some task, one task per partition. The Same task is done over different partitions of RDD.
The job is parallel computation consisting of multiple tasks that get spawned in response to actions in Apache Spark.
Each job gets divided into smaller sets of tasks called stages that depend on each other. Stages are classified as computational boundaries. All computation cannot be done in single stage. It is achieved over many stages.
4. Components of Spark Run-time Architecture of Spark
4.1. Apache Spark Driver
The main() method of the program runs in the driver. The driver is the process that runs the user code that creates RDDs, and performs transformation and action, and also creates SparkContext. When the Spark Shell is launched, this signifies that we have created a driver program. On the termination of the driver, the application is finished.
The driver program splits the Spark application into the task and schedules them to run on the executor. The task scheduler resides in the driver and distributes task among workers. The two main key roles of drivers are:
- Converting user program into the task.
- Scheduling task on the executor.
The structure of Spark program at a higher level is: RDDs are created from some input data, derive new RDD from existing using various transformations, and then after it performs an action to compute data. In Spark Program, the DAG (directed acyclic graph) of operations are created implicitly. And when the driver runs, it converts that Spark DAG into a physical execution plan.
4.2. Apache Spark Cluster Manager
Spark relies on cluster manager to launch executors and in some cases, even the drivers are launched through it. It is a pluggable component in Spark. On the cluster manager, jobs and action within a spark application are scheduled by Spark Scheduler in a FIFO fashion. Alternatively, the scheduling can also be done in Round Robin fashion. The resources used by a Spark application can be dynamically adjusted based on the workload. Thus, the application can free unused resources and request them again when there is a demand. This is available on all coarse-grained cluster managers, i.e. standalone mode, YARN mode, and Mesos coarse-grained mode.
4.3. Apache Spark Executors
The individual task in the given Spark job runs in the Spark executors. Executors are launched once in the beginning of Spark Application and then they run for the entire lifetime of an application. Even if the Spark executor fails, the Spark application can continue with ease. There are two main roles of the executors:
- Runs the task that makes up the application and returns the result to the driver.
- Provide in-memory storage for RDDs that are cached by the user.
5. How to launch a Program in Spark?
Despite using any cluster manager, Spark comes with the facility of a single script that can be used to submit a program, called as spark-submit. It launches the application on the cluster. There are various options through which spark-submit can connect to different cluster manager and control how many resources our application gets. For some cluster managers, spark-submit can run the driver within the cluster (e.g., on a YARN worker node), while for others, it can run only on your local machine.
6. How to Run Apache Spark Application on a cluster
- Using spark-submit, the user submits an application.
- In spark-submit, the main() method specified by the user is invoked. It also launches the driver program.
- The driver program asks for the resources to the cluster manager that is required to launch executors.
- The cluster manager launches executors on behalf of the driver program.
- The driver process runs with the help of user application. Based on the actions and transformation on RDDs, the driver sends work to executors in the form of tasks.
- The executors process the task and the result is sent back to the driver through cluster manager.
If you have any query about Apache Spark job execution flow, so feel free to share with us. We will be happy to solve them.