What are the components of the runtime architecture of Spark?


    • #6405
      DataFlair Team
      Spectator

      Explain the run-time architecture of Spark.
      What are the components of the runtime architecture of Spark? Describe them.

    • #6406
      DataFlair Team
      Spectator

      Apache Spark has a well-defined, layered architecture in which the components
      and layers are loosely coupled and integrated with several libraries and extensions.

      The Apache Spark architecture is based on two main abstractions:

      • Resilient Distributed Datasets (RDD)
      • Directed Acyclic Graph (DAG)

      Resilient Distributed Datasets (RDD)
      RDDs are collections of data items that are split into partitions and can be
      stored in memory on the worker nodes of the cluster.
      In terms of datasets, Apache Spark supports two types of RDDs:
      1) Hadoop datasets, which are created from files stored on HDFS
      2) Parallelized collections, which are based on existing Scala collections.

      Spark RDDs support two different types of operations (illustrated in the sketch below):
      1) Transformations
      2) Actions
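
      As a rough illustration only (the master URL and HDFS path below are placeholders, not a
      prescribed setup), the sketch creates both kinds of RDDs and contrasts a lazy transformation
      with an eager action:

      import org.apache.spark.{SparkConf, SparkContext}

      // A local SparkContext purely for demonstration
      val sc = new SparkContext(new SparkConf().setAppName("rdd-demo").setMaster("local[*]"))

      // 1) Hadoop dataset: an RDD built from a file on HDFS (path is hypothetical)
      val hadoopRdd = sc.textFile("hdfs:///user/demo/input.txt")

      // 2) Parallelized collection: an RDD built from an existing Scala collection
      val parallelRdd = sc.parallelize(Seq(1, 2, 3, 4, 5), numSlices = 2)

      // Transformation (lazy): only records the computation, nothing runs yet
      val doubled = parallelRdd.map(_ * 2)

      // Action (eager): triggers actual execution on the worker nodes
      val total = doubled.reduce(_ + _)   // 30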

      Directed Acyclic Graph (DAG)

      A DAG is a sequence of computations performed on data. In this graph, each node signifies an RDD partition
      and each edge signifies a transformation applied on top of the data.
      The DAG abstraction provides performance enhancements over Hadoop and helps eliminate
      the multi-stage execution model of Hadoop MapReduce.
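
      The lineage behind this DAG can be inspected directly. As a small sketch (reusing the
      demonstration SparkContext sc from above), RDD.toDebugString prints the chain of
      transformations that Spark turns into a DAG:

      val lineageDemo = sc.parallelize(1 to 10)
        .map(_ * 2)      // transformation recorded in the lineage
        .filter(_ > 5)   // another transformation; still nothing has executed
      println(lineageDemo.toDebugString)   // prints the RDD lineage that the DAG is built from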

      Apache Spark follows a master/slave architecture with two main daemons and a cluster manager:
      1. Master Daemon – (Master/Driver Process)
      2. Worker Daemon – (Slave Process)
      3. Cluster Manager

      A Spark cluster has a single master and any number of slaves/workers.

      Role of Driver in Spark Architecture

      Spark Driver – Master Node of a Spark Application

      The driver is the central point and the entry point of the Spark shell.
      The driver program runs the main() function of the application
      and is the place where the SparkContext is created (a minimal driver sketch follows the bullets below).
      The Spark driver contains various components
      – DAGScheduler,
      – TaskScheduler,
      – SchedulerBackend,
      – and BlockManager,
      which together are responsible for translating Spark user code into actual Spark jobs
      executed on the cluster.

      • As the driver program runs on the master node of the Spark cluster, it schedules the job execution and negotiates with the cluster manager.
      • It converts the RDDs into an execution graph and splits the graph into multiple stages.
      • The driver stores the metadata about all the Resilient Distributed Datasets and their partitions.
      • The driver program converts a user application into smaller execution units known as tasks,
      which are then executed by the executors, i.e. the worker processes that run individual tasks.
      • At port 4040, the driver exposes information about the running Spark application through a Web UI.
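
      A minimal driver program might look like the sketch below. This is illustrative only: the
      application name, master URL, and computation are placeholders, not a prescribed setup.

      import org.apache.spark.{SparkConf, SparkContext}

      object MyDriverApp {
        def main(args: Array[String]): Unit = {
          val conf = new SparkConf()
            .setAppName("my-driver-app")
            .setMaster("spark://master-host:7077")   // placeholder master URL

          // Creating the SparkContext in main() makes this process the driver;
          // it registers with the cluster manager and serves the Web UI on port 4040.
          val sc = new SparkContext(conf)

          // RDD operations defined here are turned into a DAG, split into stages and
          // tasks by the driver's schedulers, and shipped to the executors.
          val evenCount = sc.parallelize(1 to 100).filter(_ % 2 == 0).count()
          println(s"Even numbers: $evenCount")

          sc.stop()   // releases the executors and the resources held from the cluster manager
        }
      }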

      Role of Executor in Spark Architecture

      An executor is a distributed agent responsible for the execution of tasks.
      Every Spark application has its own executor processes.
      Executors normally run for the whole life of a Spark application;
      this is called “static allocation of executors”.
      There is also an option for dynamic allocation of executors, in which
      Spark adds or removes executors dynamically to match the overall workload
      (see the configuration sketch after the list below).

      • The executor performs all the data processing.
      • It reads data from and writes data to external sources.
      • It stores the computation results in memory, in cache, or on hard disk drives.
      • It interacts with the storage systems.
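
      The two allocation modes mentioned above are controlled through configuration. The sketch
      below uses standard Spark configuration keys, but every value is a placeholder to be tuned
      per cluster, not a recommendation:

      import org.apache.spark.SparkConf

      // Static allocation: a fixed number of executors for the whole application
      val staticConf = new SparkConf()
        .set("spark.executor.instances", "4")
        .set("spark.executor.cores", "2")
        .set("spark.executor.memory", "4g")

      // Dynamic allocation: executors are added or removed to match the workload
      // (on most cluster managers this also requires the external shuffle service)
      val dynamicConf = new SparkConf()
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.dynamicAllocation.minExecutors", "1")
        .set("spark.dynamicAllocation.maxExecutors", "10")
        .set("spark.shuffle.service.enabled", "true")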

      Role of Cluster Manager in Spark Architecture

      The cluster manager is an external service responsible for acquiring resources on the Spark cluster
      and allocating them to a Spark job; Spark's standalone manager, Hadoop YARN, and Apache Mesos are examples of such services.

      What happens when a Spark Job is submitted?

      When a client submits Spark application code, it is the driver that converts the code containing transformations and actions into a logical directed acyclic graph (DAG).
      The driver program also performs optimizations such as pipelining transformations.
      Afterwards, it converts the logical DAG into a physical execution plan with a set of stages.
      After creating the physical execution plan, it creates small physical execution units,
      referred to as tasks, for each stage.
      The tasks are then bundled and sent to the Spark cluster. A small job that walks through this pipeline is sketched below.
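
      As an illustration (the input path is hypothetical and sc is an existing SparkContext), the
      word-count job below shows where pipelining and stage boundaries arise in such a plan:

      // Narrow transformations (flatMap, map) are pipelined inside one stage;
      // the wide transformation (reduceByKey) introduces a stage boundary;
      // the action (collect) triggers the DAG -> stages -> tasks pipeline.
      val lines  = sc.textFile("hdfs:///data/input.txt")
      val words  = lines.flatMap(_.split("\\s+"))
      val pairs  = words.map(word => (word, 1))
      val counts = pairs.reduceByKey(_ + _)
      val result = counts.collect()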

      The driver program then talks to the cluster manager to negotiate resources.
      The cluster manager then launches executors on the worker nodes on behalf of the driver.
      At this point, the driver sends tasks to the executors based on data placement.
      Before the executors begin execution, they register themselves with the driver program, so that the driver has a holistic view of all the executors. The executors then start executing
      the various tasks assigned by the driver program.
      At any point in time while the Spark application is running, the driver program monitors the set of executors that run.

      The driver program in the Spark architecture also schedules future tasks.
      When the driver program's main() method exits, or when it calls the stop() method of the SparkContext,
      it terminates all the executors and releases the resources from the cluster manager.

      For more detailed insights, follow the link: How Apache Spark Works – Run-time Spark Architecture
