What are the components of the runtime architecture of Spark?


    • #6405
      DataFlair Team
      Spectator

      Explain the run-time architecture of Spark.
      What are the components of the runtime architecture of Spark? Describe them.

    • #6406
      DataFlair Team
      Spectator

      Apache Spark has a well-defined, layered architecture in which the components
      and layers are loosely coupled and integrated with several libraries and extensions.

      The Apache Spark architecture is based on two main abstractions:

      • Resilient Distributed Datasets (RDD)
      • Directed Acyclic Graph (DAG)

      Resilient Distributed Datasets (RDD)
      RDDs are collections of data items that are split into partitions and can be
      stored in memory on the worker nodes of the cluster.
      In terms of datasets, Apache Spark supports two types of RDDs:
      1) Hadoop datasets, which are created from files stored on HDFS
      2) Parallelized collections, which are based on existing Scala collections.

      Spark RDDs support two different types of operations (illustrated in the sketch below):
      1) Transformations
      2) Actions
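
      As a rough illustration only (the master URL and HDFS path below are placeholders, not a
      prescribed setup), the sketch creates both kinds of RDDs and contrasts a lazy transformation
      with an eager action:

      import org.apache.spark.{SparkConf, SparkContext}

      // A local SparkContext purely for demonstration
      val sc = new SparkContext(new SparkConf().setAppName("rdd-demo").setMaster("local[*]"))

      // 1) Hadoop dataset: an RDD built from a file on HDFS (path is hypothetical)
      val hadoopRdd = sc.textFile("hdfs:///user/demo/input.txt")

      // 2) Parallelized collection: an RDD built from an existing Scala collection
      val parallelRdd = sc.parallelize(Seq(1, 2, 3, 4, 5), numSlices = 2)

      // Transformation (lazy): only records the computation, nothing runs yet
      val doubled = parallelRdd.map(_ * 2)

      // Action (eager): triggers actual execution on the worker nodes
      val total = doubled.reduce(_ + _)   // 30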

      Directed Acyclic Graph (DAG)

      A DAG is a sequence of computations performed on data. In this graph, each node signifies an RDD partition
      and each edge signifies a transformation applied on top of the data.
      The DAG abstraction provides performance enhancements over Hadoop and helps eliminate
      the multi-stage execution model of Hadoop MapReduce.
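
      The lineage behind this DAG can be inspected directly. As a small sketch (reusing the
      demonstration SparkContext sc from above), RDD.toDebugString prints the chain of
      transformations that Spark turns into a DAG:

      val lineageDemo = sc.parallelize(1 to 10)
        .map(_ * 2)      // transformation recorded in the lineage
        .filter(_ > 5)   // another transformation; still nothing has executed
      println(lineageDemo.toDebugString)   // prints the RDD lineage that the DAG is built from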

      Apache Spark follows a master/slave architecture with two main daemons and a cluster manager:
      1. Master Daemon – (Master/Driver Process)
      2. Worker Daemon – (Slave Process)
      3. Cluster Manager

      A Spark cluster has a single master and any number of slaves/workers.

      Role of Driver in Spark Architecture

      Spark Driver – Master Node of a Spark Application

      The driver is the central point and the entry point of the Spark shell.
      The driver program runs the main() function of the application
      and is the place where the SparkContext is created (a minimal driver sketch follows the bullets below).
      The Spark driver contains various components
      – DAGScheduler,
      – TaskScheduler,
      – SchedulerBackend,
      – and BlockManager,
      which together are responsible for translating Spark user code into actual Spark jobs
      executed on the cluster.

      • As the driver program runs on the master node of the Spark cluster, it schedules the job execution and negotiates with the cluster manager.
      • It converts the RDDs into an execution graph and splits the graph into multiple stages.
      • The driver stores the metadata about all the Resilient Distributed Datasets and their partitions.
      • The driver program converts a user application into smaller execution units known as tasks,
      which are then executed by the executors, i.e. the worker processes that run individual tasks.
      • At port 4040, the driver exposes information about the running Spark application through a Web UI.
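
      A minimal driver program might look like the sketch below. This is illustrative only: the
      application name, master URL, and computation are placeholders, not a prescribed setup.

      import org.apache.spark.{SparkConf, SparkContext}

      object MyDriverApp {
        def main(args: Array[String]): Unit = {
          val conf = new SparkConf()
            .setAppName("my-driver-app")
            .setMaster("spark://master-host:7077")   // placeholder master URL

          // Creating the SparkContext in main() makes this process the driver;
          // it registers with the cluster manager and serves the Web UI on port 4040.
          val sc = new SparkContext(conf)

          // RDD operations defined here are turned into a DAG, split into stages and
          // tasks by the driver's schedulers, and shipped to the executors.
          val evenCount = sc.parallelize(1 to 100).filter(_ % 2 == 0).count()
          println(s"Even numbers: $evenCount")

          sc.stop()   // releases the executors and the resources held from the cluster manager
        }
      }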

      Role of Executor in Spark Architecture

      An executor is a distributed agent responsible for the execution of tasks.
      Every Spark application has its own executor processes.
      Executors normally run for the whole life of a Spark application;
      this is called “static allocation of executors”.
      There is also an option for dynamic allocation of executors, in which
      Spark adds or removes executors dynamically to match the overall workload
      (see the configuration sketch after the list below).

      • The executor performs all the data processing.
      • It reads data from and writes data to external sources.
      • It stores the computation results in memory, in cache, or on hard disk drives.
      • It interacts with the storage systems.
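
      The two allocation modes mentioned above are controlled through configuration. The sketch
      below uses standard Spark configuration keys, but every value is a placeholder to be tuned
      per cluster, not a recommendation:

      import org.apache.spark.SparkConf

      // Static allocation: a fixed number of executors for the whole application
      val staticConf = new SparkConf()
        .set("spark.executor.instances", "4")
        .set("spark.executor.cores", "2")
        .set("spark.executor.memory", "4g")

      // Dynamic allocation: executors are added or removed to match the workload
      // (on most cluster managers this also requires the external shuffle service)
      val dynamicConf = new SparkConf()
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.dynamicAllocation.minExecutors", "1")
        .set("spark.dynamicAllocation.maxExecutors", "10")
        .set("spark.shuffle.service.enabled", "true")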

      Role of Cluster Manager in Spark Architecture

      The cluster manager is an external service responsible for acquiring resources on the Spark cluster
      and allocating them to a Spark job; Spark's standalone manager, Hadoop YARN, and Apache Mesos are examples of such services.

      What happens when a Spark Job is submitted?

      When a client submits Spark application code, it is the driver that converts the code containing transformations and actions into a logical directed acyclic graph (DAG).
      The driver program also performs optimizations such as pipelining transformations.
      Afterwards, it converts the logical DAG into a physical execution plan with a set of stages.
      After creating the physical execution plan, it creates small physical execution units,
      referred to as tasks, for each stage.
      The tasks are then bundled and sent to the Spark cluster. A small job that walks through this pipeline is sketched below.
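
      As an illustration (the input path is hypothetical and sc is an existing SparkContext), the
      word-count job below shows where pipelining and stage boundaries arise in such a plan:

      // Narrow transformations (flatMap, map) are pipelined inside one stage;
      // the wide transformation (reduceByKey) introduces a stage boundary;
      // the action (collect) triggers the DAG -> stages -> tasks pipeline.
      val lines  = sc.textFile("hdfs:///data/input.txt")
      val words  = lines.flatMap(_.split("\\s+"))
      val pairs  = words.map(word => (word, 1))
      val counts = pairs.reduceByKey(_ + _)
      val result = counts.collect()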

      The driver program then talks to the cluster manager to negotiate resources.
      The cluster manager then launches executors on the worker nodes on behalf of the driver.
      At this point, the driver sends tasks to the executors based on data placement.
      Before the executors begin execution, they register themselves with the driver program, so that the driver has a holistic view of all the executors. The executors then start executing
      the various tasks assigned by the driver program.
      At any point in time while the Spark application is running, the driver program monitors the set of executors that run.

      The driver program in the Spark architecture also schedules future tasks.
      When the driver program's main() method exits, or when it calls the stop() method of the SparkContext,
      it terminates all the executors and releases the resources from the cluster manager.

      For more detailed insights, follow the link: How Apache Spark Works – Run-time Spark Architecture
