In this tutorial on Apache Spark cluster managers, we will learn what a cluster manager in Spark is and look at the types available: the Spark Standalone cluster manager, Hadoop YARN, and Apache Mesos. We will see how each cluster manager works, and then compare Spark Standalone, YARN, and Mesos.
2. Introduction to Apache Spark Cluster Managers
Apache Spark is an engine for Big Data processing. Spark can run in distributed mode on a cluster, which consists of a master and any number of workers. The cluster manager schedules and divides the resources of the host machines that form the cluster. Its prime job is to divide resources across applications, and it works as an external service for acquiring resources on the cluster.
The cluster manager dispatches work for the cluster and handles starting executor processes. Spark supports pluggable cluster management.
Apache Spark supports three types of cluster managers:
a) Standalone Cluster Manager
b) Hadoop YARN
c) Apache Mesos
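An application picks one of these cluster managers through the master URL it passes to Spark. The sketch below shows the usual URL shape for each manager. The helper name `master_url` and the host name are made up for illustration; the ports are the commonly used defaults and may differ on your cluster.

```python
def master_url(manager, host=None, port=None):
    """Return the value to pass to spark-submit's --master option."""
    if manager == "standalone":
        # 7077 is the default port of the standalone master
        return f"spark://{host}:{port or 7077}"
    if manager == "mesos":
        # 5050 is the default port of the Mesos master
        return f"mesos://{host}:{port or 5050}"
    if manager == "yarn":
        # No host needed: the ResourceManager address comes from the Hadoop configuration
        return "yarn"
    raise ValueError(f"unknown cluster manager: {manager}")


# Hypothetical host name, for illustration only
for name in ("standalone", "mesos", "yarn"):
    print(name, "->", master_url(name, host="master.example.com"))
```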
Let’s discuss these Apache Spark Cluster Managers in detail.
2.1. Apache Spark Standalone Cluster Manager
Standalone mode is a simple cluster manager included with Spark. It makes it easy to set up a cluster that Spark itself manages, and it can run on Linux, Windows, or Mac OS X. It is often the simplest way to run a Spark application in a clustered environment.
a. How does the Spark Standalone cluster work?
A standalone cluster has a master (plus optional standby masters) and a number of workers, each with a configured amount of memory and CPU cores. In standalone mode, Spark allocates resources based on cores. By default, an application will grab all the cores in the cluster.
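Because an application takes every core by default, it is common to cap it explicitly. The following is a minimal sketch that assembles a `spark-submit` command line with the `spark.cores.max` property and the `--executor-memory` option. The function name, host name, and application file are hypothetical.

```python
def standalone_submit_command(master_host, app, total_cores=None, executor_memory=None):
    """Build a spark-submit argument list for a standalone cluster."""
    cmd = ["spark-submit", "--master", f"spark://{master_host}:7077"]
    if total_cores is not None:
        # Without this cap the application grabs all cores in the cluster
        cmd += ["--conf", f"spark.cores.max={total_cores}"]
    if executor_memory is not None:
        cmd += ["--executor-memory", executor_memory]
    cmd.append(app)
    return cmd


# Hypothetical host and application, for illustration only
print(" ".join(standalone_submit_command(
    "master.example.com", "app.py", total_cores=4, executor_memory="2g")))
```

Capping the cores lets several applications share the same standalone cluster at once.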
For high availability, a ZooKeeper quorum can recover the master automatically by promoting a standby master. Manual recovery of the master is possible using the file system. Spark supports authentication through a shared secret with every cluster manager; in standalone mode, the user configures each node with the shared secret. Data on the communication protocols can be encrypted using SSL, while block transfers use SASL encryption.
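Both recovery modes are selected with `spark.deploy.*` system properties on the master, usually passed through `SPARK_DAEMON_JAVA_OPTS` in `spark-env.sh`. The sketch below only builds that option string. The helper name, the ZooKeeper addresses, and the directory are assumptions for illustration.

```python
def recovery_java_opts(mode, zk_url=None, directory=None):
    """Build the -D options that enable master recovery in standalone mode."""
    if mode == "ZOOKEEPER":
        # Automatic recovery: standby masters are elected through ZooKeeper
        props = {
            "spark.deploy.recoveryMode": "ZOOKEEPER",
            "spark.deploy.zookeeper.url": zk_url,
        }
    elif mode == "FILESYSTEM":
        # Manual recovery: state is written to a directory and read on restart
        props = {
            "spark.deploy.recoveryMode": "FILESYSTEM",
            "spark.deploy.recoveryDirectory": directory,
        }
    else:
        raise ValueError(f"unsupported recovery mode: {mode}")
    return " ".join(f"-D{key}={value}" for key, value in props.items())


# Hypothetical ZooKeeper quorum, for illustration only
print('SPARK_DAEMON_JAVA_OPTS="' + recovery_java_opts("ZOOKEEPER", zk_url="zk1:2181,zk2:2181") + '"')
```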
Each Apache Spark application has a Web User Interface for checking on the application. The Web UI provides information about executors, storage usage, and running tasks in the application. The standalone cluster manager also has its own Web UI for viewing cluster and job statistics, along with detailed log output for each job. If an application has logged events for its lifetime, the Spark Web UI will reconstruct the application’s UI after the application exits.
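Event logging is off unless you enable it. As a small sketch, the helper below produces the two `spark-defaults.conf` lines that turn it on, using the `spark.eventLog.enabled` and `spark.eventLog.dir` properties. The helper name and the log directory are hypothetical.

```python
def event_log_conf(log_dir):
    """Return spark-defaults.conf lines that make an application log its events."""
    return "\n".join([
        "spark.eventLog.enabled true",
        f"spark.eventLog.dir {log_dir}",
    ])


# Hypothetical shared directory, for illustration only
print(event_log_conf("hdfs://namenode/spark-logs"))
```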
2.2. Apache Mesos
Mesos handles workloads in a distributed environment through dynamic resource sharing and isolation. It is useful for deploying and managing applications in large-scale cluster environments. Apache Mesos pools the existing resources of the machines/nodes in a cluster, and a variety of workloads can then draw from that pool. This node abstraction reduces the overhead of allocating specific machines to different workloads. Mesos can serve as a resource management platform for Hadoop and Big Data clusters. Companies such as Twitter, Xogito, and Airbnb use Apache Mesos, which runs on Linux and Mac OS X.
In a way, Apache Mesos is the reverse of virtualization: in virtualization, one physical resource is divided into many virtual resources, while in Mesos many physical resources are pooled into a single virtual resource.
a. How does Apache Mesos work?
The three components of Apache Mesos are Mesos masters, Mesos slaves, and frameworks.
A Mesos master is an instance that manages the cluster. A cluster can run several Mesos masters to provide fault tolerance, with one instance acting as the leading master. A Mesos slave (called an agent in newer Mesos releases) is a Mesos instance that offers resources to the cluster; the Mesos master assigns tasks to the slaves. A Mesos framework allows applications to request resources from the cluster so that they can perform their tasks.
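The interplay between offered resources and framework tasks can be pictured with a toy model. The code below is not the Mesos API; it is a simplified sketch in which each offer describes the free CPUs and memory on one agent (slave), and each task is placed on the first offer that still has room. All names and numbers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    agent: str      # the slave/agent that owns the resources
    cpus: float
    mem_mb: int


def schedule(offers, tasks):
    """Place each task on the first offer with enough spare CPU and memory."""
    remaining = {o.agent: [o.cpus, o.mem_mb] for o in offers}
    placements = {}
    for name, (cpus, mem_mb) in tasks.items():
        for agent, free in remaining.items():
            if free[0] >= cpus and free[1] >= mem_mb:
                free[0] -= cpus
                free[1] -= mem_mb
                placements[name] = agent
                break  # task placed; tasks that fit nowhere are left out
    return placements


offers = [Offer("agent-1", 2.0, 4096), Offer("agent-2", 4.0, 8192)]
tasks = {"task-a": (2.0, 2048), "task-b": (1.0, 1024), "task-c": (4.0, 4096)}
print(schedule(offers, tasks))
```

In this run `task-c` stays unplaced because no single agent has four free CPUs left, which mirrors a framework waiting for a later, larger offer.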
Mesos authentication includes:
- The slave’s registration with the master.
- Submission of frameworks (that is, applications) to the cluster.
- Operators using endpoints such as HTTP endpoints.
2.3. Hadoop YARN
YARN became a sub-project of Hadoop in 2012 and is also known as MapReduce 2.0. YARN splits the functions of resource management and job scheduling into separate daemons. The idea is to have a global Resource Manager (RM) and a per-application Application Master (AM). An application is either a single job or a DAG of jobs.
a. How does Hadoop YARN work?
The YARN data computation framework is a combination of the ResourceManager and the NodeManager. It can run on Linux and Windows.
The YARN Resource Manager manages resources among all the applications in the system. It has two components: the Scheduler and the Application Manager. The Scheduler allocates resources to the various running applications. It is a pure scheduler: it performs no monitoring or tracking of application status. The Application Manager manages applications across all the nodes.
The YARN Node Manager hosts the containers on its node, including the one in which an Application Master runs. A container is a place where a unit of work happens; each MapReduce task, for example, runs in one container. The per-application Application Master is a framework-specific library. It negotiates resources from the Resource Manager and works with the Node Manager(s) to execute and monitor the tasks.
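To make the negotiation concrete, here is a toy model of a Resource Manager granting containers to an Application Master. It is a deliberately simplified sketch, not YARN's scheduler: memory is the only resource, nodes are filled in order, and the function and node names are invented.

```python
def allocate_containers(node_free_mb, container_mb, count):
    """Grant up to `count` containers of `container_mb` each, filling nodes in order.

    Returns the node chosen for each granted container.
    """
    free = dict(node_free_mb)   # copy so the caller's view is not modified
    grants = []
    for node in free:
        while free[node] >= container_mb and len(grants) < count:
            free[node] -= container_mb
            grants.append(node)
    return grants


# Hypothetical free memory reported by two Node Managers
nodes = {"node-1": 4096, "node-2": 2048}
print(allocate_containers(nodes, container_mb=1024, count=5))
```

If the request exceeds the free capacity, fewer containers are granted than were asked for, and the Application Master would have to wait for resources to free up.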
An application or job requires one or more containers. The Node Manager monitors containers and their resource usage (CPU, memory, disk, and network) and reports this to the Resource Manager. YARN provides security through authentication and service-level authorization, as well as authentication for Web consoles and data confidentiality. Hadoop authentication uses Kerberos to verify the identity of each user and service.
3. Spark Standalone mode vs YARN vs Mesos
The sections above described the features of the three Spark cluster modes. Let us now compare Standalone mode, YARN, and Mesos in detail. This will help you decide which cluster manager to choose for Spark.
3.1. High Availability
The standalone cluster: with a ZooKeeper quorum, it supports automatic recovery of the master. Manual recovery is possible using the file system. The cluster tolerates worker failures regardless of whether recovery of the master is enabled.
Apache Mesos: using Apache ZooKeeper, it supports automatic recovery of the master. In the case of a failover, tasks that are currently executing do not stop.
Apache Hadoop YARN: it supports manual recovery using a command-line utility, and automatic recovery through a ZooKeeper-based ActiveStandbyElector embedded in the ResourceManager. Thus, there is no need to run a separate ZooKeeper Failover Controller.
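For YARN, ResourceManager high availability is configured in `yarn-site.xml`. The sketch below generates the core properties for a set of ResourceManager hosts. The host names, the cluster id, and the helper names are made up. The ZooKeeper property is `hadoop.zk.address` in recent Hadoop releases, while older releases used `yarn.resourcemanager.zk-address`, so check the documentation for your version.

```python
import xml.etree.ElementTree as ET


def rm_ha_properties(rm_hosts, zk_address, cluster_id="cluster1"):
    """Return the yarn-site.xml properties that enable ResourceManager HA."""
    ids = [f"rm{i + 1}" for i in range(len(rm_hosts))]
    props = {
        "yarn.resourcemanager.ha.enabled": "true",
        "yarn.resourcemanager.cluster-id": cluster_id,
        "yarn.resourcemanager.ha.rm-ids": ",".join(ids),
        "hadoop.zk.address": zk_address,  # assumption: recent Hadoop release
    }
    for rm_id, host in zip(ids, rm_hosts):
        props[f"yarn.resourcemanager.hostname.{rm_id}"] = host
    return props


def to_site_xml(props):
    """Render the properties in the Hadoop *-site.xml layout."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical hosts, for illustration only
print(to_site_xml(rm_ha_properties(["rm-a.example.com", "rm-b.example.com"], "zk1:2181,zk2:2181")))
```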
3.2. Security
The standalone cluster: Spark supports authentication via a shared secret with all the cluster managers. The standalone manager requires the user to configure each node with the shared secret. Data can be encrypted using SSL for the communication protocols, and SASL encryption is supported for block transfers of data. Other options are also available for encrypting data. Access to Spark applications in the Web UI can be controlled via access control lists.
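These features map onto a handful of Spark configuration properties. The sketch below collects them into a dictionary and prints them as `spark-defaults.conf` lines. The helper name, the secret, and the user names are placeholders, and the set of properties is a starting point rather than a complete security setup.

```python
def security_conf(secret, view_acls=None):
    """Return Spark properties for shared-secret authentication and UI access control."""
    conf = {
        "spark.authenticate": "true",
        "spark.authenticate.secret": secret,
        # SASL encryption for block transfers
        "spark.authenticate.enableSaslEncryption": "true",
    }
    if view_acls:
        # Restrict who may view the application in the Web UI
        conf["spark.acls.enable"] = "true"
        conf["spark.ui.view.acls"] = ",".join(view_acls)
    return conf


# Placeholder secret and users, for illustration only
for key, value in security_conf("change-me", view_acls=["alice", "bob"]).items():
    print(key, value)
```

The same secret has to be configured on every node of a standalone cluster, so it is usually kept in a configuration file with restricted permissions rather than on the command line.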
Apache Mesos: Mesos provides authentication for any entity interacting with the cluster. This includes slaves registering with the master, frameworks submitted to the cluster, and operators using endpoints such as HTTP endpoints. Authentication can be enabled or disabled for each of these entities. A custom module can replace Mesos’ default authentication module, Cyrus SASL.
Mesos uses access control lists to control access to its services. By default, communication between the modules in Mesos is unencrypted; SSL/TLS can be enabled to encrypt it. The Mesos Web UI supports HTTPS.
Hadoop YARN: it provides security through authentication, service-level authorization, authentication for Web consoles, and data confidentiality. Service-level authorization ensures that a client using Hadoop services has the authority to do so. Access to Hadoop services can be controlled using access control lists. Additionally, data and communication between clients and services can be encrypted using SSL, and data transferred between the Web console and clients is protected with HTTPS.
3.3. Monitoring
The standalone cluster: it has a Web UI to view cluster and job statistics, as well as detailed log output for each job. If an application has logged events for its lifetime, the Spark Web UI will reconstruct the application’s UI after the application exits.
Apache Mesos: it supports per-container network monitoring and isolation. It provides many metrics for master and slave nodes, accessible via a URL. These metrics include the percentage and number of allocated CPUs, memory usage, and so on.
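The Mesos master exposes these metrics as JSON, typically at the `/metrics/snapshot` endpoint. The sketch below summarizes CPU and memory allocation from such a snapshot. The sample values are invented, and the code works on a JSON string rather than fetching from a live master, so it runs without a cluster.

```python
import json


def summarize_snapshot(snapshot_json):
    """Compute CPU and memory allocation percentages from a Mesos master snapshot."""
    metrics = json.loads(snapshot_json)
    cpus_used = metrics["master/cpus_used"]
    cpus_total = metrics["master/cpus_total"]
    mem_used = metrics["master/mem_used"]
    mem_total = metrics["master/mem_total"]
    return {
        "cpus_used": cpus_used,
        "cpu_percent": 100.0 * cpus_used / cpus_total,
        "mem_used_mb": mem_used,
        "mem_percent": 100.0 * mem_used / mem_total,
    }


# Made-up sample; a real snapshot contains many more keys
sample = json.dumps({
    "master/cpus_used": 6,
    "master/cpus_total": 8,
    "master/mem_used": 2048,
    "master/mem_total": 8192,
})
print(summarize_snapshot(sample))
```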
Hadoop YARN: it has a Web UI for the ResourceManager and the NodeManager. The ResourceManager UI provides metrics for the cluster, and the NodeManager UI provides information about each node and the applications and containers running on it.
4. Conclusion
In conclusion, Standalone mode is the easiest of the Apache Spark cluster managers to set up, and it provides almost all the same features as the others.
If you need richer resource scheduling capabilities (e.g. queues), both YARN and Mesos provide them. Of the two, YARN lets you share and configure the same pool of cluster resources between all frameworks that run on YARN, and it is likely to be pre-installed on Hadoop systems.
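On YARN, an application is directed to a queue at submission time. As a hedged sketch, the helper below assembles a `spark-submit` command line using the `--queue`, `--num-executors`, `--executor-cores`, and `--executor-memory` options. The function name, queue name, and application file are hypothetical.

```python
def yarn_submit_command(app, queue="default", num_executors=2,
                        executor_cores=2, executor_memory="2g"):
    """Build a spark-submit argument list that targets a YARN queue."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--queue", queue,
        "--num-executors", str(num_executors),
        "--executor-cores", str(executor_cores),
        "--executor-memory", executor_memory,
        app,
    ]


# Hypothetical queue and application, for illustration only
print(" ".join(yarn_submit_command("app.py", queue="analytics", num_executors=4)))
```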
One advantage of Mesos over both YARN and the standalone mode is its fine-grained sharing option, which lets interactive applications (such as the Spark shell) scale down their CPU allocation between commands. This makes it attractive in environments where many users are running interactive shells.
If you liked this blog or have any questions about Apache Spark cluster managers, let us know by leaving a comment.