Apache Spark Forum › In how many ways can we run Spark over Hadoop?
September 20, 2018 at 5:07 pm #6102
What are the different methods to run Spark over Apache Hadoop?
Describe various techniques to run Apache Spark over Hadoop.
September 20, 2018 at 5:07 pm #6104
Apache Spark is an open-source cluster computing engine that can handle several different data types and data sources at the same time. For some in-memory workloads, Spark can be up to 100 times faster than Hadoop MapReduce.
For on-premise deployments, Apache Spark doesn’t have a storage engine. It must rest on top of a read-write storage platform like the Hadoop Distributed File System (HDFS).
Spark can read and write data from and to HDFS as well as other storage systems, such as HBase and Amazon S3. Hadoop users can also enrich their processing capabilities by combining Spark with Hadoop MapReduce, HBase, and other big data frameworks.
There are three methods to run Spark in a Hadoop cluster: standalone, YARN, and SIMR.
Standalone deployment: In Standalone Deployment, one can statically allocate resources on all or a subset of machines in a Hadoop cluster and run Spark side by side with Hadoop MR. The user can run arbitrary Spark jobs on their HDFS data. Generally, Hadoop 1.x users will use this method.
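As a rough sketch of the standalone path, the following shows how a job might be submitted to a standalone master using the scripts that ship with Spark. The master hostname, the application class `com.example.MyApp`, and the jar name are placeholders, not anything from this thread:

```shell
# Start a standalone master and a worker (scripts ship with Spark;
# in newer releases start-slave.sh was renamed start-worker.sh).
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-slave.sh spark://master-host:7077

# Submit a job to the standalone cluster manager.
# master-host, class name, and jar are placeholders.
$SPARK_HOME/bin/spark-submit \
  --master spark://master-host:7077 \
  --class com.example.MyApp \
  my-app.jar hdfs:///data/input
```

Because Spark runs side by side with Hadoop MapReduce here, the cluster administrator decides up front how many cores and how much memory each framework gets.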
Hadoop Yarn deployment: Hadoop users who have already deployed or are planning to deploy Hadoop Yarn can simply run Spark on YARN without any pre-installation or administrative access required. Thus users can easily integrate Spark in their Hadoop stack and take advantage of Spark, as well as of other components running on Spark.
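A minimal sketch of submitting to YARN, assuming a configured Hadoop client on the machine; the config path, class name, and jar are placeholders:

```shell
# Submit to YARN; Spark finds the cluster from the Hadoop config
# directory, so no master host/port is needed on the command line.
export HADOOP_CONF_DIR=/etc/hadoop/conf   # placeholder path
$SPARK_HOME/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  my-app.jar
```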
Spark In MapReduce (SIMR): For Hadoop users who are not running YARN yet, another option besides standalone deployment is SIMR, which launches Spark jobs inside MapReduce. With SIMR, one can start experimenting with Spark and using its shell within a couple of minutes of downloading it. This lowers the barrier to deploying and playing with Spark.
Spark interoperates not only with Hadoop but also with other popular big data technologies.
Apache Hive: Spark enables Apache Hive users to run their unmodified queries much faster. Hive is a data warehouse solution running on Hadoop, while Shark allows the Hive framework to run on Spark in place of Hadoop.
AWS EC2: Users can easily run Spark on top of Amazon EC2, either using the scripts that come with Spark or the hosted version of Spark on Amazon Elastic MapReduce.
Apache Mesos: Spark runs on top of Mesos. Mesos is a cluster manager system that provides efficient resource isolation across distributed applications, including MPI and Hadoop. Mesos enables fine-grained sharing that allows a Spark job to dynamically take advantage of the ideal resources in the cluster during execution. This results in performance improvements, especially for long-running Spark jobs.
September 20, 2018 at 5:09 pm #6115
Spark can run either in local mode or in a distributed manner in the cluster.
1. Local mode – There is no resource manager in local mode. This mode is used to test a Spark application in a development environment, where we do not want to consume cluster resources and want applications to start quickly.
Here everything runs in a single JVM.
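A quick sketch of local mode, assuming a local Spark download; the class name and jar are placeholders:

```shell
# Local mode: driver and executors all run in one JVM, no cluster
# manager. local[*] uses every core on the machine; local[2] would
# use two.
$SPARK_HOME/bin/spark-submit --master "local[*]" \
  --class com.example.MyApp my-app.jar

# Or explore interactively without packaging a jar at all:
$SPARK_HOME/bin/spark-shell --master "local[*]"
```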
2. Distributed / Cluster modes:
We can run Spark in a distributed manner with a master-slave architecture. There will be multiple worker nodes in each cluster, and the cluster manager allocates resources to each worker node.
Spark can be deployed in a distributed cluster in 3 ways.
1. Standalone:
In standalone mode, Spark itself handles resource allocation; there won't be any separate cluster manager. Spark allocates CPU and memory to worker nodes based on resource availability.
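In standalone mode those allocations are controlled through Spark properties, for instance in `conf/spark-defaults.conf`. The host and the concrete values below are illustrative placeholders, not recommendations:

```
# conf/spark-defaults.conf — standalone-mode resource settings (sketch)
spark.master          spark://master-host:7077
spark.executor.memory 2g
spark.cores.max       8
```

`spark.cores.max` caps the total cores one application may claim across the whole cluster, which is how standalone mode shares a cluster between concurrent jobs.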
2. YARN :
Here, YARN is used as the cluster manager. The YARN deployment is mainly used when Spark runs alongside other Hadoop components such as MapReduce, for example in a Cloudera or Hortonworks distribution.
YARN is a combination of Resource Manager and Node Manager.
The Resource Manager consists of a Scheduler and an Application Manager.
Scheduler: allocates resources to the various running applications.
Application Manager: manages applications across all nodes.
The Node Manager hosts the Application Master and containers.
The container is where the actual work happens.
The Application Master negotiates resources from the Resource Manager.
3. Mesos:
Mesos is used in large-scale production deployments. In a Mesos deployment, all the resources available across all nodes in the cluster are pooled together, and dynamic sharing of resources is done.
Mesos master, slave, and framework are the three components of Mesos:
master – provides fault tolerance
slave – actually does the resource allocation
framework – helps the application request resources
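A sketch of pointing Spark at a Mesos cluster; the master host/port, class name, and jar are placeholders (5050 is only the conventional Mesos master port):

```shell
# Submit to a Mesos master; Spark registers as a Mesos framework
# and receives resource offers from the slaves.
$SPARK_HOME/bin/spark-submit \
  --master mesos://mesos-master:5050 \
  --class com.example.MyApp \
  my-app.jar
```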
For more information on Spark Cluster Managers, read: Cluster Manager of Spark
September 20, 2018 at 5:09 pm #6118
We can run Spark over Hadoop in the following three ways:
1. Local standalone mode — everything (the driver, executors, etc.) runs locally on the same machine. Generally used for testing and for developing the logic of a Spark application.
2. YARN client mode — the driver program runs on the client machine that submits the job, while the executors run on separate worker nodes (data nodes).
3. YARN cluster mode — the driver program runs on one of the cluster's nodes, managed by YARN, and the executors run on separate worker nodes. Most advisable for production platforms.
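The difference between the last two modes is just the `--deploy-mode` flag on `spark-submit`; the class name and jar below are placeholders:

```shell
# Client mode: the driver stays on the submitting machine — handy
# for interactive work and debugging, since output comes back to you.
spark-submit --master yarn --deploy-mode client \
  --class com.example.MyApp my-app.jar

# Cluster mode: the driver runs inside the cluster under YARN —
# preferred for production, as the job survives the client logging off.
spark-submit --master yarn --deploy-mode cluster \
  --class com.example.MyApp my-app.jar
```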