In how many ways can we run Spark over Hadoop?

This topic has 3 replies, 1 voice, and was last updated 5 years, 7 months ago by DataFlair Team.

- September 20, 2018 at 5:07 pm #6102
  
  DataFlair Team
  Spectator
  
  What are the different methods to run Spark over Apache Hadoop?
  Describe various techniques to run Apache Spark over Hadoop.
- September 20, 2018 at 5:07 pm #6104
  
  DataFlair Team
  Spectator
  
  Apache Spark is an open-source cluster computing engine that handles several different data types and data sources at the same time. For some in-memory workloads, Spark can run up to 100 times faster than Hadoop MapReduce.
  
  Apache Spark doesn’t have its own storage engine. For on-premise deployments, it usually sits on top of a read-write storage platform like the Hadoop Distributed File System (HDFS).
  
  Spark can read and write data from and to HDFS and other storage systems, for example HBase and Amazon’s S3. Hadoop users can also enrich their processing capabilities by combining Spark with Hadoop MapReduce, HBase, and other big data frameworks.
  
  There are three methods to run Spark in a Hadoop cluster: standalone, YARN, and SIMR.
  
  Standalone deployment: In Standalone Deployment, one can statically allocate resources on all or a subset of machines in a Hadoop cluster and run Spark side by side with Hadoop MR. The user can run arbitrary Spark jobs on their HDFS data. Generally, Hadoop 1.x users will use this method.
  
  Hadoop YARN deployment: Hadoop users who have already deployed or are planning to deploy Hadoop YARN can simply run Spark on YARN without any pre-installation or administrative access. This lets users easily integrate Spark into their Hadoop stack and take advantage of Spark, as well as of other components running on top of Spark.
  
  Spark In MapReduce (SIMR): For Hadoop users who are not running YARN yet, another option, in addition to the standalone deployment, is to use SIMR to launch Spark jobs inside MapReduce. With SIMR, one can start experimenting with Spark and use its shell within a couple of minutes of downloading it. This lowers the barrier to deployment and lets users play with Spark.
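
As a rough sketch of how the standalone and YARN options differ on the command line, the Python helper below assembles a `spark-submit` invocation for either one. `--master`, `--class`, and `--executor-memory` are standard `spark-submit` options; the host name `master-host`, the class `com.example.WordCount`, the jar name, and the HDFS path are placeholders assumed purely for illustration. SIMR is left out of this sketch.

```python
def spark_submit_command(master, app_jar, main_class, app_args=(), executor_memory="2g"):
    """Build the argument list for spark-submit against a standalone or YARN master."""
    # Standalone masters use a spark://host:port URL; YARN is selected with the literal "yarn".
    if not (master == "yarn" or master.startswith("spark://")):
        raise ValueError(f"unsupported master for this sketch: {master!r}")
    return [
        "spark-submit",
        "--master", master,
        "--class", main_class,
        "--executor-memory", executor_memory,
        app_jar,
        *app_args,
    ]


# Hypothetical application and input path, used only to show the shape of the command.
standalone_cmd = spark_submit_command(
    "spark://master-host:7077", "wordcount.jar", "com.example.WordCount",
    ["hdfs:///data/input.txt"],
)
yarn_cmd = spark_submit_command(
    "yarn", "wordcount.jar", "com.example.WordCount",
    ["hdfs:///data/input.txt"],
)

print(" ".join(standalone_cmd))
print(" ".join(yarn_cmd))
```

Only the `--master` value changes between the two commands. With YARN, the cluster location is normally picked up from the Hadoop configuration directory rather than from the URL.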
  
  Spark interoperates not only with Hadoop but also with other popular big data technologies.
  
  Apache Hive: Through Shark, Spark enables Apache Hive users to run their unmodified queries much faster. Hive is a data warehouse solution running on Hadoop, while Shark allows the Hive framework to run on Spark in place of Hadoop MapReduce. (Shark has since been superseded by Spark SQL.)
  
  AWS EC2: Users can easily run Spark on top of Amazon’s EC2, either using the scripts that come with Spark or the hosted versions of Spark on Amazon’s Elastic MapReduce.
  
  Apache Mesos: Spark runs on top of Mesos. Mesos is a cluster manager that provides efficient resource isolation across distributed applications, including MPI and Hadoop. Mesos enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during execution. This results in performance improvements, especially for long-running Spark jobs.
- September 20, 2018 at 5:09 pm #6115
  
  DataFlair Team
  Spectator
  
  Spark can run either locally or in a distributed manner on a cluster.
  
  1. Local mode: There is no resource manager in local mode. This mode is used to test a Spark application in a test environment, where we do not want to consume cluster resources and want applications to start quickly.
  Here everything runs in a single JVM.
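
A minimal sketch of how local mode is selected: Spark picks it up from the master URL, where `local` means one worker thread, `local[K]` means K threads, and `local[*]` means one thread per logical core. The helper name `local_master` is hypothetical and only builds the string.

```python
def local_master(threads=None):
    """Return the Spark master URL for local mode.

    threads=None -> "local[*]" (one worker thread per logical core)
    threads=1    -> "local"    (a single worker thread)
    threads=K    -> "local[K]" (K worker threads)
    """
    if threads is None:
        return "local[*]"
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return "local" if threads == 1 else f"local[{threads}]"


print(local_master())    # local[*]
print(local_master(1))   # local
print(local_master(4))   # local[4]

# With PySpark installed, the URL would be passed to the session builder, e.g.:
#   SparkSession.builder.master(local_master(2)).appName("local-test").getOrCreate()
```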
  
  2. Distributed / Cluster modes:
  
  We can run Spark in a distributed manner with a master-slave architecture. Each cluster has multiple worker nodes, and the cluster manager allocates resources to each worker node.
  
  Spark can be deployed on a distributed cluster in three ways:
  
  1. Standalone mode
  2. YARN
  3. Mesos
  
  1. Standalone:
  
  In standalone mode, Spark itself handles resource allocation; there won’t be any separate cluster manager. Spark allocates CPU and memory to worker nodes based on resource availability.
  
  2. YARN :
  
  Here, YARN is used as the cluster manager. YARN deployment is mainly used when Spark runs alongside other Hadoop components such as MapReduce, for example in a Cloudera or Hortonworks distribution.
  YARN is a combination of a Resource Manager and Node Managers.
  
  The Resource Manager has a Scheduler and an Application Manager.
  Scheduler: allocates resources to the various running applications.
  Application Manager: manages all applications across all nodes.
  
  Each Node Manager hosts containers, and the Application Master itself runs in one of them.
  The container is the place where the actual work happens.
  The Application Master negotiates resources from the Resource Manager.
  
  3. Mesos:
  
  Mesos is used in large-scale production deployments. In a Mesos deployment, all the resources available in the cluster across all nodes are pooled together and shared dynamically.
  The Mesos master, slave, and framework are the three components of Mesos.
  master: tracks the resources available across the cluster and offers them to frameworks
  slave: runs on each node, reports that node's resources, and executes the tasks
  framework: helps the application request and accept resources
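
The choice between local mode and the three cluster managers above comes down to the master URL handed to Spark. The sketch below maps a master URL to the mode it selects. The function name `cluster_manager_for` and the host names are hypothetical; the URL schemes (`local`, `spark://`, `yarn`, `mesos://`) are the ones Spark documents.

```python
def cluster_manager_for(master):
    """Map a Spark master URL to the deployment mode it selects."""
    if master == "local" or master.startswith("local["):
        return "local"        # single JVM, no cluster manager
    if master.startswith("spark://"):
        return "standalone"   # Spark's own built-in cluster manager
    if master == "yarn":
        return "yarn"         # cluster located via the Hadoop configuration
    if master.startswith("mesos://"):
        return "mesos"
    raise ValueError(f"unrecognised master URL: {master!r}")


# Host names below are placeholders for illustration.
for url in ("local[*]", "spark://spark-host:7077", "yarn", "mesos://mesos-host:5050"):
    print(f"{url} -> {cluster_manager_for(url)}")
```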
  
  For more information on Spark Cluster Manager read: Cluster Manager of Spark
- September 20, 2018 at 5:09 pm #6118
  
  DataFlair Team
  Spectator
  
  We can run Spark over Hadoop in the following three ways:
  
  1. Local standalone mode: everything (Spark, driver, worker, etc.) is on the same local machine. Generally used for testing and for developing the logic of a Spark application.
  
  2. YARN client: the driver program runs on the same client machine that submits the job, and the workers (data nodes) are separate.
  
  3. YARN cluster: the driver program runs on one of the cluster's nodes, inside the YARN Application Master, and the workers are separate. This is the most advisable mode for production platforms.
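
The two YARN modes above differ only in the `--deploy-mode` option of `spark-submit`. The following is a small sketch under that assumption; the helper name `yarn_submit_args` and the application file `job.py` are hypothetical.

```python
# Where the driver ends up for each YARN deploy mode.
DRIVER_LOCATION = {
    "client": "the machine that ran spark-submit",
    "cluster": "a cluster node, inside the YARN Application Master",
}


def yarn_submit_args(deploy_mode, app_file):
    """Build spark-submit arguments for YARN in client or cluster deploy mode."""
    if deploy_mode not in DRIVER_LOCATION:
        raise ValueError(f"deploy mode must be 'client' or 'cluster', got {deploy_mode!r}")
    return ["spark-submit", "--master", "yarn", "--deploy-mode", deploy_mode, app_file]


for mode in ("client", "cluster"):
    command = " ".join(yarn_submit_args(mode, "job.py"))
    print(f"{command}  (driver runs on {DRIVER_LOCATION[mode]})")
```

Client mode keeps the driver's output on the submitting terminal, which suits interactive work. In cluster mode the job can keep running after the client disconnects.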