Explain Shared variable in Apache Spark.
This topic has 3 replies, 1 voice, and was last updated 5 years, 7 months ago by DataFlair Team.
September 20, 2018 at 5:10 pm #6119 by DataFlair Team (Spectator)
What is a shared variable in Spark?
September 20, 2018 at 5:10 pm #6120 by DataFlair Team (Spectator)
When we pass a function to a Spark operation, it executes on remote nodes in the cluster. Each task works on its own copy of the variables used in the function: these copies are shipped to every machine, and updates made to them are never propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient, so Apache Spark instead provides two restricted types of shared variables: broadcast variables and accumulators.
A broadcast variable caches a read-only value on each machine, rather than shipping a copy of it with every task. Accumulators are variables that tasks can only "add" to through an associative and commutative operation, which is why they can be supported efficiently in parallel.
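As a quick sketch in Scala, assuming an existing SparkContext named `sc` (for example from spark-shell); the lookup map and variable names are illustrative:

```scala
// Broadcast: a read-only lookup cached once per executor,
// not shipped inside every task's closure.
val countryNames = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

// Accumulator: tasks may only add to it; the driver reads the merged result.
val unknownCodes = sc.longAccumulator("unknownCodes")

val codes = sc.parallelize(Seq("IN", "US", "XX"))
val resolved = codes.map { code =>
  if (!countryNames.value.contains(code)) unknownCodes.add(1)
  countryNames.value.getOrElse(code, "Unknown")
}.collect()
// resolved: Array("India", "United States", "Unknown")
```

Note that accumulator updates made inside a transformation such as map can be applied more than once if a task is retried; updates made inside an action (e.g. foreach) are applied exactly once.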
September 20, 2018 at 5:10 pm #6121 by DataFlair Team (Spectator)
Shared Variables
In Spark, when a function runs as part of an operation, it works on the variables referenced in that function. By default, a separate copy of each variable is shipped to every worker node, and updates to these copies are never returned to the driver program. This is inefficient, because the volume of data transferred can be very high.
Spark shared variables help reduce this data transfer. There are two types of shared variables: broadcast variables and accumulators.

Broadcast variable:
If we have a large dataset, then instead of transferring a copy of it with each task, we can use a broadcast variable, which is copied to each node once and shared by every task on that node. Broadcast variables thus give a large, read-only dataset to each node efficiently. We first create a broadcast variable using SparkContext.broadcast on the driver program, which distributes it to all nodes; the value method is then used to access the shared data. Broadcast variables pay off when tasks across multiple stages need the same data.

Accumulator:
Spark functions use variables defined in the driver program, and local copies of those variables are created on the workers. Accumulators are shared variables that let workers update values in parallel during execution and have the results merged back into the driver.
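A minimal accumulator sketch in Scala, again assuming an existing SparkContext `sc`; the log lines are made up for illustration:

```scala
val errorCount = sc.longAccumulator("errorCount")

val lines = sc.parallelize(Seq("ok", "ERROR: disk full", "ok", "ERROR: timeout"))
lines.foreach { line =>
  // Each worker adds to its own local copy of the accumulator...
  if (line.startsWith("ERROR")) errorCount.add(1)
}

// ...and Spark merges those copies back into the driver, where
// value can be read once the action has completed.
println(errorCount.value)  // 2
```

Reading errorCount.value inside a task is not supported; only the driver sees the merged result.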
September 20, 2018 at 5:10 pm #6123 by DataFlair Team (Spectator)
Shared Variable
By default, Spark follows a shared-nothing architecture: at the beginning of each stage, the driver program sends variables to the worker nodes, which modify them locally. Sharing variables between tasks is possible through the SparkContext API:
1. Broadcast variables: a read-only copy of a variable is sent to each worker node so that tasks can reference it.
2. Accumulators: variables that tasks can only add to; the driver reads the aggregated result (e.g. a sum).
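The shared-nothing point above can be illustrated with a Scala sketch (assuming a SparkContext `sc`): a plain var captured in a closure is copied per task, so its updates are lost, whereas an accumulator is merged back to the driver:

```scala
var counter = 0
sc.parallelize(1 to 10).foreach(x => counter += x)
// On a cluster each task updates a worker-local copy, so the driver's
// counter is still 0 here (local-mode behaviour may differ).

val sum = sc.longAccumulator("sum")
sc.parallelize(1 to 10).foreach(x => sum.add(x))
println(sum.value)  // 55
```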