Explain shared variables in Apache Spark.

    • #6119
      DataFlair Team
      Spectator

      What is a shared variable in Spark?

    • #6120
      DataFlair Team
      Spectator

      When we pass a function to a Spark operation, it executes on a remote node in the cluster. Each node works on its own copies of all the variables used in that function; these copies are shipped to every machine in the cluster, and updates to them are never propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient, so Apache Spark provides two types of shared variables: broadcast variables and accumulators.

      A broadcast variable caches a read-only value on each machine rather than shipping a copy of it with every task. Accumulators are variables that are only "added" to through an associative and commutative operation, which is why they can be supported efficiently in parallel.

    • #6121
      DataFlair Team
      Spectator

      Shared Variables
      In Spark, when a function runs as part of an operation, it works on its own copies of the variables used in that function. Multiple copies of the same variables are shipped to each worker node, and updates to these copies never return to the driver program. This is inefficient, because the amount of data transferred can be very high.
      Spark shared variables help reduce this data transfer. There are two types of shared variables: broadcast variables and accumulators.

      Broadcast variable:

      If we have a large dataset, instead of shipping a copy of it with every task, we can use a broadcast variable, which is copied to each node once and shared by all tasks on that node. Broadcast variables are a way to give every node an efficient, read-only copy of a large dataset.
      First, we create the broadcast variable with SparkContext.broadcast in the driver program, which ships it to all nodes. The value method is then used to access the shared data inside tasks. Broadcast variables pay off when tasks across multiple stages use the same data.
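
      A minimal sketch in Scala, assuming an existing SparkContext named sc and an illustrative lookup map, showing how a broadcast variable is created in the driver and read inside tasks:

        // The lookup map is illustrative; `sc` is assumed to be an existing SparkContext.
        val countryNames = Map("IN" -> "India", "US" -> "United States", "DE" -> "Germany")

        // Ship the map to every executor once, instead of with every task.
        val broadcastNames = sc.broadcast(countryNames)

        val codes = sc.parallelize(Seq("IN", "US", "DE", "IN"))

        // Inside the task, read the shared data through .value (read-only).
        val resolved = codes.map(code => broadcastNames.value.getOrElse(code, "Unknown"))

        resolved.collect().foreach(println)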

      Accumulator:

      Spark functions use variables defined in the driver program, and local copies of those variables are generated on the workers. Accumulators are shared variables that the workers can update
      in parallel during execution, with the merged result sent back to the driver.
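
      A minimal sketch in Scala, again assuming an existing SparkContext named sc, that uses a long accumulator to count malformed records while parsing (the input lines are illustrative):

        val lines = sc.parallelize(Seq("10", "20", "oops", "30"))

        // Workers add to the accumulator; only the driver reads its final value.
        val badRecords = sc.longAccumulator("badRecords")

        val numbers = lines.flatMap { line =>
          try {
            Some(line.toInt)
          } catch {
            case _: NumberFormatException =>
              badRecords.add(1)   // updated in parallel across tasks
              None
          }
        }

        numbers.count()   // an action triggers the tasks and the accumulator updates
        println(s"Bad records: ${badRecords.value}")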

    • #6123
      DataFlair Team
      Spectator

      Shared Variables
      Spark follows a shared-nothing architecture by default. At the beginning of each stage, the driver program sends the variables used by a task to the worker nodes, which modify only their local copies.

      Variables can be shared between tasks using the SparkContext APIs:

      1. Broadcast variables: a read-only copy of the variable is sent to each worker node so that tasks can reference it.

      2. Accumulators: from the tasks' point of view these are write-only variables that store the result of an aggregation (e.g. a sum); only the driver reads the final value.
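
      A short sketch in Scala combining both types in one job (assuming an existing SparkContext named sc; the stop-word set and input are illustrative): a broadcast set is read by every task, while an accumulator counts how many words were dropped.

        val stopWords = sc.broadcast(Set("a", "an", "the"))
        val droppedWords = sc.longAccumulator("droppedWords")

        val words = sc.parallelize(Seq("the", "quick", "brown", "fox", "a", "lazy", "dog"))

        val kept = words.filter { w =>
          val keep = !stopWords.value.contains(w)   // read-only lookup on each worker
          if (!keep) droppedWords.add(1)            // write-only update from the tasks
          keep
        }

        println(s"Kept: ${kept.collect().mkString(", ")}")
        println(s"Dropped: ${droppedWords.value}")

      Note that accumulator updates made inside transformations such as filter may be applied more than once if a task is re-executed; only updates made inside actions are guaranteed to be counted exactly once.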
