Explain shared variables in Apache Spark.

    • #6119
      DataFlair Team
      Spectator

      What is a shared variable in Spark?

    • #6120
      DataFlair Team
      Spectator

      When we pass a function to a Spark operation, it executes on a remote node in the cluster. Each node works on its own copies of all the variables used in that function; these copies are shipped to every machine in the cluster, and updates to them are never propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient, so Apache Spark provides two types of shared variables: broadcast variables and accumulators.

      A broadcast variable caches a read-only value on each machine rather than shipping a copy of it with every task. Accumulators are variables that are only "added" to through an associative and commutative operation, which is why they can be supported efficiently in parallel.

    • #6121
      DataFlair Team
      Spectator

      Shared Variables
      In Spark, when a function runs as part of an operation, it works on its own copies of the variables used in that function. Multiple copies of the same variables are shipped to each worker node, and updates to these copies never return to the driver program. This is inefficient, because the amount of data transferred can be very high.
      Spark shared variables help reduce this data transfer. There are two types of shared variables: broadcast variables and accumulators.

      Broadcast variable:

      If we have a large dataset, instead of shipping a copy of it with every task, we can use a broadcast variable, which is copied to each node once and shared by all tasks on that node. Broadcast variables are a way to give every node an efficient, read-only copy of a large dataset.
      First, we create the broadcast variable with SparkContext.broadcast in the driver program, which ships it to all nodes. The value method is then used to access the shared data inside tasks. Broadcast variables pay off when tasks across multiple stages use the same data.
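
      A minimal sketch in Scala, assuming an existing SparkContext named sc and an illustrative lookup map, showing how a broadcast variable is created in the driver and read inside tasks:

        // The lookup map is illustrative; `sc` is assumed to be an existing SparkContext.
        val countryNames = Map("IN" -> "India", "US" -> "United States", "DE" -> "Germany")

        // Ship the map to every executor once, instead of with every task.
        val broadcastNames = sc.broadcast(countryNames)

        val codes = sc.parallelize(Seq("IN", "US", "DE", "IN"))

        // Inside the task, read the shared data through .value (read-only).
        val resolved = codes.map(code => broadcastNames.value.getOrElse(code, "Unknown"))

        resolved.collect().foreach(println)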

      Accumulator:

      Spark functions use variables defined in the driver program, and local copies of those variables are generated on the workers. Accumulators are shared variables that the workers can update
      in parallel during execution, with the merged result sent back to the driver.
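
      A minimal sketch in Scala, again assuming an existing SparkContext named sc, that uses a long accumulator to count malformed records while parsing (the input lines are illustrative):

        val lines = sc.parallelize(Seq("10", "20", "oops", "30"))

        // Workers add to the accumulator; only the driver reads its final value.
        val badRecords = sc.longAccumulator("badRecords")

        val numbers = lines.flatMap { line =>
          try {
            Some(line.toInt)
          } catch {
            case _: NumberFormatException =>
              badRecords.add(1)   // updated in parallel across tasks
              None
          }
        }

        numbers.count()   // an action triggers the tasks and the accumulator updates
        println(s"Bad records: ${badRecords.value}")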

    • #6123
      DataFlair Team
      Spectator

      Shared Variables
      Spark follows a shared-nothing architecture by default. At the beginning of each stage, the driver program sends the variables used by a task to the worker nodes, which modify only their local copies.

      Variables can be shared between tasks using the SparkContext APIs:

      1. Broadcast variables: a read-only copy of the variable is sent to each worker node so that tasks can reference it.

      2. Accumulators: from the tasks' point of view these are write-only variables that store the result of an aggregation (e.g. a sum); only the driver reads the final value.
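
      A short sketch in Scala combining both types in one job (assuming an existing SparkContext named sc; the stop-word set and input are illustrative): a broadcast set is read by every task, while an accumulator counts how many words were dropped.

        val stopWords = sc.broadcast(Set("a", "an", "the"))
        val droppedWords = sc.longAccumulator("droppedWords")

        val words = sc.parallelize(Seq("the", "quick", "brown", "fox", "a", "lazy", "dog"))

        val kept = words.filter { w =>
          val keep = !stopWords.value.contains(w)   // read-only lookup on each worker
          if (!keep) droppedWords.add(1)            // write-only update from the tasks
          keep
        }

        println(s"Kept: ${kept.collect().mkString(", ")}")
        println(s"Dropped: ${droppedWords.value}")

      Note that accumulator updates made inside transformations such as filter may be applied more than once if a task is re-executed; only updates made inside actions are guaranteed to be counted exactly once.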
