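In Spark, data transfer can be reduced by avoiding operations that trigger a shuffle, because a shuffle moves data between executors over the network.
Operations that can cause a shuffle include repartitioning operations such as repartition and coalesce, ByKey operations such as groupByKey and reduceByKey, and join operations such as cogroup and join. Use them sparingly, and where one is unavoidable prefer the cheaper variant: for example, reduceByKey over groupByKey, because reduceByKey combines values within each partition before shuffling.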
Spark shared variables also help reduce data transfer. There are two types of shared variables: broadcast variables and accumulators.
If tasks need a large read-only dataset, Spark would normally ship a copy of it with every task. A broadcast variable is instead copied to each node once,
and all tasks on that node share that copy. This makes broadcast variables an efficient way to give every node a copy of a large dataset.
First, create the broadcast variable in the driver program by calling SparkContext.broadcast on the data; Spark then distributes it to the nodes. Tasks read
the shared data through value (for example, broadcastVar.value). Explicitly creating a broadcast variable is only worthwhile when tasks across multiple stages need the same data.
Functions passed to Spark operations use variables defined in the driver program, but each task works on its own local copy, and updates to those copies are not sent back to the driver. Accumulators are shared variables that tasks can add to
in parallel during execution, so that results flow from the workers back to the driver.