How can data transfer be minimized when working with Apache Spark?
- This topic has 1 reply, 1 voice, and was last updated 6 years ago by DataFlair Team.
September 20, 2018 at 4:50 pm #5963 DataFlair Team (Spectator)
Define the various techniques to reduce data transfer in Apache Spark.
What are the ways to decrease data usage in Spark?
September 20, 2018 at 4:51 pm #5965 DataFlair Team (Spectator)
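In Spark, data transfer can be reduced in two ways: by using shared variables and by avoiding operations that shuffle data across the cluster.
There are two types of shared variables, broadcast variables and accumulators; both are described further down.
Shuffles are triggered by repartition (and by coalesce when shuffle = true), by ByKey operations like groupByKey and reduceByKey, and by join operations like cogroup and join. Use these operations only when they are needed. Where a shuffle cannot be avoided, prefer reduceByKey over groupByKey: reduceByKey combines the values for each key inside every partition before the shuffle, so less data is sent over the network.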
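As a rough illustration of why the choice of ByKey operation matters for shuffle volume, here is a plain-Python model (not Spark code) that counts how many records would be handed to the shuffle when summing values per key. The function names and the sample partitions are made up for this sketch, and it assumes every record written to the shuffle counts equally, which ignores serialization, compression, and records that happen to stay on the same node.

```python
from collections import defaultdict


def shuffle_group_by_key(partitions):
    """Model of groupByKey followed by a sum: every record is shuffled as-is."""
    shuffled = [record for part in partitions for record in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    totals = {key: sum(values) for key, values in grouped.items()}
    return totals, len(shuffled)


def shuffle_reduce_by_key(partitions):
    """Model of reduceByKey: values are summed inside each partition first."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        # Only one record per key per partition is handed to the shuffle.
        shuffled.extend(local.items())
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


# Hypothetical input: three partitions of (key, 1) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("b", 1), ("a", 1)],
    [("c", 1), ("c", 1), ("c", 1), ("c", 1)],
]

group_totals, group_records = shuffle_group_by_key(partitions)
reduce_totals, reduce_records = shuffle_reduce_by_key(partitions)

print(group_totals == reduce_totals)   # True: both give the same sums
print(group_records, reduce_records)   # 11 5
```

Both approaches produce the same totals, but in this model the reduceByKey style hands 5 records to the shuffle instead of 11. The gap grows with the number of repeated keys per partition.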
Broadcast variable:
If tasks need a large read-only dataset, such as a lookup table, we can use a broadcast variable instead of transferring a copy of the dataset with every task. The data is copied to each node once, and all tasks on that node share it.
First, create the broadcast variable in the driver program with SparkContext.broadcast; Spark then distributes it to the nodes. Tasks read the shared data through the variable's value method. Explicitly creating a broadcast variable is mainly useful when tasks across multiple stages need the same data.
Accumulator:
Spark functions use variables defined in the driver program, and each task works on its own local copy of those variables, so updates made on the workers are not sent back to the driver. Accumulators are shared variables that tasks can only add to: workers update them in parallel during execution, and the driver reads the combined result.
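To put rough numbers on the broadcast variable saving described above, here is a small back-of-the-envelope sketch in plain Python (not Spark code). The function names and the cluster figures are assumptions chosen for illustration, and the model ignores serialization overhead and the way Spark actually distributes broadcast blocks between nodes.

```python
def bytes_shipped_per_task(table_bytes, num_tasks):
    """Without a broadcast variable: the table travels with every task."""
    return table_bytes * num_tasks


def bytes_shipped_broadcast(table_bytes, num_nodes):
    """With a broadcast variable: each node receives the table once."""
    return table_bytes * num_nodes


MB = 1024 * 1024

# Hypothetical job: a 50 MB lookup table, 200 tasks, 4 worker nodes.
table_bytes = 50 * MB
num_tasks = 200
num_nodes = 4

per_task_mb = bytes_shipped_per_task(table_bytes, num_tasks) // MB
broadcast_mb = bytes_shipped_broadcast(table_bytes, num_nodes) // MB

print(per_task_mb)    # 10000
print(broadcast_mb)   # 200
```

Under these assumptions the broadcast approach moves 200 MB instead of 10000 MB, and the saving scales with the number of tasks that run on each node.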