How can data transfer be minimized when working with Apache Spark?

This topic has 1 reply, 1 voice, and was last updated 5 years, 6 months ago by DataFlair Team.

- September 20, 2018 at 4:50 pm #5963
  
  DataFlair Team
  Spectator
  
  Define the various techniques to reduce data transfer in Apache Spark.
  What are the ways to decrease data usage in Spark?
- September 20, 2018 at 4:51 pm #5965
  
  DataFlair Team
  Spectator
  
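  In Spark, data transfer can be reduced by avoiding operations that trigger a shuffle, because a shuffle moves data between executors over the network.
  Operations that shuffle include repartition, ByKey operations such as groupByKey and reduceByKey, and join operations such as cogroup and join. By default coalesce only merges existing partitions and avoids a full shuffle, so prefer it to repartition when you just need fewer partitions. When a ByKey aggregation cannot be avoided, prefer reduceByKey to groupByKey, because reduceByKey combines values within each partition before the shuffle.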
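
The ByKey operations mentioned above do not all shuffle the same amount of data. The following is a minimal plain-Python sketch (it models the behaviour and does not call Spark) of why a reduceByKey-style aggregation, which combines values inside each partition before the shuffle, sends fewer records than a groupByKey-style one. The function names and the sample partitions are hypothetical and chosen only for illustration.

```python
def shuffled_without_combine(partitions):
    """groupByKey-style model: every (key, value) record is shuffled."""
    return sum(len(part) for part in partitions)


def shuffled_with_combine(partitions, combine):
    """reduceByKey-style model: each partition pre-aggregates per key,
    so at most one record per key leaves each partition."""
    shuffled = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = combine(local[key], value) if key in local else value
        shuffled.extend(local.items())
    return shuffled


def final_reduce(shuffled, combine):
    """Merge the records that arrived after the (modelled) shuffle."""
    result = {}
    for key, value in shuffled:
        result[key] = combine(result[key], value) if key in result else value
    return result


def add(x, y):
    return x + y


# Hypothetical input: two partitions of (word, 1) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("b", 1), ("a", 1)],
]

records_no_combine = shuffled_without_combine(partitions)
combined = shuffled_with_combine(partitions, add)
totals = final_reduce(combined, add)

print("records shuffled without map-side combine:", records_no_combine)
print("records shuffled with map-side combine:", len(combined))
print("totals:", totals)
```

In this model the saving grows with the number of repeated keys per partition. If every key is unique within its partition, both approaches shuffle the same number of records.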
  
  Spark shared variables also help reduce data transfer. There are two types of shared variables: broadcast variables and accumulators.
  
  Broadcast variable:
  
  If tasks need a large read-only dataset, a broadcast variable avoids shipping a copy of it with every task: the data is copied to each node once
  and shared by all tasks running on that node. Broadcast variables are therefore an efficient way to give every node a copy of a large dataset.
  First, create the broadcast variable in the driver program with SparkContext.broadcast; Spark then distributes it to the nodes. Tasks read the shared value through the
  value method. Explicitly creating a broadcast variable is useful mainly when tasks across multiple stages need the same data.
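
To make the saving from a broadcast variable concrete, here is a small plain-Python model (again, not Spark code) that counts the bytes each node receives when a lookup table travels with every task, compared with when it is fetched once per node and cached. The `Executor` class, the lookup table, and the round-robin task assignment are all assumptions made for the sketch; real Spark distributes broadcast data with its own mechanism.

```python
import pickle


class Executor:
    """Hypothetical stand-in for one worker node."""

    def __init__(self):
        self.bytes_received = 0
        self.broadcast_cache = {}

    def run_with_closure(self, payload, key):
        # The serialized table arrives with every single task.
        self.bytes_received += len(payload)
        return pickle.loads(payload)[key]

    def run_with_broadcast(self, broadcast_id, fetch, key):
        # The table is fetched only the first time this node needs it.
        if broadcast_id not in self.broadcast_cache:
            payload = fetch()
            self.bytes_received += len(payload)
            self.broadcast_cache[broadcast_id] = pickle.loads(payload)
        return self.broadcast_cache[broadcast_id][key]


# Hypothetical lookup table and eight tasks spread over two nodes.
lookup = {i: f"name-{i}" for i in range(1000)}
payload = pickle.dumps(lookup)
task_keys = list(range(8))

closure_nodes = [Executor(), Executor()]
closure_results = [
    closure_nodes[i % 2].run_with_closure(payload, key)
    for i, key in enumerate(task_keys)
]
closure_bytes = sum(node.bytes_received for node in closure_nodes)

broadcast_nodes = [Executor(), Executor()]
broadcast_results = [
    broadcast_nodes[i % 2].run_with_broadcast(0, lambda: payload, key)
    for i, key in enumerate(task_keys)
]
broadcast_bytes = sum(node.bytes_received for node in broadcast_nodes)

print("bytes sent with per-task copies:", closure_bytes)
print("bytes sent with one copy per node:", broadcast_bytes)
```

Both runs produce the same lookup results. Only the number of times the table crosses the (modelled) network differs: once per task in the first case, once per node in the second.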
  
  Accumulator:
  
  Functions passed to Spark operations work on local copies of the variables defined in the driver program, and updates to those copies are not sent back to the driver. Accumulators are shared variables that tasks can
  add to in parallel during execution; Spark merges the updates from the workers and makes the result available to the driver.
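
The accumulator pattern can be sketched in plain Python as follows. This is a model of the idea, not the Spark API: each task adds to its own local accumulator, and only that small value is merged back at the driver. The `CounterAccumulator` class, the `run_task` helper, and the sample records are hypothetical.

```python
class CounterAccumulator:
    """Hypothetical add-only counter that can be merged with another one."""

    def __init__(self, value=0):
        self.value = value

    def add(self, amount):
        self.value += amount

    def merge(self, other):
        self.value += other.value


def run_task(partition):
    """Parse one partition, counting bad records in a task-local accumulator."""
    local_acc = CounterAccumulator()
    parsed = []
    for line in partition:
        try:
            parsed.append(int(line))
        except ValueError:
            local_acc.add(1)
    return parsed, local_acc


# Hypothetical input: three partitions of raw text records.
partitions = [["1", "2", "x"], ["oops", "4"], ["5", "6"]]

bad_records = CounterAccumulator()  # lives on the (modelled) driver
parsed_values = []
for part in partitions:
    values, local_acc = run_task(part)
    parsed_values.extend(values)
    bad_records.merge(local_acc)  # only the small counter is sent back

print("parsed values:", parsed_values)
print("bad records:", bad_records.value)
```

The point for data transfer is that the driver learns how many records failed without the failed records themselves being collected.
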