How can data transfer be minimized when working with Apache Spark?


Viewing 1 reply thread
  • Author
    Posts
    • #5963
      DataFlair Team
      Moderator

      Define the various techniques to reduce data transfer in Apache Spark.
      What are the ways to decrease data transfer in Spark?

    • #5965
      DataFlair Team
      Moderator

      In Spark, data transfer can be reduced by avoiding operations that trigger a shuffle.
      Minimize the use of repartitioning operations like repartition and coalesce, ByKey operations like groupByKey and reduceByKey, and join operations like cogroup and join.
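      As an illustrative sketch (the RDD contents here are made up), both ByKey transformations below trigger a shuffle, though reduceByKey combines values on the map side first and therefore moves less data across the network:

      ```scala
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder
        .appName("ShuffleSketch")
        .master("local[*]")
        .getOrCreate()
      val sc = spark.sparkContext

      val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

      // Shuffles every individual value across the network.
      val grouped = pairs.groupByKey()

      // Also shuffles, but pre-aggregates per partition (map-side combine),
      // so far fewer records cross the network.
      val summed = pairs.reduceByKey(_ + _)
      ```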

      Spark shared variables also help in reducing data transfer. There are two types of shared variables: broadcast variables and accumulators.

      Broadcast variable:

      If we have a large read-only dataset, instead of shipping a copy of it with every task, we can use a broadcast variable, which is copied to each node once
      and shared by all tasks running on that node. In this way a broadcast variable efficiently gives a large dataset to every node.
      First, we create a broadcast variable using SparkContext.broadcast in the driver program; Spark then distributes it to all the worker nodes. The value method
      is used to access the shared data. A broadcast variable pays off when tasks across multiple stages need the same data.
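      A minimal sketch of that workflow (the lookup map and variable names are hypothetical):

      ```scala
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder
        .appName("BroadcastSketch")
        .master("local[*]")
        .getOrCreate()
      val sc = spark.sparkContext

      // A small read-only lookup table we want on every node once,
      // rather than serialized with each task's closure.
      val lookupTable = Map(1 -> "one", 2 -> "two", 3 -> "three")
      val broadcastLookup = sc.broadcast(lookupTable)

      val ids = sc.parallelize(Seq(1, 2, 3, 2))
      // Every task on a node reads the same shared copy via .value.
      val names = ids.map(id => broadcastLookup.value.getOrElse(id, "unknown"))
      names.collect().foreach(println)
      ```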

      Accumulator:

      Spark functions use variables defined in the driver program, and a separate local copy of each variable is shipped with every task. Accumulators are shared variables
      that can be updated in parallel during execution; their updates are merged on the workers and the result is sent back to the driver.
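      A short sketch of an accumulator counting malformed records in parallel (the input values and the badRecords name are made up):

      ```scala
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder
        .appName("AccumulatorSketch")
        .master("local[*]")
        .getOrCreate()
      val sc = spark.sparkContext

      // Tasks add to the accumulator; Spark merges the counts back to the driver.
      val badRecords = sc.longAccumulator("badRecords")

      val lines = sc.parallelize(Seq("1", "2", "oops", "4"))
      val parsed = lines.flatMap { s =>
        try Some(s.toInt)
        catch { case _: NumberFormatException => badRecords.add(1); None }
      }

      parsed.count()            // an action is needed to trigger the updates
      println(badRecords.value) // only the driver reads the merged value
      ```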
