What is the need for a DAG in Spark?

  • Author
    • #6441
      DataFlair Team

Why is a DAG used in Apache Spark?
What is its need in job execution?

    • #6442
      DataFlair Team

Need of DAG in Spark
As we know, Hadoop MapReduce had some limitations. To overcome them, Apache Spark introduced the DAG execution model. Let's first study the computation process of MapReduce, which is generally carried out in three steps:

1. Data is read from HDFS.
2. Map and Reduce operations are applied.
3. The result of the computation is written back to HDFS.
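As a rough illustration of this cycle (plain Python, not actual Hadoop code), each job must persist its full result to stable storage before any later job can read it. Here a `storage` dict stands in for HDFS; the job names and keys are made up for the sketch:

```python
# Toy illustration of the MapReduce read/compute/write cycle.
# The `storage` dict stands in for HDFS: every job reads its input from
# it and writes its full result back before the next job can start.
storage = {"input": ["a b", "b c", "a c"]}

def word_count_job(in_key, out_key):
    lines = storage[in_key]                                   # 1. read from "HDFS"
    pairs = [(w, 1) for line in lines for w in line.split()]  # 2a. map
    counts = {}
    for word, n in pairs:                                     # 2b. reduce
        counts[word] = counts.get(word, 0) + n
    storage[out_key] = counts                                 # 3. write back to "HDFS"

word_count_job("input", "counts")
print(storage["counts"])  # {'a': 2, 'b': 2, 'c': 2}
```

If a second job needed these counts, it would have to read them back from storage, which is exactly the overhead described next.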

In Hadoop, each MapReduce job is independent of the others, and Hadoop has no idea which MapReduce job may come next. Therefore, between two MapReduce jobs of an iterative computation, the intermediate result is often needlessly written to and read back from stable storage (HDFS). As a result, disk space and I/O are wasted.

Moreover, in a multi-step pipeline, each job is blocked from starting until the previous job has completed. Hence, a complex computation can take a long time even on a small data volume.

But after the DAG was introduced in Spark, a DAG (Directed Acyclic Graph) of consecutive computation stages is formed, so the execution plan can be optimized, e.g. to minimize shuffling data around.
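A minimal sketch of the idea (plain Python with a hypothetical `Dataset` class, not the real Spark API): each transformation only records a node in a lineage graph, and nothing executes until an action like `collect()` runs the whole chain in memory, with no intermediate write to storage between stages:

```python
# Hypothetical sketch of lazy, DAG-style chaining (NOT the real Spark API).
# Each transformation records a node in the lineage graph; computation is
# deferred until an action (collect) triggers evaluation of the whole chain.
class Dataset:
    def __init__(self, data=None, parent=None, fn=None):
        self.data, self.parent, self.fn = data, parent, fn

    def map(self, fn):
        # Record a new DAG node; no computation, no intermediate write.
        return Dataset(parent=self, fn=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return Dataset(parent=self, fn=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        # Action: walk the lineage back to the source, then run every
        # stage in memory in one pass, with no round-trip to storage.
        if self.parent is None:
            return self.data
        return self.fn(self.parent.collect())

nums = Dataset(data=[1, 2, 3, 4, 5])
result = nums.map(lambda x: x * x).filter(lambda x: x % 2 == 1).collect()
print(result)  # [1, 9, 25]
```

Because the full graph is known before anything runs, a scheduler can also reorder and group stages; that is the optimization opportunity MapReduce's independent jobs never get.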

      To learn more about DAG, follow the link: Directed Acyclic Graph(DAG) in Apache Spark
