What is the need for DAG in Spark?
September 20, 2018 at 10:06 pm #6441 — DataFlair Team (Spectator)
Why is a DAG used in Apache Spark?
What is its need in job execution?
September 20, 2018 at 10:06 pm #6442 — DataFlair Team (Spectator)
Need for DAG in Spark
As we know, Hadoop MapReduce has some limitations, and the DAG was introduced in Spark to overcome them. Let's first look at the computation process of MapReduce. It is generally carried out in three steps:
1. Data is read from HDFS.
2. Map and Reduce operations are applied.
3. The result of the computation is written back to HDFS.
In Hadoop, each MapReduce job is independent of the others; Hadoop has no idea which MapReduce job will come next. Therefore, for iterative computations it needlessly reads and writes the intermediate result between two MapReduce jobs. As a result, a lot of I/O against stable storage (HDFS) is wasted.
Moreover, in a multi-step pipeline, each job is blocked until the previous job completes. Hence, a complex computation can take a long time even with a small data volume; the MapReduce-style flow is sketched below.
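As a rough sketch, the read-process-write cycle described above can be mimicked with the Spark RDD API so the disk round-trips become visible. The HDFS paths and the word-count logic are illustrative assumptions only:

import org.apache.spark.sql.SparkSession

object MapReduceStyleRounds {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("MapReduceStyleRounds").getOrCreate()
    val sc = spark.sparkContext

    // Round 1: read from HDFS, apply Map and Reduce, write the intermediate result back to HDFS.
    val round1 = sc.textFile("hdfs:///data/input")         // step 1: read data from HDFS
      .flatMap(_.split("\\s+"))                             // step 2: Map ...
      .map(word => (word, 1))
      .reduceByKey(_ + _)                                   //         ... and Reduce
    round1.saveAsTextFile("hdfs:///data/intermediate")      // step 3: write the result back to HDFS

    // Round 2: an independent job must re-read the intermediate result from HDFS,
    // because Hadoop does not know that another round follows the first one.
    val round2 = sc.textFile("hdfs:///data/intermediate")   // wasted read of stable storage
      .map(_.toUpperCase)                                   // placeholder for the next Map/Reduce round
    round2.saveAsTextFile("hdfs:///data/output")            // final write to HDFS

    spark.stop()
  }
}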
With the DAG introduced in Spark, the execution plan can be optimized, e.g. to minimize shuffling data around, because a DAG (Directed Acyclic Graph) of the consecutive computation stages is formed before anything executes.
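For contrast, here is a minimal sketch of the same kind of pipeline written as a single Spark job, assuming the same illustrative paths and word-count logic as above. The chain of transformations is only a plan until an action runs; Spark turns that plan into a DAG of stages, and toDebugString prints the lineage it derived:

import org.apache.spark.sql.SparkSession

object DagStyleJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("DagStyleJob").getOrCreate()
    val sc = spark.sparkContext

    val counts = sc.textFile("hdfs:///data/input")
      .flatMap(_.split("\\s+"))      // narrow transformation: pipelined into one stage
      .map(word => (word, 1))        // narrow transformation: same stage
      .reduceByKey(_ + _)            // wide transformation: a shuffle starts a new stage
      .cache()                       // keep the intermediate result in memory instead of HDFS

    // The next step reuses the cached result; no read/write round-trip to stable storage.
    val frequent = counts.filter { case (_, n) => n > 10 }

    // Print the lineage: Spark shows the DAG of stages it built from the chain above.
    println(frequent.toDebugString)

    frequent.saveAsTextFile("hdfs:///data/output")
    spark.stop()
  }
}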
To learn more about DAG, follow the link: Directed Acyclic Graph (DAG) in Apache Spark