The Need for a DAG in Spark
As we know, Hadoop MapReduce has some limitations. To overcome them, Apache Spark introduced a DAG-based execution model. Let's first look at the computation process of MapReduce, which is generally carried out in three steps:
1. The data is read from HDFS.
2. Map and Reduce operations are applied to it.
3. The result of the computation is written back to HDFS.
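The three steps above can be sketched in plain Python (no Hadoop required); the in-memory lists and the returned dictionary stand in for files on HDFS, and the word-count logic is just a classic illustrative example:

```python
def run_mapreduce(input_lines):
    # Step 1: "read" the input (a list standing in for an HDFS file).
    records = list(input_lines)

    # Step 2a: Map phase - emit (word, 1) pairs for a word count.
    mapped = [(word, 1) for line in records for word in line.split()]

    # Step 2b: Shuffle/Reduce phase - sum the counts per word.
    counts = {}
    for word, n in mapped:
        counts[word] = counts.get(word, 0) + n

    # Step 3: "write" the result back (here we just return it; Hadoop
    # would persist it to HDFS before the next job could read it again).
    return counts

print(run_mapreduce(["to be or not", "to be"]))
```

Note that in real Hadoop, step 3 always hits stable storage, which is exactly the cost discussed next.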
In Hadoop, each MapReduce job is independent of the others; Hadoop has no idea which MapReduce job may come next. Therefore, in iterative workloads it is often unnecessary to write an intermediate result to storage and read it back between two MapReduce jobs, yet Hadoop does so anyway. As a result, disk space and I/O on stable storage (HDFS) are wasted.
In a multi-step workflow, each job is blocked until the previous job completes. Hence, a complex computation can take a long time even on a small volume of data.
After the DAG was introduced in Spark, a DAG (Directed Acyclic Graph) of consecutive computation stages is formed, so the whole execution plan can be optimized at once, e.g. to minimize shuffling data around.
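To make the idea concrete, here is a minimal sketch of such a stage graph in plain Python using the standard-library `graphlib` module (Python 3.9+). The stage names are hypothetical, not Spark's actual internal stage names; the point is that because the graph is acyclic, a scheduler can see every stage and its dependencies up front and order the work before running anything:

```python
from graphlib import TopologicalSorter

# Hypothetical DAG of computation stages: each stage maps to the
# stages it depends on (its predecessors).
stage_deps = {
    "read":      [],
    "filter":    ["read"],
    "map":       ["read"],
    "join":      ["filter", "map"],   # the only stage needing a shuffle
    "aggregate": ["join"],
}

# A topological order is a valid execution plan: every stage runs
# only after all of its dependencies have finished.
order = list(TopologicalSorter(stage_deps).static_order())
print(order)
```

Seeing the full graph is what lets Spark pipeline `filter` and `map` without materializing their outputs to stable storage, unlike independent MapReduce jobs.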