The Need for a DAG in Spark
As we know, Hadoop MapReduce has some limitations. To overcome them, Apache Spark introduced a DAG-based execution model. Let's first look at the computation process of MapReduce, which is generally carried out in three steps:
1. The data is read from HDFS.
2. Map and Reduce operations are applied to it.
3. The result of the computation is written back to HDFS.
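The three steps above can be sketched in plain Python (no Hadoop required); the in-memory lists and the returned dictionary stand in for files on HDFS, and the word-count logic is just a classic illustrative example:

```python
def run_mapreduce(input_lines):
    # Step 1: "read" the input (a list standing in for an HDFS file).
    records = list(input_lines)

    # Step 2a: Map phase - emit (word, 1) pairs for a word count.
    mapped = [(word, 1) for line in records for word in line.split()]

    # Step 2b: Shuffle/Reduce phase - sum the counts per word.
    counts = {}
    for word, n in mapped:
        counts[word] = counts.get(word, 0) + n

    # Step 3: "write" the result back (here we just return it; Hadoop
    # would persist it to HDFS before the next job could read it again).
    return counts

print(run_mapreduce(["to be or not", "to be"]))
```

Note that in real Hadoop, step 3 always hits stable storage, which is exactly the cost discussed next.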
In Hadoop, each MapReduce job is independent of the others; Hadoop has no idea which MapReduce job may come next. Therefore, in iterative workloads it is often unnecessary to write an intermediate result to storage and read it back between two MapReduce jobs, yet Hadoop does so anyway. As a result, disk space and I/O on stable storage (HDFS) are wasted.
In a multi-step workflow, each job is blocked until the previous job completes. Hence, a complex computation can take a long time even on a small volume of data.
After the DAG was introduced in Spark, a DAG (Directed Acyclic Graph) of consecutive computation stages is formed, so the whole execution plan can be optimized at once, e.g. to minimize shuffling data around.
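To make the idea concrete, here is a minimal sketch of such a stage graph in plain Python using the standard-library `graphlib` module (Python 3.9+). The stage names are hypothetical, not Spark's actual internal stage names; the point is that because the graph is acyclic, a scheduler can see every stage and its dependencies up front and order the work before running anything:

```python
from graphlib import TopologicalSorter

# Hypothetical DAG of computation stages: each stage maps to the
# stages it depends on (its predecessors).
stage_deps = {
    "read":      [],
    "filter":    ["read"],
    "map":       ["read"],
    "join":      ["filter", "map"],   # the only stage needing a shuffle
    "aggregate": ["join"],
}

# A topological order is a valid execution plan: every stage runs
# only after all of its dependencies have finished.
order = list(TopologicalSorter(stage_deps).static_order())
print(order)
```

Seeing the full graph is what lets Spark pipeline `filter` and `map` without materializing their outputs to stable storage, unlike independent MapReduce jobs.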