This tutorial will help you in understanding why Apache Flink came into existence, what is the need of Apache Flink, Apache Flink features that distinguish it from other technologies and Why companies are using Apache Flink to fulfill their requirements.
Before learning why Flink, you should understand Flink key concepts.
Apache flink reduces the complexity that has been faced by other distributed data driven engines. It achieves this feature by integrating query optimization, concepts from database systems and efficient parallel in-memory and out-of-core algorithms, with the MapReduce framework. This creates Comparison between Flink, Spark and MapReduce. Though Flink is compatible with Hadoop, still it is the most demanding technology now because of the following main features of Flink.
a. True streaming engine
Infrastructure based on streaming data not only enables new type of latency-critical applications and give more actual operational insights through more updated views of the processes. They are more potential in making classical data warehousing setups radically more simple and flexible.
A crucial piece of a streaming infrastructure is a stream processor that can deliver high throughput, low latency and strong consistency guarantees even in the presence of stateful computation. Apache Flink is a scalable stream processing engine that provides exactly this combination of properties.
Flink uses one common runtime for data streaming applications and batch processing applications.
We eventually want to arrive at a certain mix of desiderata for streaming application. These are the following:
- Exactly-once guarantees: After a failure, state in stateful operators should be correctly restored
- Low latency: Many applications require sub-second latency as the lower the better.
- High throughput: As data grows, pushing large amounts of data through the pipeline is crucial
- Powerful computation model: The framework should offer a programming model that does not restrict the user and allows a wide variety of applications
- Low overhead: It is required for fault tolerance mechanism in absence of failures
- Flow control: Backpressure from slow operators should be naturally absorbed by the system and the data sources to avoid crashes or degrading performance due to slow consumers
We have several approaches of fault-tolerant streaming architectures, starting from record acknowledgements to micro-batching and distributed snapshots.
But do you know what they exactly are? Let us see them 1 by 1:
Apache storm: It uses a mechanism of upstream backup and record acknowledgement to guarantee that messages are reprocessed after a failure. Storm does not guarantee state consistency. Other problem with Storm’s mechanism is low throughput and problems with flow control, as the acknowledgment mechanism often falsely classifies failures under backpressure. This led to the next evolution of streaming architectures that is based on micro-batching.
Apache Spark: The next fault tolerant streaming architectures was the idea of micro-batching or stream discretization. The idea is that in order to overcome the complexity and overhead of record-level synchronization that comes with the model of continuous operators that process and buffer records, a continuous computation is broken down in a series of small, atomic batch jobs called micro-batches. Each micro-batch may either succeed or fail. At a failure, the latest micro-batch can be simply recomputed.
Read Differences between Spark and Storm for more detailed description.
Systems based on micro-batching can achieve quite a few of the desiderata outlined above (exactly-once guarantees, high throughput).
Then what is the problem?
- Programming model: Spark streaming changes the programming model from streaming to micro batching. This means users can no longer window data in periods other than multiples of the checkpoint interval. This makes Spark better than Hadoop but it does not support count based or session windows needed by many applications. On the other hand, more flexibility is provided by the pure streaming model with continuous operators that can mutate state.
- Flow control: Micro-batch architectures that use time-based batches have an inherent problem with backpressure effects. The micro batch will take longer than configured If processing takes longer in downstream operations (e.g., due to a compute-intensive operator, or a slow sink) than in the batching operator (typically the source). This leads either to more and more batches queueing up or to a growing micro-batch size.
- Latency: Micro-batching obviously limits the latency of jobs to that of the micro-batch. For simple applications sub-second batch latency is feasible while applications with multiple networks shuffle easily bring the latency up to multiple seconds.
Here’s the deal:
Apache Flink’s snapshot algorithm is based on a technique that was introduced in 1985 by Chandy and Lamport, to draw consistent snapshots of the current state of a distributed system without missing information and without recording duplicates. This architecture thus combines the benefits of following a true continuous operator model (low latency, flow control, and true streaming programming model) with high throughput, and exactly once guarantees provable by the Chandy-Lamport algorithm.
b. Custom Memory Manager:
Flink implements its own memory management inside the JVM. Its features are:
- C++ style memory management inside the jvm.
- User data stored in serialized bytes array in JVM.
- Memory is allocated, de-allocated and used strictly using on internal buffer pool implementation.
Advantages of custom memory manager are:
- Flink will not throw an OutOfMemoy Exception on you.
- Reduction of garbage collection.
- Very efficient disk spilling and network transfer.
- No need for runtime tuning.
- More reliable and stable performance
c. Native closed-loop iteration operators:
Flink has dedicated support for iterative computations. It iterates data by using streaming Architecture. The concept of iterative algorithm is tightly bounded into flink query optimizer. Flink’s pipelined architecture allows processing the streaming data faster with lower latency.
d. Automatic cost-based optimizer:
In Flink, Batch programs are automatically optimized to exploit situations where expensive operations (like shuffles and sorts) can be avoided, and when intermediate data should be cached.
e. Ease to use:
The Flink APIs make it easier to use than programming for Mapreduce and it is easier to test as compared to hadoop.
f. Unified Framework:
Big data is getting matured with Flink and Flink is a unified framework which allows building a single data workflow that hold streaming, batch, SQL and Machine learning. Flink can process graph using its own Gelly library and use Machine learning algorithm from its own FlinkML library. Flink also supports interactive queries and iterative algorithm not well served by Hadoop Mapreduce and extends mapreduce model with new operators like join, cross union, iterate, iterate delta, cogroup etc.
Want to know the best part?
Recent advancements in big data shows that industry wants to use single unified platform for various Big data requirements. As compared to multiple frameworks, single platform evolves faster. It’s very easy to develop, manage and maintain single platform. We need to master single technology to solve all big data problems. Flink is a generalized framework which replaces all big data stack. Apache Flink is a general purpose cluster computing tool, which alone can handle batch processing, interactive processing, Stream processing, Iterative processing, in-memory processing, and graph processing.