A data stream is data that arrives continuously as an unbounded sequence. For further processing, a streaming system separates this continuously flowing input into discrete units. Stream processing is the low-latency processing and analysis of such data.
Spark Streaming was added to Apache Spark in 2013. It enables scalable, fault-tolerant processing of live data streams. Data can be ingested from many sources, such as Kafka, Apache Flume, Amazon Kinesis, or TCP sockets, and processed with complex algorithms expressed through high-level functions such as map, reduce, join, and window. Finally, the processed data can be pushed out to filesystems, databases, and live dashboards.
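To make the window function more concrete, here is a minimal sketch in plain Python rather than the Spark API. It assumes the stream has already been cut into batches of words, and it counts words over a sliding window of the most recent batches. The function name `windowed_word_counts` and its parameters are hypothetical; the window length and slide are expressed here as numbers of batches.

```python
from collections import Counter
from typing import Dict, Iterator, List, Sequence


def windowed_word_counts(
    batches: Sequence[List[str]], window_length: int, slide: int
) -> Iterator[Dict[str, int]]:
    """Yield word counts over a sliding window of batches.

    window_length: how many of the most recent batches each window covers.
    slide: how many batches the window advances between results.
    (Both names are illustrative assumptions, not Spark parameters.)
    """
    for end in range(slide, len(batches) + 1, slide):
        # Early windows are shorter because fewer batches have arrived.
        start = max(0, end - window_length)
        counts: Counter = Counter()
        for batch in batches[start:end]:
            counts.update(batch)  # merge this batch's words into the window total
        yield dict(counts)


if __name__ == "__main__":
    sample = [["a", "b"], ["a"], ["c"], ["a", "c"]]
    for window in windowed_word_counts(sample, window_length=2, slide=1):
        print(window)
```

Each result combines several batches, which is what distinguishes a windowed operation from one applied to a single batch at a time.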
Internally, Spark Streaming receives live input data streams and divides the data into batches. The Spark engine then processes these batches to generate the final stream of results, also in batches.
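The batching step can be illustrated with a small plain-Python sketch. It is not Spark's implementation. It only assumes that events arrive in time order as `(timestamp, record)` pairs and groups them into consecutive fixed-length intervals. The name `discretize` and the parameter `batch_interval` are hypothetical.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def discretize(
    events: Iterable[Tuple[float, str]], batch_interval: float
) -> Iterator[List[str]]:
    """Group time-ordered (timestamp, record) events into batches.

    Each batch covers one interval of `batch_interval` seconds, starting
    at the timestamp of the first event. Intervals with no events yield
    an empty batch. Works on unbounded iterators because it is a generator.
    """
    batch: List[str] = []
    batch_end: Optional[float] = None
    for timestamp, record in events:
        if batch_end is None:
            batch_end = timestamp + batch_interval
        # Close every interval that ended before this event arrived.
        while timestamp >= batch_end:
            yield batch
            batch = []
            batch_end += batch_interval
        batch.append(record)
    if batch:
        yield batch  # flush the last, possibly partial, batch


if __name__ == "__main__":
    events = [(0.0, "a"), (0.4, "b"), (1.1, "c"), (3.2, "d")]
    for batch in discretize(events, batch_interval=1.0):
        print(batch)
```

Downstream code can then treat each yielded list as an ordinary finite dataset, which is the idea behind handing each batch to a batch engine.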
The basic abstraction of Spark Streaming is the Discretized Stream, or DStream for short, which represents a stream of data divided into small batches. DStreams are built on RDDs, Spark’s core data abstraction. Spark Streaming can also integrate with other Apache Spark components such as MLlib and Spark SQL.
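As a rough mental model, a DStream can be pictured as a sequence of batches, with each stream-level operation applied to every batch in turn. The toy class below sketches that idea in plain Python. `ToyDStream` is a hypothetical name, each Python list merely stands in for an RDD, and none of this is the actual Spark API.

```python
from functools import reduce
from typing import Any, Callable, Iterable, Iterator, List


class ToyDStream:
    """A stream modelled as a sequence of batches (lists standing in for RDDs)."""

    def __init__(self, batches: Iterable[List[Any]]):
        self._batches = batches

    def map(self, fn: Callable[[Any], Any]) -> "ToyDStream":
        # Apply fn to every element of every batch, lazily, batch by batch.
        return ToyDStream([fn(x) for x in batch] for batch in self._batches)

    def reduce(self, fn: Callable[[Any, Any], Any]) -> "ToyDStream":
        # Collapse each non-empty batch to a single-element batch.
        # In this sketch, empty batches stay empty.
        return ToyDStream(
            ([reduce(fn, batch)] if batch else []) for batch in self._batches
        )

    def __iter__(self) -> Iterator[List[Any]]:
        return iter(self._batches)


if __name__ == "__main__":
    stream = ToyDStream([[1, 2, 3], [], [4, 5]])
    sums_of_squares = stream.map(lambda x: x * x).reduce(lambda a, b: a + b)
    for batch in sums_of_squares:
        print(batch)
```

Because every operation returns another stream of batches, transformations can be chained while each batch is still processed independently.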