What is DStream in Spark Streaming?

DataFlair Team

To understand DStreams better, let's begin with a brief introduction to Spark Streaming.

Introduction to Spark Streaming
Spark Streaming, added to Apache Spark in 2013, is an extension of the core Spark API that provides scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources such as Kafka, Apache Flume, Amazon Kinesis, or TCP sockets, and processed with complex algorithms expressed through high-level functions such as map, reduce, join, and window. Finally, the processed data can be pushed out to filesystems, databases, and live dashboards.
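As a quick illustration, here is a minimal word-count sketch in Scala showing all three stages: ingesting lines from a TCP socket, processing them with flatMap/map/reduceByKey, and pushing each batch's result to the console. The host and port (localhost:9999) are assumptions for a local test, e.g. fed by `nc -lk 9999`.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    // Local two-thread master: one thread receives data, one processes it.
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc  = new StreamingContext(conf, Seconds(5)) // 5-second batch interval

    // Ingest: a DStream[String] of lines from a TCP socket (assumed localhost:9999).
    val lines = ssc.socketTextStream("localhost", 9999)

    // Process: high-level functions such as flatMap, map, and reduceByKey.
    val words      = lines.flatMap(_.split(" "))
    val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)

    // Push out: print each batch's counts (a stand-in for a real sink).
    wordCounts.print()

    ssc.start()             // start receiving and processing
    ssc.awaitTermination()  // block until the job is stopped
  }
}
```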

A Discretized Stream, or Spark DStream for short, is the key abstraction of Spark Streaming. It represents a stream of data divided into small batches. DStreams are built on Spark RDDs, Spark's core data abstraction, which also allows Spark Streaming to integrate with other Apache Spark components such as Spark SQL and Spark MLlib.
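To illustrate that integration, here is one hedged sketch (not the only way to do it) that converts each batch's RDD into a DataFrame and queries it with Spark SQL. It assumes a `words: DStream[String]` like the one built in the word-count example above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.dstream.DStream

// Sketch: query each micro-batch with Spark SQL. `words` is assumed to be
// a DStream[String] such as the one produced in the word-count example.
def countWithSql(words: DStream[String]): Unit = {
  words.foreachRDD { rdd =>
    // Reuse (or lazily create) a SparkSession from the RDD's configuration.
    val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
    import spark.implicits._

    val df = rdd.toDF("word")              // RDD[String] -> DataFrame
    df.createOrReplaceTempView("words")
    spark.sql("SELECT word, COUNT(*) AS total FROM words GROUP BY word").show()
  }
}
```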

DStream
As discussed above, a Spark DStream (Discretized Stream) is the basic abstraction of Spark Streaming: a continuous stream of data. It can receive input from sources such as Kafka, Flume, Kinesis, or TCP sockets, or it can be a data stream produced by transforming another DStream. At its core, a DStream is a continuous series of RDDs (Spark's core abstraction), where each RDD contains the data from one batch interval.
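A small sketch of that per-interval structure, again assuming a local socket source: each batch interval yields exactly one RDD, which `foreachRDD` exposes together with its batch time.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamBatches")
val ssc  = new StreamingContext(conf, Seconds(2)) // one RDD every 2 seconds

val lines = ssc.socketTextStream("localhost", 9999)

// Each batch interval produces exactly one RDD holding that interval's data.
lines.foreachRDD { (rdd, time) =>
  println(s"Batch at $time holds ${rdd.count()} records")
}

ssc.start()
ssc.awaitTermination()
```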

Any operation applied to a DStream is applied to all of its underlying RDDs. The DStream hides these details and instead provides the developer with a convenient high-level API. As a result, Spark DStreams make working with streaming data much easier.
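To make that concrete: a transformation declared on the DStream is applied to each underlying RDD, batch by batch, and `transform` exposes the same per-RDD application explicitly. (The `lines` DStream is assumed from the sketches above.)

```scala
// Both produce the same stream: the DStream-level map is applied to every
// underlying RDD, which `transform` simply makes explicit.
val upperViaDStream = lines.map(_.toUpperCase)
val upperViaRdds    = lines.transform(rdd => rdd.map(_.toUpperCase))
```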

To learn more about DStreams, see: Apache Spark DStream (Discretized Streams)
