What are the abstractions of Apache Spark?


    • #6384
      DataFlair Team
      Spectator

      List the abstractions of Apache Spark.
      How many abstractions does Apache Spark provide?

    • #6385
      DataFlair Team
      Spectator

      RDD (Resilient Distributed Dataset) is the core abstraction in Apache Spark. It is an immutable, fault-tolerant,
      distributed collection of statically typed objects that is usually kept in memory. The RDD API offers simple operations such as map, reduce, and filter that can be composed in arbitrary ways.
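      For illustration, here is a minimal Scala sketch (assuming a spark-shell session, where sc is the already-created SparkContext) that composes filter, map, and reduce:

        // Distribute a local collection as an RDD of integers.
        val nums = sc.parallelize(1 to 10)
        // Transformations are lazy: keep the even numbers, then square them.
        val evens = nums.filter(_ % 2 == 0)
        val squares = evens.map(n => n * n)
        // reduce is an action: it triggers the computation and returns 220.
        val total = squares.reduce(_ + _)
        println(total)

      Note that no RDD is ever modified in place; each transformation returns a new, immutable RDD, which is what makes Spark's lineage-based fault tolerance possible.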

      The DataFrame abstraction is built on top of RDDs and adds “named” columns. A Spark DataFrame therefore has rows of named columns, similar to a relational database table and to DataFrames in R and Python (pandas).
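      As a minimal sketch (again assuming a spark-shell session, where spark is the already-created SparkSession), the named columns let you query a DataFrame much like a table:

        import spark.implicits._  // already in scope in spark-shell

        // Build a DataFrame with two named columns from a local collection.
        val people = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")
        // Refer to columns by name, as in SQL.
        people.filter($"age" > 26).select("name").show()
        // +-----+
        // | name|
        // +-----+
        // |Alice|
        // +-----+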

      Apart from RDDs and DataFrames, there are more specialized data abstractions that work on top of them. The Streaming APIs, for example, were introduced to process real-time streaming data from sources such as Flume and Kafka. They expose a stream as a unified, continuous DataFrame abstraction that data engineers can use for interactive as well as batch queries. GraphFrame is another example of a specialized abstraction: it lets developers analyze social networks and other graphs alongside Excel-like two-dimensional data.
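      As a hedged sketch of the streaming abstraction (assuming a SparkSession named spark, the spark-sql-kafka connector on the classpath, and a Kafka broker at localhost:9092 with a topic named "events"; the broker address and topic name are illustrative assumptions), a stream is exposed as a continuous DataFrame and queried with the same operations:

        // Read from Kafka as an unbounded (streaming) DataFrame.
        val stream = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")  // assumed broker
          .option("subscribe", "events")                        // assumed topic
          .load()

        // Ordinary DataFrame operations apply: count records per Kafka key.
        val counts = stream
          .selectExpr("CAST(key AS STRING) AS key")
          .groupBy("key")
          .count()

        // Print the running counts to the console as the stream advances.
        counts.writeStream
          .outputMode("complete")
          .format("console")
          .start()
          .awaitTermination()

      Similarly, a GraphFrame (from the separate graphframes package) pairs a vertex DataFrame with an edge DataFrame, so graph queries and DataFrame queries share one representation; the vertices and edges below are made-up illustrative data:

        import org.graphframes.GraphFrame
        import spark.implicits._

        val vertices = Seq(("a", "Alice"), ("b", "Bob")).toDF("id", "name")
        val edges    = Seq(("a", "b", "follows")).toDF("src", "dst", "relationship")
        val graph    = GraphFrame(vertices, edges)
        graph.inDegrees.show()  // how many incoming edges each vertex has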

      To learn more about Spark’s core abstractions, follow the links below:

      1. Spark RDD

      2. DataFrame
