How many abstractions are provided by Apache Spark?

Viewing 2 reply threads
  • Author
    • #6458
      DataFlair Team

      List abstractions of Apache Spark.
      What are the abstractions of Apache Spark?

    • #6459
      DataFlair Team

      Apache Spark provides several abstractions:

      1. RDD:
      RDD stands for Resilient Distributed Dataset. An RDD is a read-only, partitioned collection of records. It is Spark’s core abstraction and its fundamental data structure. It supports in-memory computation on large clusters in a fault-tolerant manner. For more detail on RDDs, see: Spark RDD – Introduction, Features & Operations of RDD

      2. DataFrames:
      A DataFrame is a Dataset organized into named columns. It is equivalent to a table in a relational database or a data frame in R/Python. In other words, it is a relational table backed by good optimization techniques. It is an immutable distributed collection of data. As a higher-level abstraction, it lets developers impose a structure onto a distributed collection of data. For more detail on DataFrames, see: Spark SQL DataFrame Tutorial – An Introduction to DataFrame

      3. Spark Streaming:
      It is an extension of Spark’s core API that enables real-time stream processing from several sources, for example Flume and Kafka. Its basic abstraction is the DStream (discretized stream), a continuous sequence of RDDs. Streaming data can also be exposed as a unified, continuous DataFrame abstraction that can be used for interactive and batch queries. It offers scalable, high-throughput and fault-tolerant processing. For more detail on Spark Streaming, see: Spark Streaming Tutorial for Beginners

      4. GraphX:
      GraphX is another example of a specialized data abstraction. It enables developers to analyze social networks and other graphs alongside Excel-like two-dimensional data. For more detail on GraphX, see: Apache Spark GraphX

    • #6460
      DataFlair Team

      RDD: Spark revolves around the concept of a resilient distributed dataset (RDD),
      which is a fault-tolerant collection of elements that can be operated on in parallel.
      There are three ways to create RDDs:
      1) Parallelizing an existing collection in your driver program.
      2) Referencing a dataset in an external storage system,
      such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
      3) Transforming existing RDDs: applying a transformation operation to an existing RDD
      creates a new RDD.
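
The creation routes listed above can be illustrated with a small pure-Python toy model. This is only a sketch of the idea and not Spark's API: `ToyRDD` and its methods are hypothetical names, chosen to echo the real `parallelize`, `map`, `filter` and `collect` calls. It covers parallelizing a local collection and deriving new RDDs from existing ones; reading from external storage is left out because it needs a storage system.

```python
class ToyRDD:
    """Hypothetical toy model of an RDD: read-only, partitioned, lazy."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self._partitions = partitions  # set only for source RDDs
        self._parent = parent          # the RDD this one was derived from
        self._fn = fn                  # per-partition function, run lazily

    @staticmethod
    def parallelize(data, num_partitions=2):
        """Split a local collection into contiguous partitions."""
        items = list(data)
        size = max(1, -(-len(items) // num_partitions))  # ceiling division
        parts = [items[i:i + size] for i in range(0, len(items), size)]
        return ToyRDD(partitions=parts)

    def map(self, f):
        # A transformation returns a NEW ToyRDD; the parent is untouched.
        return ToyRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda part: [x for x in part if pred(x)])

    def _compute(self):
        # Walk back through the parents and recompute partition by partition.
        if self._parent is None:
            return [list(p) for p in self._partitions]
        return [self._fn(p) for p in self._parent._compute()]

    def collect(self):
        """Action: only here is any work actually done."""
        return [x for part in self._compute() for x in part]


numbers = ToyRDD.parallelize(range(1, 7), num_partitions=3)
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(evens_squared.collect())  # [4, 16, 36]
print(numbers.collect())        # the source is unchanged: [1, 2, 3, 4, 5, 6]
```

Because each derived `ToyRDD` remembers its parent and the function applied to it, any result can be recomputed from the source data, which is roughly the idea behind RDD fault tolerance.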

      DataFrame: A DataFrame is an abstraction that gives a schema view of the data,
      that is, a view of the data as columns with column names and type information.
      We can think of the data in a DataFrame as a table in a database.
      As with RDDs, execution on a DataFrame is lazily triggered, which lets Spark optimize the work before running it and can improve performance considerably.
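
To make the schema view and the lazy execution concrete, here is a small pure-Python sketch. It is a toy under stated assumptions, not Spark's implementation: `ToyDataFrame` is a hypothetical class, and its `where`, `select` and `collect` methods only borrow the names of the corresponding DataFrame operations.

```python
class ToyDataFrame:
    """Hypothetical toy: rows plus a schema, with operations queued lazily."""

    def __init__(self, schema, rows, steps=()):
        self.schema = list(schema)   # [(column name, type), ...]
        self._rows = rows            # source rows as tuples
        self._steps = tuple(steps)   # pending operations, not yet run

    def _index(self, name):
        # Resolve a column name to its position using the schema.
        return [n for n, _ in self.schema].index(name)

    def where(self, column, predicate):
        i = self._index(column)
        step = lambda rows: [r for r in rows if predicate(r[i])]
        return ToyDataFrame(self.schema, self._rows, self._steps + (step,))

    def select(self, *columns):
        idx = [self._index(c) for c in columns]
        step = lambda rows: [tuple(r[i] for i in idx) for r in rows]
        new_schema = [self.schema[i] for i in idx]
        return ToyDataFrame(new_schema, self._rows, self._steps + (step,))

    def collect(self):
        """Action: run the queued steps in order and return the rows."""
        rows = list(self._rows)
        for step in self._steps:
            rows = step(rows)
        return rows


people = ToyDataFrame(
    [("name", str), ("age", int)],
    [("Ann", 34), ("Bob", 19), ("Cy", 52)],
)
adults = people.where("age", lambda a: a >= 21).select("name")
print(adults.schema)     # the schema is known before any row is processed
print(adults.collect())  # [('Ann',), ('Cy',)]
```

Nothing is filtered or projected until `collect()` is called. In this toy the queued steps are simply replayed in order, whereas a real engine can use the gap between building the plan and running it to reorder or combine the steps.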

      Spark Streaming

      Spark Streaming is one of the features that lets Spark potentially take on the role of Apache Storm. It mainly enables
      you to create analytical and interactive applications for live streaming data. You can stream the data in, and
      Spark can then run its operations directly on the streamed data.


      MLlib is a machine learning library like Mahout. It is built on top of Spark and supports many machine learning algorithms.
      The key difference from Mahout, which historically ran on MapReduce, is speed: Spark is often quoted as running up to 100 times faster than MapReduce for in-memory workloads.


      For graphs and graph computations, Spark has its own graph computation engine, called GraphX. It is similar to other widely used graph
      processing tools or databases.

      SparkR is an R package that lets R users leverage the power of Spark from the R shell.
