How many abstractions are provided by Apache Spark?
September 20, 2018 at 10:21 pm #6458 | DataFlair Team (Spectator)
List the abstractions of Apache Spark. What are the abstractions of Apache Spark?
September 20, 2018 at 10:21 pm #6459 | DataFlair Team (Spectator)
Apache Spark provides several abstractions:
1. RDD:
RDD stands for Resilient Distributed Dataset. An RDD is a read-only, partitioned collection of records. It is Spark's core abstraction and a fundamental data structure of Spark, and it supports in-memory computation on large clusters in a fault-tolerant manner. For more detailed insights on RDDs, follow this link: Spark RDD – Introduction, Features & Operations of RDD
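A minimal sketch of creating and computing on an RDD, assuming a spark-shell session where sc is the SparkContext (the numbers are only illustrative):

// Build a read-only, partitioned collection of records from a local range.
val nums = sc.parallelize(1 to 1000000, numSlices = 8)
val evens = nums.filter(_ % 2 == 0)    // transformation, evaluated lazily
evens.cache()                          // keep the partitions in memory for reuse
println(evens.count())                 // action, triggers the distributed computation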
2. DataFrames:
A DataFrame is a Dataset organized into named columns. DataFrames are equivalent to a table in a relational database or a data frame in R or Python; in other words, a DataFrame is a relational table with a good optimization technique behind it. It is an immutable, distributed collection of data. By offering a higher-level abstraction, it lets developers impose a structure onto a distributed collection of data. For more detailed insights on DataFrames, refer to this link: Spark SQL DataFrame Tutorial – An Introduction to DataFrame
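A minimal sketch of a DataFrame with named, typed columns, assuming a spark-shell session where spark is the SparkSession (the sample rows are only illustrative):

// Turn a local collection of tuples into a DataFrame with named columns.
import spark.implicits._
val people = Seq(("Alice", 34), ("Bob", 45)).toDF("name", "age")
people.printSchema()               // name: string, age: int
people.filter($"age" > 40).show()  // relational-style query, optimized by Catalyst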
3. Spark Streaming:
Spark Streaming is a core Spark extension that enables real-time stream processing from several sources, for example Flume and Kafka. These sources feed a unified, continuous stream of data that can be used for interactive as well as batch queries. It offers scalable, high-throughput, fault-tolerant processing. For more detailed insights on Spark Streaming, refer to this link: Spark Streaming Tutorial for Beginners
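A minimal sketch of a DStream word count, assuming a socket text source on localhost:9999 (for example, started with nc -lk 9999); Kafka and Flume sources are wired up similarly through their own connectors:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Micro-batch the incoming text every 5 seconds and count words per batch.
val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingSketch")
val ssc = new StreamingContext(conf, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()
ssc.start()
ssc.awaitTermination()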
4. GraphX:
GraphX is another example of a specialized data abstraction. It enables developers to analyze social networks and other graphs alongside Excel-like two-dimensional data. For more detailed insights on GraphX, refer to this link: Apache Spark GraphX
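A minimal sketch of a tiny social graph in GraphX, assuming a spark-shell session where sc is the SparkContext (the users and edges are only illustrative):

import org.apache.spark.graphx.{Edge, Graph}

// Vertices are (id, attribute) pairs; edges carry their own attribute.
val users = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val follows = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(3L, 1L, "follows")))
val graph = Graph(users, follows)
println(graph.inDegrees.collect().mkString(", "))  // how often each user is followed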
September 20, 2018 at 10:21 pm #6460 | DataFlair Team (Spectator)
RDD:
Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel.
There are three ways to create RDDs (a minimal sketch of each path follows below):
1) parallelizing an existing collection in your driver program
2) referencing a dataset in an external storage system,
such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
3) creating a new RDD from an already existing RDD.
By applying a transformation operation to an existing RDD, we can create a new RDD.
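A minimal sketch of all three creation paths, assuming a spark-shell session where sc is the SparkContext; the HDFS path is hypothetical:

val fromCollection = sc.parallelize(Seq(1, 2, 3, 4))     // 1) parallelize a driver collection
val fromStorage = sc.textFile("hdfs:///data/input.txt")  // 2) reference an external dataset
val fromExisting = fromCollection.map(_ * 2)             // 3) transform an existing RDD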
DataFrame:
A DataFrame is an abstraction that gives a schema view of data. That means it gives us a view of the data as columns with column names and type information, so we can think of the data in a DataFrame like a table in a database. Like RDD execution, execution in a DataFrame is lazily triggered, and it offers huge performance gains.
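A minimal sketch of the schema view and lazy execution, assuming spark is the SparkSession and a hypothetical people.json input file:

val df = spark.read.json("people.json")   // column names and types inferred from the data
df.printSchema()                          // the schema view: named, typed columns
val adults = df.filter(df("age") > 18)    // lazy: nothing runs yet
adults.count()                            // action triggers the optimized execution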
Spark Streaming:
Spark Streaming is one of those unique features that have empowered Spark to potentially take over the role of Apache Storm. Spark Streaming mainly enables you to create analytical and interactive applications for live streaming data. You can stream the data in, and Spark can then run its operations on the streamed data itself.
MLlib:
MLlib is a machine learning library, like Mahout. It is built on top of Spark and supports many machine learning algorithms. The key difference from Mahout is that MLlib runs almost 100 times faster than MapReduce.
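A minimal sketch of training a model with MLlib's DataFrame-based API, assuming spark is the SparkSession; the tiny inline dataset is purely illustrative:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors

// (label, feature vector) pairs, turned into a DataFrame for the estimator.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (1.0, Vectors.dense(0.5, 2.2, 0.8))
)).toDF("label", "features")
val model = new LogisticRegression().setMaxIter(10).fit(training)
println(model.coefficients)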
GraphX:
For graphs and graph computations, Spark has its own graph computation engine, called GraphX. It is similar to other widely used graph processing tools and databases.
SparkR:
SparkR is a package for the R language that enables R users to leverage the power of Spark from the R shell.