How many abstractions are provided by Apache Spark?
September 20, 2018 at 10:21 pm #6458 | DataFlair Team (Spectator)
List the abstractions of Apache Spark. What are the abstractions of Apache Spark?
September 20, 2018 at 10:21 pm #6459 | DataFlair Team (Spectator)
Apache Spark provides several abstractions:
1. RDD:
RDD stands for Resilient Distributed Dataset. An RDD is a read-only, partitioned collection of records. It is Spark's core abstraction and a fundamental data structure of Spark, and it supports in-memory computation on large clusters in a fault-tolerant manner. For more detailed insights on RDDs, follow this link: Spark RDD – Introduction, Features & Operations of RDD
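A minimal sketch of creating and computing on an RDD, assuming a spark-shell session where sc is the SparkContext (the numbers are only illustrative):

// Build a read-only, partitioned collection of records from a local range.
val nums = sc.parallelize(1 to 1000000, numSlices = 8)
val evens = nums.filter(_ % 2 == 0)    // transformation, evaluated lazily
evens.cache()                          // keep the partitions in memory for reuse
println(evens.count())                 // action, triggers the distributed computation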
2. DataFrames:
A DataFrame is a Dataset organized into named columns. DataFrames are equivalent to a table in a relational database or a data frame in R or Python; in other words, a DataFrame is a relational table with a good optimization technique behind it. It is an immutable, distributed collection of data. By offering a higher-level abstraction, it lets developers impose a structure onto a distributed collection of data. For more detailed insights on DataFrames, refer to this link: Spark SQL DataFrame Tutorial – An Introduction to DataFrame
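A minimal sketch of a DataFrame with named, typed columns, assuming a spark-shell session where spark is the SparkSession (the sample rows are only illustrative):

// Turn a local collection of tuples into a DataFrame with named columns.
import spark.implicits._
val people = Seq(("Alice", 34), ("Bob", 45)).toDF("name", "age")
people.printSchema()               // name: string, age: int
people.filter($"age" > 40).show()  // relational-style query, optimized by Catalyst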
3. Spark Streaming:
Spark Streaming is a core Spark extension that enables real-time stream processing from several sources, for example Flume and Kafka. These sources feed a unified, continuous stream of data that can be used for interactive as well as batch queries. It offers scalable, high-throughput, fault-tolerant processing. For more detailed insights on Spark Streaming, refer to this link: Spark Streaming Tutorial for Beginners
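A minimal sketch of a DStream word count, assuming a socket text source on localhost:9999 (for example, started with nc -lk 9999); Kafka and Flume sources are wired up similarly through their own connectors:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Micro-batch the incoming text every 5 seconds and count words per batch.
val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingSketch")
val ssc = new StreamingContext(conf, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()
ssc.start()
ssc.awaitTermination()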
4. GraphX:
GraphX is another example of a specialized data abstraction. It enables developers to analyze social networks and other graphs alongside Excel-like two-dimensional data. For more detailed insights on GraphX, refer to this link: Apache Spark GraphX
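A minimal sketch of a tiny social graph in GraphX, assuming a spark-shell session where sc is the SparkContext (the users and edges are only illustrative):

import org.apache.spark.graphx.{Edge, Graph}

// Vertices are (id, attribute) pairs; edges carry their own attribute.
val users = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val follows = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(3L, 1L, "follows")))
val graph = Graph(users, follows)
println(graph.inDegrees.collect().mkString(", "))  // how often each user is followed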
September 20, 2018 at 10:21 pm #6460 | DataFlair Team (Spectator)
RDD:
Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel.
There are three ways to create RDDs (a minimal sketch of each path follows below):
1) parallelizing an existing collection in your driver program
2) referencing a dataset in an external storage system,
such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
3) creating a new RDD from an already existing RDD.
By applying a transformation operation to an existing RDD, we can create a new RDD.
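A minimal sketch of all three creation paths, assuming a spark-shell session where sc is the SparkContext; the HDFS path is hypothetical:

val fromCollection = sc.parallelize(Seq(1, 2, 3, 4))     // 1) parallelize a driver collection
val fromStorage = sc.textFile("hdfs:///data/input.txt")  // 2) reference an external dataset
val fromExisting = fromCollection.map(_ * 2)             // 3) transform an existing RDD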
DataFrame:
A DataFrame is an abstraction that gives a schema view of data. That means it gives us a view of the data as columns with column names and type information, so we can think of the data in a DataFrame like a table in a database. Like RDD execution, execution in a DataFrame is lazily triggered, and it offers huge performance gains.
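A minimal sketch of the schema view and lazy execution, assuming spark is the SparkSession and a hypothetical people.json input file:

val df = spark.read.json("people.json")   // column names and types inferred from the data
df.printSchema()                          // the schema view: named, typed columns
val adults = df.filter(df("age") > 18)    // lazy: nothing runs yet
adults.count()                            // action triggers the optimized execution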
Spark Streaming:
Spark Streaming is one of those unique features that have empowered Spark to potentially take over the role of Apache Storm. Spark Streaming mainly enables you to create analytical and interactive applications for live streaming data. You can stream the data in, and Spark can then run its operations on the streamed data itself.
MLlib:
MLlib is a machine learning library, like Mahout. It is built on top of Spark and supports many machine learning algorithms. The key difference from Mahout is that MLlib runs almost 100 times faster than MapReduce.
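A minimal sketch of training a model with MLlib's DataFrame-based API, assuming spark is the SparkSession; the tiny inline dataset is purely illustrative:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors

// (label, feature vector) pairs, turned into a DataFrame for the estimator.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (1.0, Vectors.dense(0.5, 2.2, 0.8))
)).toDF("label", "features")
val model = new LogisticRegression().setMaxIter(10).fit(training)
println(model.coefficients)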
GraphX:
For graphs and graph computations, Spark has its own graph computation engine, called GraphX. It is similar to other widely used graph processing tools and databases.
SparkR:
SparkR is a package for the R language that enables R users to leverage the power of Spark from the R shell.