Forums › Apache Spark › What are the different ways of representing data in Spark?
September 20, 2018 at 10:19 pm #6455
How is data represented in Spark?
In how many ways can we represent data in Spark?
September 20, 2018 at 10:19 pm #6456
Basically, there are 3 different ways to represent data in Apache Spark: we can represent it through RDDs, through DataFrames, or through Datasets. Let's discuss each of them in detail:
1. Spark RDD APIs
RDD stands for "Resilient Distributed Dataset". The RDD is the core abstraction and fundamental data structure of Apache Spark. It is an immutable collection of objects computed on the different nodes of the cluster. Because RDDs are immutable, we cannot change them in place; instead we apply transformations (which produce new RDDs) and actions (which return results) on them. RDDs perform in-memory computations on large clusters in a fault-tolerant manner. There are three ways to create RDDs in Spark: from data in stable storage, from other RDDs, and by parallelizing an already existing collection in the driver program. Follow this link to learn Spark RDD in great detail.
2. Spark DataFrame APIs
In a DataFrame, data is organized into named columns, similar to a table in a relational database. A DataFrame is also an immutable distributed collection of data. It allows developers to impose a structure onto a distributed collection of data, enabling a higher-level abstraction. Follow this link to learn Spark DataFrame in detail.
3. Spark Dataset APIs
It is an extension of the DataFrame API that provides a type-safe, object-oriented programming interface. It takes advantage of Spark's Catalyst optimizer by exposing data fields and expressions to the query planner. Follow this link to learn Spark Dataset in detail.
September 20, 2018 at 10:20 pm #6457
The different ways of representing data in Spark are:
RDD: Spark revolves around the concept of a resilient distributed dataset (RDD),
which is a fault-tolerant collection of elements that can be operated on in parallel.
There are three ways to create RDDs:
1) parallelizing an existing collection in your driver program;
2) referencing a dataset in an external storage system,
such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat;
3) transforming an existing RDD: applying a transformation operation to an
existing RDD creates a new RDD.
DataFrame: A DataFrame is an abstraction that gives a schema view of data.
That means it gives us a view of the data as columns with column names and type
information; we can think of the data in a DataFrame like a table in a database.
Like RDD execution, DataFrame execution is lazily triggered. DataFrames offer a large
performance improvement over RDDs because of two powerful features:
1. Custom memory management: Data is stored in off-heap memory in a binary format.
This saves a lot of memory space, and there is no garbage-collection overhead involved.
Because the schema of the data is known in advance and stored efficiently in binary format,
expensive Java serialization is also avoided.
2. Optimized execution plans: Query plans are created for execution using the Spark Catalyst
optimizer. After an optimized execution plan is prepared through several steps,
the final execution happens internally on RDDs only, but that is completely hidden from the user.
Dataset: Datasets in Apache Spark are an extension of the DataFrame API that provides
a type-safe, object-oriented programming interface.
A Dataset takes advantage of Spark's Catalyst optimizer by exposing expressions and
data fields to the query planner.
Both Dataset and DataFrame internally perform their final execution on RDD objects, but the
difference is that users do not write code to create the RDD collections and have no direct
control over those RDDs.