What are the different ways of representing data in Spark?

This topic has 2 replies, 1 voice, and was last updated 5 years, 7 months ago by DataFlair Team.

Viewing 2 reply threads

Author

Posts
- September 20, 2018 at 10:19 pm #6455
  
  DataFlair Team
  Spectator
  
  How is data represented in Spark?
  In how many ways can we represent data in Spark?
- September 20, 2018 at 10:19 pm #6456
  
  DataFlair Team
  Spectator
  
  There are three ways to represent data in Apache Spark: RDDs, DataFrames, and Datasets. Let’s discuss each of them in detail:
  
  1. RDD
  RDD stands for “Resilient Distributed Dataset”. The RDD is the core abstraction and fundamental data structure of Apache Spark. It is an immutable collection of objects that is partitioned across the nodes of the cluster and computed in parallel. Because RDDs are immutable, we cannot change an existing RDD; instead, we apply two kinds of operations to it: transformations, which produce a new RDD, and actions, which return a result. RDDs support in-memory computation on large clusters in a fault-tolerant manner. There are three ways to create an RDD in Spark: from data in stable storage, from other RDDs, and by parallelizing an existing collection in the driver program.
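
To make the transformation/action distinction concrete, here is a minimal plain-Python sketch. It is not Spark's API: `MiniRDD` and its methods are hypothetical, single-machine stand-ins that only mimic the behaviour described above (an immutable collection, new RDDs derived from existing ones, and work that happens only when an action is called).

```python
from typing import Callable, Iterable, Iterator, List


class MiniRDD:
    """Toy, single-machine stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, compute: Callable[[], Iterator]):
        # The recipe for producing the elements; never modified after creation.
        self._compute = compute

    @classmethod
    def parallelize(cls, data: Iterable) -> "MiniRDD":
        # Snapshot the driver-side collection so later changes cannot leak in.
        snapshot = tuple(data)
        return cls(lambda: iter(snapshot))

    # Transformations: return a NEW MiniRDD and do no work yet.
    def map(self, fn: Callable) -> "MiniRDD":
        return MiniRDD(lambda: (fn(x) for x in self._compute()))

    def filter(self, pred: Callable) -> "MiniRDD":
        return MiniRDD(lambda: (x for x in self._compute() if pred(x)))

    # Actions: run the chain of recipes and return a result to the caller.
    def collect(self) -> List:
        return list(self._compute())

    def count(self) -> int:
        return sum(1 for _ in self._compute())


numbers = MiniRDD.parallelize([1, 2, 3, 4, 5])       # from an existing collection
squares = numbers.map(lambda x: x * x)               # new RDD from an existing one
even_squares = squares.filter(lambda x: x % 2 == 0)  # still nothing computed

result = even_squares.collect()   # action: the whole pipeline runs here
original = numbers.collect()      # the source collection is unchanged
```

Each transformation wraps the previous recipe instead of editing it, which is why `numbers` still yields its original elements after `squares` and `even_squares` have been derived from it.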
  
  2. DataFrame
  In a DataFrame, data is organized into named columns, much like a table in a relational database. Like an RDD, a DataFrame is an immutable distributed collection of data. It lets developers impose a structure onto a distributed collection of data, which allows a higher-level abstraction.
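
As a rough illustration of "imposing a structure" on raw records, the plain-Python sketch below attaches column names to positional tuples and then selects columns by name. The helpers `with_schema` and `select` and the sample data are hypothetical; they only mirror the idea of named columns and are not Spark functions.

```python
from typing import Any, Dict, List, Sequence


def with_schema(rows: Sequence[Sequence[Any]], columns: Sequence[str]) -> List[Dict[str, Any]]:
    """Attach column names to positional records (hypothetical helper)."""
    for row in rows:
        if len(row) != len(columns):
            raise ValueError(f"expected {len(columns)} fields, got {len(row)}")
    return [dict(zip(columns, row)) for row in rows]


def select(table: List[Dict[str, Any]], *names: str) -> List[Dict[str, Any]]:
    """Return a new table that keeps only the requested columns."""
    return [{name: record[name] for name in names} for record in table]


# Unstructured records: the meaning of each position is only in our heads.
raw = [("Alice", 34, "Pune"), ("Bob", 45, "Delhi")]

# Structured view: every field now has a name, as in a database table.
people = with_schema(raw, ["name", "age", "city"])
names_and_ages = select(people, "name", "age")
```

Once the columns have names, operations can be expressed in terms of those names rather than tuple positions, which is the higher-level abstraction the paragraph above refers to.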
  
  3. Dataset
  The Dataset API is an extension of the DataFrame API. It provides a type-safe, object-oriented programming interface and takes advantage of Spark’s Catalyst optimizer by exposing data fields and expressions to a query planner.
- September 20, 2018 at 10:20 pm #6457
  
  DataFlair Team
  Spectator
  
  The different ways of representing data in Spark are:
  
  RDD: Spark revolves around the concept of a resilient distributed dataset (RDD),
  which is a fault-tolerant collection of elements that can be operated on in parallel.
  There are three ways to create RDDs:
  1) Parallelizing an existing collection in your driver program.
  2) Referencing a dataset in an external storage system,
  such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
  3) Deriving from existing RDDs: applying a transformation to an existing RDD creates a new RDD.
  
  DataFrame: A DataFrame is an abstraction that gives a schema view of the data,
  that is, a view of the data as columns with names and type information.
  We can think of the data in a DataFrame as a table in a database.
  Like RDDs, DataFrames are evaluated lazily: execution is triggered only when an action is called.
  DataFrames can offer a large performance improvement over RDDs because of two features:
  1. Custom memory management: Data is stored in a compact binary format, which can be kept in off-heap memory.
  This saves memory and reduces garbage-collection overhead.
  Because the schema is known in advance and the data is already stored in binary form,
  expensive Java serialization is also avoided.
  2. Optimized execution plans: Query plans are built by Spark’s Catalyst optimizer.
  Once an optimized plan has been prepared, the final execution still happens on RDDs internally,
  but that is completely hidden from the user.
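
The point about knowing the schema in advance can be illustrated with a small plain-Python analogy. This is not how Spark lays out rows internally; the two-column schema, the `ROW_FORMAT` string, and the `read_score` helper are assumptions made up for the sketch. It compares a fixed-width binary encoding driven by a known schema with generic object serialization (here `pickle`, standing in for Java serialization).

```python
import pickle
import struct

# Assumed schema for illustration: (id: 32-bit int, score: 64-bit float).
ROW_FORMAT = "<id"                       # little-endian, no padding: 4 + 8 bytes
ROW_SIZE = struct.calcsize(ROW_FORMAT)   # fixed width per row

rows = [(1, 2.5), (2, 3.75), (3, 10.0)]

# Schema-aware encoding: only the field values are stored, back to back.
packed = b"".join(struct.pack(ROW_FORMAT, *row) for row in rows)

# Generic object serialization: type and structure markers travel with the data.
pickled = pickle.dumps(rows)


def read_score(buffer: bytes, index: int) -> float:
    """Read one field straight from the buffer without rebuilding row objects."""
    (score,) = struct.unpack_from("<d", buffer, index * ROW_SIZE + 4)
    return score


second_score = read_score(packed, 1)
```

Because every row has the same known width, a single field can be located by offset arithmetic alone, and the packed buffer is smaller than the pickled list for this data.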
  
  Dataset: Datasets in Apache Spark are an extension of the DataFrame API that provides
  a type-safe, object-oriented programming interface.
  A Dataset takes advantage of Spark’s Catalyst optimizer by exposing expressions and
  data fields to a query planner.
  
  Both Dataset and DataFrame perform their final execution on RDDs internally. The difference
  is that users do not write code to create those RDDs and have no direct control over them.