Forums › Apache Spark › What is Dataset in Apache Spark?
September 20, 2018 at 10:03 pm #6437 by DataFlair Team (Spectator)
Define Dataset in Apache Spark.
Describe how Datasets work in Apache Spark.
September 20, 2018 at 10:04 pm #6438 by DataFlair Team (Spectator)
Introduction to Datasets
In Apache Spark, Datasets are an extension of the DataFrame API. They offer an object-oriented programming interface, and through Spark SQL they take advantage of Spark's Catalyst optimizer by exposing data fields to the query planner.
In Spark SQL, a Dataset is a strongly typed data structure that maps to a relational schema. It represents structured queries with encoders. The Dataset API was released in Spark 1.6.
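To make this concrete, here is a minimal sketch of creating a typed Dataset from a case class. The `Person` class and the sample values are purely illustrative; the sketch assumes a local Spark installation and is meant to be run as a standalone application or pasted into `spark-shell`.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class, used only for illustration.
case class Person(name: String, age: Int)

object DatasetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DatasetExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // brings the built-in encoders into scope

    // toDS() uses an implicit Encoder[Person] derived from the case class,
    // so the Dataset is strongly typed and mapped to a relational schema.
    val people = Seq(Person("Asha", 29), Person("Ravi", 35)).toDS()
    people.filter(_.age > 30).show()

    spark.stop()
  }
}
```

Note that the lambda passed to `filter` operates on `Person` objects, not on untyped rows, which is the object-oriented interface described above.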
Encoders are the primary concept in Spark SQL's serialization and deserialization (SerDe) framework. They handle all translation between JVM objects and Spark's internal binary format. Spark ships with highly optimized built-in encoders that generate bytecode to interact with off-heap data.
An encoder provides on-demand access to individual attributes without deserializing an entire object, which makes Spark SQL's input and output both time and space efficient. Because the encoder knows the schema of a record, it can perform both serialization and deserialization.
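A short sketch of how an encoder ties a JVM class to a relational schema. The `Record` case class is hypothetical; `Encoders.product` derives an encoder whose schema mirrors the class fields, which is what lets Spark read single attributes out of its binary format without materializing the whole object.

```scala
import org.apache.spark.sql.{Encoder, Encoders}

// Hypothetical case class, used only for illustration.
case class Record(id: Long, label: String)

// Encoders.product derives an Encoder for any case class (a Product type).
val enc: Encoder[Record] = Encoders.product[Record]

// The encoder carries the relational schema inferred from the class fields.
println(enc.schema)
```

Inspecting `enc.schema` shows a `StructType` with one field per case-class attribute, mirroring the mapping between JVM objects and Spark's internal format described above.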
A Spark Dataset is a structured, lazy query expression (lazy evaluation) that is executed only when an action is triggered. Internally, a Dataset is represented by a logical plan, which describes the computation needed to produce the data. The logical plan is a base Catalyst query plan; logical operators combine to form a logical query plan. Once the plan is analyzed and resolved, Spark derives a physical query plan from it.
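The lazy-evaluation behaviour can be observed directly with `explain`. In this sketch (illustrative names, intended for `spark-shell` or a Spark script runner), defining the query builds only a logical plan; `explain(true)` prints the parsed, analyzed, and optimized logical plans plus the physical plan Catalyst derives, and nothing is computed until the `show()` action runs.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class, used only for illustration.
case class Person(name: String, age: Int)

val spark = SparkSession.builder()
  .appName("LazyPlans")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val people = Seq(Person("Asha", 29), Person("Ravi", 17)).toDS()

// Defining the query records a logical plan; no computation happens here.
val adults = people.filter(_.age >= 18).map(_.name)

adults.explain(true) // parsed/analyzed/optimized logical plans + physical plan
adults.show()        // the action that actually triggers execution

spark.stop()
```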
As the Dataset API was introduced after RDDs and DataFrames, it combines the strengths of both. It offers:
1. The convenience of RDDs.
2. The performance optimizations of DataFrames.
3. The static type-safety of Scala.
Hence, Datasets provide a more functional programming interface for working with structured data.
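The type-safety point can be illustrated by contrasting a Dataset with an untyped DataFrame. In this sketch (illustrative names, suitable for `spark-shell`), a misspelled column name on a DataFrame compiles but fails at runtime, while the same typo on a typed Dataset is rejected by the compiler.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class, used only for illustration.
case class Person(name: String, age: Int)

val spark = SparkSession.builder()
  .appName("TypeSafety")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq(Person("Asha", 29)).toDF() // untyped DataFrame
val ds = Seq(Person("Asha", 29)).toDS() // typed Dataset[Person]

// df.select("agee")  // typo compiles, but fails at runtime with an AnalysisException
// ds.map(_.agee)     // typo does not compile: value agee is not a member of Person

val next = ds.map(p => p.age + 1) // type-checked at compile time
next.show()

spark.stop()
```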
For more detail on Datasets, refer to: Spark Dataset Tutorial – Introduction to Apache Spark Dataset