What is Dataset in Apache Spark?

Viewing 1 reply thread
  • Author
    Posts
    • #6437
      DataFlair Team
      Spectator

      Define Dataset in Apache Spark.
      Describe the Dataset in Apache Spark.

    • #6438
      DataFlair Team
      Spectator

      Introduction to Datasets

      In Apache Spark, the Dataset is an extension of the DataFrame API that offers an object-oriented programming interface. Through Spark SQL, it takes advantage of Spark’s Catalyst optimizer by exposing expressions and data fields to the query planner.

      In Spark SQL, a Dataset is a strongly typed data structure that maps to a relational schema. It represents structured queries together with encoders. The Dataset API was released in Spark 1.6.
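
      As a minimal sketch of what "strongly typed and mapped to a relational schema" can look like in practice, the Scala snippet below builds a Dataset from a case class. The Person class, the sample rows, and the object name are hypothetical, and the sketch assumes Spark 2.x or later, where SparkSession is the entry point (in Spark 1.6 the entry point was SQLContext).

      ```scala
      import org.apache.spark.sql.{Dataset, SparkSession}

      // Hypothetical record type, used only for illustration.
      case class Person(name: String, age: Int)

      object DatasetIntroSketch {
        def main(args: Array[String]): Unit = {
          // Assumption: a local, in-process Spark session is enough for the demo.
          val spark = SparkSession.builder()
            .appName("dataset-intro-sketch")
            .master("local[*]")
            .getOrCreate()
          import spark.implicits._ // brings the implicit encoders for case classes into scope

          // Each element is a Person object; the columns come from its fields.
          val people: Dataset[Person] = Seq(Person("Ann", 34), Person("Bob", 19)).toDS()

          // Print the relational schema Spark derived from the case class.
          people.printSchema()

          // Typed transformation: the lambda receives a Person, not a generic Row.
          val adults: Dataset[Person] = people.filter(p => p.age >= 21)
          adults.show()

          spark.stop()
        }
      }
      ```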

      The encoder is the primary concept in Spark SQL’s serialization and deserialization (SerDe) framework. Encoders handle all translation between JVM objects and Spark’s internal binary format. Spark’s built-in encoders are quite advanced: they even generate bytecode to interact with off-heap data.

      An encoder provides on-demand access to individual attributes without having to deserialize an entire object. Spark SQL uses the SerDe framework to make input and output efficient in both time and space. Because the encoder knows the schema of the records, it can perform both serialization and deserialization.
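
      To illustrate that an encoder carries the schema of the record it handles, the sketch below asks Spark for the built-in encoder of a hypothetical Person case class and prints the schema it holds. It only inspects the encoder; it does not show the generated bytecode or the binary format itself.

      ```scala
      import org.apache.spark.sql.{Encoder, Encoders}

      // Hypothetical record type, used only for illustration.
      case class Person(name: String, age: Int)

      object EncoderSketch {
        def main(args: Array[String]): Unit = {
          // Built-in encoder for case classes (Scala Product types).
          val personEncoder: Encoder[Person] = Encoders.product[Person]

          // The encoder knows the record's schema: field names and Spark SQL types.
          println(personEncoder.schema.treeString)

          // It also knows which JVM class it converts to and from.
          println(personEncoder.clsTag)
        }
      }
      ```

      In everyday code the encoder is usually not created by hand: importing spark.implicits._ supplies it implicitly whenever a Dataset of a case class is built.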

      A Spark Dataset is a structured, lazily evaluated query expression (lazy evaluation): it is executed only when an action is triggered. Internally, a Dataset represents a logical plan, which describes the computation required to produce the data. The logical plan is the base Catalyst query plan, built from logical operators. Once the logical plan has been analyzed and resolved, Spark can form a physical query plan from it.
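
      The lazy behaviour and the plans can be observed directly. In the sketch below (hypothetical Person class and sample data, Spark 2.x or later assumed), the transformations only build up a plan; explain(true) prints the logical and physical plans, and nothing is computed until the collect() action runs.

      ```scala
      import org.apache.spark.sql.{Dataset, SparkSession}

      // Hypothetical record type, used only for illustration.
      case class Person(name: String, age: Int)

      object LazyPlanSketch {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder()
            .appName("lazy-plan-sketch")
            .master("local[*]")
            .getOrCreate()
          import spark.implicits._

          val people: Dataset[Person] = Seq(Person("Ann", 34), Person("Bob", 19)).toDS()

          // Transformations: these only describe the query, no job runs yet.
          val adultNames: Dataset[String] =
            people.filter(p => p.age >= 21).map(p => p.name)

          // Print the parsed, analyzed and optimized logical plans plus the physical plan.
          adultNames.explain(true)

          // Action: this is the point where Spark actually executes the plan.
          val result: Array[String] = adultNames.collect()
          result.foreach(println)

          spark.stop()
        }
      }
      ```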

      As the Dataset was introduced after RDD and DataFrame, it combines the features of both. It offers the following features:

      1. The convenience of RDD.
      2. Performance optimization of DataFrame.
      3. Static type-safety of Scala.

      Hence, Datasets provide a more functional programming interface for working with structured data.
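
      The three features listed above can be seen side by side in a short sketch. The Person class and the misspelled field agee are hypothetical, and Spark 2.x or later is assumed. The typed Dataset call with the typo is left commented out because the Scala compiler rejects it, whereas the equivalent untyped DataFrame call compiles and is expected to fail only when Spark analyzes the query at run time.

      ```scala
      import org.apache.spark.sql.{Dataset, SparkSession}
      import scala.util.Try

      // Hypothetical record type, used only for illustration.
      case class Person(name: String, age: Int)

      object TypeSafetySketch {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder()
            .appName("type-safety-sketch")
            .master("local[*]")
            .getOrCreate()
          import spark.implicits._

          val people: Dataset[Person] = Seq(Person("Ann", 34), Person("Bob", 19)).toDS()

          // RDD-like convenience: plain Scala lambdas over typed objects,
          // still planned and optimized by Spark SQL like a DataFrame query.
          val names: Dataset[String] = people.map(p => p.name.toUpperCase)
          names.show()

          // Static type-safety: a misspelled field is caught by the compiler.
          // people.map(p => p.agee) // does not compile: agee is not a member of Person

          // Untyped DataFrame access: the same typo compiles and surfaces only at run time.
          val attempt = Try(people.toDF().select("agee"))
          println(s"DataFrame select with a misspelled column failed: ${attempt.isFailure}")

          spark.stop()
        }
      }
      ```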

      For more detailed information about Datasets, refer to: Spark Dataset Tutorial – Introduction to Apache Spark Dataset
