Spark Dataset Tutorial – Introduction to Apache Spark Dataset

1. Objective

This blog on the Apache Spark Dataset explains what a Dataset is in Spark, why Datasets are needed, what an encoder is, and why encoders matter for Datasets. It also covers the features of the Dataset in Apache Spark and how to create one.

2. Introduction to Spark Dataset

A Dataset is a strongly typed data structure in Spark SQL that maps to a relational schema. It represents structured queries and relies on encoders. It is an extension of the DataFrame API and provides both type safety and an object-oriented programming interface. The Dataset API was first released in Spark 1.6.

The encoder is the primary concept in the serialization and deserialization (SerDe) framework of Spark SQL. Encoders translate between JVM objects and Spark’s internal binary format. Spark ships with built-in encoders that generate bytecode to interact with off-heap data.

An encoder provides on-demand access to individual attributes without having to deserialize an entire object. Spark SQL uses this SerDe framework to make input and output efficient in both time and space. Because an encoder knows the schema of a record, it knows how to serialize and deserialize it.
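As a minimal sketch of how an encoder carries the schema of a record, the snippet below derives an encoder for a case class and prints the schema it maps to. The Person class and the object name are assumptions made up for this illustration; Encoders.product is the standard factory for case-class encoders.

```scala
import org.apache.spark.sql.{Encoder, Encoders}

// Hypothetical record type used only for illustration.
case class Person(name: String, age: Int)

object EncoderSchemaExample {
  // Derive an encoder for the case class; no SparkSession is needed for this step.
  val personEncoder: Encoder[Person] = Encoders.product[Person]

  def main(args: Array[String]): Unit = {
    // The encoder knows the relational schema that Person maps to.
    personEncoder.schema.printTreeString()
    // The JVM class the encoder converts to and from.
    println(personEncoder.clsTag)
  }
}
```

The schema lists one column per case-class field, which is what lets Spark read a single attribute without rebuilding the whole object.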

A Spark Dataset is a structured, lazy query expression that executes only when an action is triggered. Internally, a Dataset represents a logical plan that describes the computation needed to produce the data. The logical plan is a Catalyst tree of logical operators. Once this plan has been analyzed and resolved, Spark can optimize it and derive a physical query plan from it.
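The lazy behaviour described above can be sketched as follows. This assumes a local Spark installation; the master URL, the app name, and the object and method names are placeholders chosen for the example.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object LazyDatasetExample {
  // Builds a query but does not run it: filter and map only extend the logical plan.
  def evensDoubled(spark: SparkSession): Dataset[Int] = {
    import spark.implicits._
    val numbers: Dataset[Int] = Seq(1, 2, 3, 4, 5).toDS()
    numbers.filter(_ % 2 == 0).map(_ * 2)
  }

  def main(args: Array[String]): Unit = {
    // local[*] and the app name are placeholder settings for a local run.
    val spark = SparkSession.builder().master("local[*]").appName("lazy-dataset").getOrCreate()

    val query = evensDoubled(spark)

    // Prints the parsed, analyzed, and optimized logical plans and the physical plan.
    query.explain(true)

    // collect() is an action, so this is the point where the query actually runs.
    println(query.collect().mkString(", "))

    spark.stop()
  }
}
```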

A Dataset combines the features of RDD and DataFrame. It provides:

  • The convenience of RDD.
  • Performance optimization of DataFrame.
  • Static type-safety of Scala.

Thus, Datasets provide a more functional programming interface for working with structured data.
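To illustrate this combination, here is a small sketch that uses RDD-style lambdas on typed objects while the query still runs through Spark SQL. The Employee class, its sample rows, and the method name are invented for the example.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type used only for illustration.
case class Employee(name: String, dept: String, salary: Double)

object TypedOpsExample {
  // RDD-like convenience: plain Scala lambdas over typed objects.
  def highEarnerNames(spark: SparkSession, minSalary: Double): Dataset[String] = {
    import spark.implicits._
    val employees: Dataset[Employee] = Seq(
      Employee("Ana", "eng", 120000.0),
      Employee("Ben", "ops", 80000.0),
      Employee("Cara", "eng", 95000.0)
    ).toDS()

    // The compiler checks that `salary` and `name` exist and have the expected types.
    employees.filter(_.salary >= minSalary).map(_.name)
  }

  def main(args: Array[String]): Unit = {
    // local[*] and the app name are placeholder settings for a local run.
    val spark = SparkSession.builder().master("local[*]").appName("typed-ops").getOrCreate()
    highEarnerNames(spark, 90000.0).show()
    spark.stop()
  }
}
```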

3. Need of Dataset in Spark

The Dataset emerged to overcome the limitations of RDD and DataFrame. DataFrame offered no compile-time type safety, so data could not be manipulated safely without knowing its structure in advance. RDD offered no automatic optimization, so any optimization had to be done by hand.

4. Features of Dataset in Spark

After this introduction to the Dataset, let’s now discuss the various features of the Spark Dataset:

a. Optimized Query

The Dataset provides optimized queries through the Catalyst Query Optimizer and Tungsten. The Catalyst Query Optimizer is an execution-agnostic framework that represents and manipulates a data-flow graph, which is a tree of expressions and relational operators. Tungsten improves execution by optimizing the Spark job for the hardware architecture of the platform on which Apache Spark runs.

b. Analysis at compile time

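With typed Dataset operations, both syntax errors and analysis errors (for example, a reference to a field that does not exist) are caught at compile time. With a DataFrame, analysis errors surface only at runtime, and with regular SQL query strings both syntax and analysis errors surface only at runtime.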
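The difference can be sketched as below. The Person class and the misspelled column name "agee" are made up for the illustration, and the sketch assumes a classic (non-Connect) Spark session, where an unresolved column is reported as soon as the DataFrame is built.

```scala
import org.apache.spark.sql.{AnalysisException, DataFrame, Dataset, SparkSession}
import scala.util.{Failure, Success, Try}

// Hypothetical record type used only for illustration.
case class Person(name: String, age: Int)

object CompileTimeCheckExample {
  // Typed operation: the compiler verifies that `age` exists on Person.
  // Writing people.map(_.agee + 1) instead would be rejected at compile time.
  def agesNextYear(spark: SparkSession, people: Dataset[Person]): Dataset[Int] = {
    import spark.implicits._
    people.map(_.age + 1)
  }

  // Untyped operation: the misspelled column name compiles fine
  // and is only detected when Spark analyzes the query at runtime.
  def selectMisspelled(df: DataFrame): Try[DataFrame] = Try(df.select("agee"))

  def main(args: Array[String]): Unit = {
    // local[*] and the app name are placeholder settings for a local run.
    val spark = SparkSession.builder().master("local[*]").appName("compile-check").getOrCreate()
    import spark.implicits._

    val people = Seq(Person("Ana", 34), Person("Ben", 27)).toDS()
    agesNextYear(spark, people).show()

    selectMisspelled(people.toDF()) match {
      case Failure(e: AnalysisException) => println(s"Runtime analysis error: ${e.getMessage}")
      case Failure(other)                => println(s"Other failure: $other")
      case Success(_)                    => println("Unexpectedly resolved")
    }

    spark.stop()
  }
}
```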

c. Persistent Storage

Spark Datasets are both serializable and queryable, so we can save them to persistent storage.
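One way to do this, sketched under the assumption that Parquet is an acceptable storage format, is to write the Dataset out and read it back as the same type. The City class, the temporary directory, and the method name are placeholders for the example.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type used only for illustration.
case class City(name: String, population: Long)

object PersistDatasetExample {
  // Writes the Dataset as Parquet under `path` and reads it back as Dataset[City].
  def roundTrip(spark: SparkSession, cities: Dataset[City], path: String): Dataset[City] = {
    import spark.implicits._
    cities.write.mode("overwrite").parquet(path)
    spark.read.parquet(path).as[City]
  }

  def main(args: Array[String]): Unit = {
    // local[*] and the app name are placeholder settings for a local run.
    val spark = SparkSession.builder().master("local[*]").appName("persist-dataset").getOrCreate()
    import spark.implicits._

    // A throwaway location; a real job would use a durable path instead.
    val path = Files.createTempDirectory("cities-ds").resolve("cities.parquet").toString

    val cities = Seq(City("Oslo", 700000L), City("Lima", 10000000L)).toDS()
    roundTrip(spark, cities, path).show()

    spark.stop()
  }
}
```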

d. Inter-convertible

We can convert a type-safe Dataset to an “untyped” DataFrame, and build either one from local or distributed collections. For this purpose, DatasetHolder provides three methods for converting a Seq[T] or RDD[T] to a Dataset[T] or a DataFrame:

  • toDS(): Dataset[T]
  • toDF(): DataFrame
  • toDF(colNames: String*): DataFrame
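The three conversions listed above can be sketched as follows. They become available through import spark.implicits._; the Person class and the sample values are assumptions for the example.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type used only for illustration.
case class Person(name: String, age: Int)

object ConversionExample {
  // Builds a DataFrame from tuples and names its columns explicitly.
  def namedFrame(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("Ana", 34), ("Ben", 27)).toDF("name", "age")
  }

  def main(args: Array[String]): Unit = {
    // local[*] and the app name are placeholder settings for a local run.
    val spark = SparkSession.builder().master("local[*]").appName("conversions").getOrCreate()
    import spark.implicits._

    // Seq[T] to Dataset[T]
    val fromSeq: Dataset[Person] = Seq(Person("Ana", 34), Person("Ben", 27)).toDS()

    // RDD[T] to Dataset[T]
    val fromRdd: Dataset[Person] = spark.sparkContext.parallelize(Seq(Person("Cara", 41))).toDS()

    // Typed Dataset to untyped DataFrame; column names come from the case-class fields.
    val untyped: DataFrame = fromSeq.toDF()

    // Untyped DataFrame back to a typed Dataset, matching columns to fields by name.
    val typedAgain: Dataset[Person] = namedFrame(spark).as[Person]

    fromRdd.show()
    untyped.printSchema()
    typedAgain.show()

    spark.stop()
  }
}
```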

e. Faster Computation

Dataset operations generally run faster than the equivalent RDD code, which improves the performance of the system. To reach the same performance with an RDD, the user has to work out manually how to express the computation so that it parallelizes optimally.

f. Less Memory Consumption

Because Spark knows the structure of the data in a Dataset, it can create a more compact layout in memory when caching.
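A minimal caching sketch is shown below. It only demonstrates the calls involved, not the memory savings themselves; the sample data and names are assumptions for the example.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object CacheDatasetExample {
  // Marks a Dataset for caching and runs one action so the cache is populated.
  def cachedNumbers(spark: SparkSession): Dataset[Int] = {
    import spark.implicits._
    val numbers: Dataset[Int] = Seq.range(1, 1001).toDS()
    numbers.cache()   // lazy: only registers the Dataset for caching
    numbers.count()   // action: materializes the cached data
    numbers
  }

  def main(args: Array[String]): Unit = {
    // local[*] and the app name are placeholder settings for a local run.
    val spark = SparkSession.builder().master("local[*]").appName("cache-dataset").getOrCreate()

    val numbers = cachedNumbers(spark)
    // Reports the storage level that the cached Dataset uses.
    println(numbers.storageLevel)

    // Release the cached data once it is no longer needed.
    numbers.unpersist()
    spark.stop()
  }
}
```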

g. Single API for Java and Scala

The Dataset provides a single interface for Java and Scala. This unification means that users of either language work against the same API, and code examples carry over between the two languages. It also reduces the burden on library authors, who no longer have to deal with two different types of input.

5. Creating Dataset

To create Dataset we need:

a. SparkSession

SparkSession is the entry point to Spark SQL. It is the very first object we create when developing Spark SQL applications with the fully typed Dataset abstraction. We create an instance of SparkSession using SparkSession.Builder, and we stop it with the stop method (spark.stop).
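A minimal sketch of creating and stopping a session follows. The app name and the local[*] master are placeholder values for a local run; a cluster deployment would normally supply the master through its launcher instead.

```scala
import org.apache.spark.sql.SparkSession

object SessionExample {
  // Returns the existing session if there is one, otherwise builds a new one.
  def createSession(): SparkSession =
    SparkSession.builder()
      .appName("dataset-tutorial") // placeholder application name
      .master("local[*]")          // placeholder: run locally on all cores
      .getOrCreate()

  def main(args: Array[String]): Unit = {
    val spark = createSession()
    println(s"Running Spark ${spark.version}")
    // Release the session's resources when the application is done.
    spark.stop()
  }
}
```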

b. QueryExecution

QueryExecution represents the structured query execution pipeline of a Dataset. To access the QueryExecution of a Dataset, use its queryExecution attribute. We get a QueryExecution by executing a logical plan in a SparkSession:

executePlan(plan: LogicalPlan): QueryExecution

executePlan executes the input LogicalPlan to produce a QueryExecution in the current SparkSession.
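As a sketch of inspecting this pipeline from user code, the snippet below reads the plans exposed by the queryExecution attribute. Note that queryExecution is a developer-facing API, so its details may differ between Spark versions; the sample query and names are assumptions for the example.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object QueryExecutionExample {
  // A small query whose execution pipeline we want to inspect.
  def sampleQuery(spark: SparkSession): Dataset[Int] = {
    import spark.implicits._
    Seq(3, 1, 2).toDS().filter(_ > 1)
  }

  def main(args: Array[String]): Unit = {
    // local[*] and the app name are placeholder settings for a local run.
    val spark = SparkSession.builder().master("local[*]").appName("query-execution").getOrCreate()

    val qe = sampleQuery(spark).queryExecution

    println(qe.logical)       // logical plan as built by the Dataset operations
    println(qe.analyzed)      // plan after analysis and resolution
    println(qe.optimizedPlan) // plan after Catalyst optimization
    println(qe.executedPlan)  // physical plan that will be executed

    spark.stop()
  }
}
```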

c. Encoder

An encoder converts between the tabular representation and JVM objects. With the help of an encoder, Spark serializes objects for processing or for transmission over the network.
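Encoders are usually supplied implicitly through import spark.implicits._, but they can also be passed by hand, which makes their role in creating a Dataset visible. The sketch below assumes a hypothetical Person class and uses the factory methods on Encoders.

```scala
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

// Hypothetical record type used only for illustration.
case class Person(name: String, age: Int)

object ExplicitEncoderExample {
  // Creates a Dataset[Person] and maps it to ages, passing each encoder explicitly.
  def ages(spark: SparkSession): Dataset[Int] = {
    val people: Dataset[Person] =
      spark.createDataset(Seq(Person("Ana", 34), Person("Ben", 27)))(Encoders.product[Person])
    // The result type Int needs its own encoder.
    people.map((p: Person) => p.age)(Encoders.scalaInt)
  }

  def main(args: Array[String]): Unit = {
    // local[*] and the app name are placeholder settings for a local run.
    val spark = SparkSession.builder().master("local[*]").appName("explicit-encoders").getOrCreate()

    // A Dataset of strings built with the built-in string encoder.
    val words: Dataset[String] = spark.createDataset(Seq("spark", "dataset"))(Encoders.STRING)
    words.show()

    ages(spark).show()
    spark.stop()
  }
}
```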

6. Conclusion

In conclusion, the Dataset is a strongly typed data structure in Apache Spark that represents structured queries. It fuses the functionality of RDD and DataFrame. Queries written against a Dataset are optimized automatically, and the Dataset reduces memory consumption and provides a single API for both Java and Scala.

If you liked this post, or feel that I have missed any point, do let me know by leaving a comment.
