Spark SQL DataFrame Tutorial – An Introduction to DataFrame


1. Objective

In this Spark SQL DataFrame tutorial, we will learn what a DataFrame is in Apache Spark and why it is needed. The tutorial covers the limitations of Spark RDD and how DataFrame overcomes them. It also covers how to create a DataFrame in Spark, features of DataFrame such as custom memory management and optimized execution plans, and the limitations of DataFrame.

2. Introduction to Spark SQL DataFrame

DataFrame was introduced in Spark 1.3.0. A DataFrame is a Dataset organized into named columns. It is similar to a table in a relational database or a data frame in R or Python, but with richer optimizations under the hood.

The idea behind DataFrame is that it allows processing of large amounts of structured data. A DataFrame contains rows with a schema. The schema describes the structure of the data, that is, the name and type of each column.

DataFrame in Apache Spark builds on RDD and retains its core properties: immutability, in-memory computation, resilience, and distributed computing capability. In addition, it allows the user to impose a structure onto a distributed collection of data, and so provides a higher-level abstraction.

We can build a DataFrame from different data sources, for example structured data files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in several languages: Scala, Java, Python, and R.

In both Scala and Java, a DataFrame is represented as a Dataset of rows. In the Scala API, DataFrame is a type alias of Dataset[Row]. In the Java API, users use Dataset<Row> to represent a DataFrame.

3. Why DataFrame?

DataFrame is one step ahead of RDD because it provides custom memory management and an optimized execution plan.

a. Custom Memory Management: This is part of Project Tungsten. Data is stored in a compact binary format, which can be kept in off-heap memory, so a lot of memory is saved and garbage-collection overhead is greatly reduced. Because the data is stored in binary format and its schema is known, expensive Java serialization is also avoided.

b. Optimized Execution Plan: This is the work of the query optimizer, Catalyst. It creates an optimized execution plan for each query. Once the optimized plan is created, the final execution takes place on Spark RDDs.

4. Features of Apache Spark DataFrame

Some of the limitations of Spark RDD were:

  • It does not have any built-in optimization engine.
  • There is no provision to handle structured data.

DataFrame was introduced to overcome these limitations. Some of the key features of DataFrame in Spark are:

i. DataFrame is a distributed collection of data organized into named columns. It is equivalent to a table in an RDBMS.

ii. It can deal with both structured and semi-structured data formats and sources, for example Avro, CSV, Elasticsearch, and Cassandra. It also works with storage systems such as HDFS, Hive tables, and MySQL.

iii. Optimization is handled by Catalyst, which provides a general library for representing trees and applying rules to transform them. DataFrame queries go through Catalyst tree transformations in four phases:

  • Analyzing a logical plan to resolve references
  • Logical plan optimization
  • Physical planning
  • Code generation to compile parts of the query to Java bytecode

You can refer to this guide to learn the Spark SQL optimization phases in detail.
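The idea of a rule-based tree transformation can be illustrated with a toy example in plain Python. This is not Catalyst's code: the classes Literal, Attribute, and Add and the function fold_constants are hypothetical stand-ins, and the single rule shown (constant folding) is just one example of the kind of rewrite applied during logical plan optimization.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Literal:
    """A constant value in an expression tree."""
    value: int


@dataclass(frozen=True)
class Attribute:
    """A reference to a column by name."""
    name: str


@dataclass(frozen=True)
class Add:
    """The sum of two sub-expressions."""
    left: "Expr"
    right: "Expr"


Expr = Union[Literal, Attribute, Add]


def fold_constants(expr: Expr) -> Expr:
    """Rewrite Add(Literal, Literal) into one Literal, working bottom-up."""
    if isinstance(expr, Add):
        left = fold_constants(expr.left)
        right = fold_constants(expr.right)
        if isinstance(left, Literal) and isinstance(right, Literal):
            return Literal(left.value + right.value)
        return Add(left, right)
    # Literals and attributes are leaves and are returned unchanged.
    return expr


# The expression x + (1 + 2) as a tree.
tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))

# After the rule is applied, the constant subtree is replaced by 3.
optimized = fold_constants(tree)
print(optimized)
```

The trees are immutable, so a rule returns a new tree instead of modifying the old one, which makes it straightforward to apply several rules one after another.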

iv. The DataFrame API is available in various programming languages, for example Java, Scala, Python, and R.

v. It provides Hive compatibility. We can run unmodified Hive queries on an existing Hive warehouse.

vi. It can scale from kilobytes of data on a single laptop to petabytes of data on a large cluster.

vii. DataFrame provides easy integration with big data tools and frameworks via Spark Core.

5. Creating DataFrames in Apache Spark

The SparkSession class is the entry point to all Spark functionality. To create a basic SparkSession, use:

SparkSession.builder().getOrCreate()

Using SparkSession, an application can create a DataFrame from an existing RDD, a Hive table, or Spark data sources. Spark SQL can operate on a variety of data sources through the DataFrame interface. We can also register a DataFrame as a temporary view and then run SQL queries on its data.

6. Limitations of DataFrame in Spark

  • The Spark SQL DataFrame API does not provide compile-time type safety. Column names and types are checked only at run time, so if the structure of the data is not known, it cannot be manipulated safely.
  • Once a domain object is converted into a DataFrame, the original domain object cannot be regenerated from it.

7. Conclusion

Hence, the DataFrame API in Spark SQL improves the performance and scalability of Spark. It avoids the garbage-collection cost of constructing individual objects for each row in the dataset.

The Spark DataFrame API differs from the RDD API because it is an API for building a relational query plan that Spark's Catalyst optimizer can then optimize and execute. It suits developers who are familiar with building query plans, but it may feel less natural to the majority of developers.

To experiment with DataFrames in Spark, install Apache Spark in standalone mode or on a multi-node cluster.
