{"id":2576,"date":"2017-05-10T12:49:16","date_gmt":"2017-05-10T12:49:16","guid":{"rendered":"http:\/\/data-flair.training\/blogs\/?p=2576"},"modified":"2019-02-19T12:01:28","modified_gmt":"2019-02-19T06:31:28","slug":"apache-spark-rdd-vs-dataframe-vs-dataset","status":"publish","type":"post","link":"https:\/\/data-flair.training\/blogs\/apache-spark-rdd-vs-dataframe-vs-dataset\/","title":{"rendered":"Apache Spark RDD vs DataFrame vs DataSet"},"content":{"rendered":"<h2>1. Objective<\/h2>\n<p>This Spark tutorial will provide you the detailed feature wise comparison between<strong><a href=\"http:\/\/data-flair.training\/blogs\/apache-spark-tutorial-quickstart-introduction\/\"> Apache Spark <\/a><\/strong>RDD vs DataFrame vs DataSet. We will cover the brief introduction of Spark APIs i.e. RDD, DataFrame and Dataset, Differences between these Spark API based on various features. For example, Data Representation, Immutability, and Interoperability etc. We will also illustrate, where to use RDD, DataFrame API, and Dataset API of Spark.<br \/>\nLearn easy steps to<strong><a href=\"http:\/\/data-flair.training\/blogs\/apache-spark-installation-in-standalone-mode\/\"> Install Apache Spark\u00a0on the single node<\/a><\/strong> and on<strong> <a href=\"http:\/\/data-flair.training\/blogs\/apache-spark-installation-on-multi-node-cluster-step-by-step-guide\/\">Multi-node cluster<\/a><\/strong>.<\/p>\n<div id=\"attachment_42335\" style=\"width: 1210px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/05\/Apache-Spark-RDD-vs-DataFrame-vs-DataSet-1.jpg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-42335\" class=\"size-full wp-image-42335\" src=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/05\/Apache-Spark-RDD-vs-DataFrame-vs-DataSet-1.jpg\" alt=\"Apache Spark RDD vs DataFrame vs DataSet\" width=\"1200\" height=\"628\" 
srcset=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/05\/Apache-Spark-RDD-vs-DataFrame-vs-DataSet-1.jpg 1200w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/05\/Apache-Spark-RDD-vs-DataFrame-vs-DataSet-1-150x79.jpg 150w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/05\/Apache-Spark-RDD-vs-DataFrame-vs-DataSet-1-300x157.jpg 300w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/05\/Apache-Spark-RDD-vs-DataFrame-vs-DataSet-1-768x402.jpg 768w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/05\/Apache-Spark-RDD-vs-DataFrame-vs-DataSet-1-1024x536.jpg 1024w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/05\/Apache-Spark-RDD-vs-DataFrame-vs-DataSet-1-520x272.jpg 520w\" sizes=\"auto, (max-width: 1200px) 100vw, 1200px\" \/><\/a><p id=\"caption-attachment-42335\" class=\"wp-caption-text\">Apache Spark RDD vs DataFrame vs DataSet<\/p><\/div>\n<h2>2. Apache Spark APIs &#8211;\u00a0RDD, DataFrame, and DataSet<\/h2>\n<p>Before comparing Spark\u00a0RDD, DataFrame, and Dataset, let us briefly introduce each of them:<\/p>\n<ul>\n<li><strong>Spark RDD APIs &#8211;\u00a0<\/strong>RDD stands for Resilient Distributed Dataset. It is a read-only, partitioned collection of records. RDD is the fundamental data structure of Spark. It allows a programmer to perform in-memory computations on large clusters in a <strong><a href=\"http:\/\/data-flair.training\/blogs\/apache-spark-streaming-fault-tolerance\/\">fault-tolerant<\/a><\/strong> manner, which speeds up processing. Follow this link to<strong> <a href=\"http:\/\/data-flair.training\/blogs\/rdd-in-apache-spark\/\">learn Spark RDD in great detail<\/a>.<\/strong><\/li>\n<li><strong>Spark DataFrame APIs &#8211;\u00a0<\/strong>Unlike an RDD, a DataFrame organizes data into named columns. 
It resembles a table in a relational database.\u00a0It is an immutable distributed collection of data. DataFrame in Spark allows developers to impose a structure onto a distributed collection of data, allowing higher-level abstraction. Follow this link to<strong><a href=\"http:\/\/data-flair.training\/blogs\/apache-spark-sql-dataframe-tutorial\/\"> learn Spark DataFrame in detail.<\/a><\/strong><\/li>\n<li><strong>Spark Dataset APIs &#8211;\u00a0<\/strong>Datasets in Apache Spark are an extension of the DataFrame API that provides a type-safe, object-oriented programming interface. Dataset takes advantage of Spark\u2019s Catalyst optimizer by exposing expressions and data fields to a query planner. Follow this link to <strong><a href=\"http:\/\/data-flair.training\/blogs\/apache-spark-dataset-tutorial\/\">learn Spark DataSet in detail.<\/a><\/strong><\/li>\n<\/ul>\n<h2>3. RDD vs Dataframe vs DataSet in Apache Spark<\/h2>\n<p>Let us now look at the feature-wise differences between the RDD, DataFrame, and Dataset APIs in Spark:<\/p>\n<h3>3.1. Spark Release<\/h3>\n<ul>\n<li><strong>RDD &#8211;<\/strong>\u00a0The <strong><a href=\"https:\/\/data-flair.training\/blogs\/rdd-in-apache-spark\/\">RDD<\/a><\/strong> APIs have been part of Spark since the 1.0 release.<\/li>\n<li><strong>DataFrames &#8211;<\/strong>\u00a0Spark introduced DataFrames in the Spark 1.3 release.<\/li>\n<li><strong>DataSet &#8211;\u00a0<\/strong>Spark introduced Datasets in the Spark 1.6 release.<\/li>\n<\/ul>\n<h3>3.2. Data Representation<\/h3>\n<ul>\n<li><strong>RDD &#8211;<\/strong>\u00a0An RDD is a distributed collection of data elements spread across many machines in the cluster. RDDs are a set of Java or <strong><a href=\"http:\/\/data-flair.training\/blogs\/why-you-should-learn-scala-introductory-tutorial\/\">Scala<\/a><\/strong> objects representing data.<\/li>\n<li><strong>DataFrame &#8211;<\/strong>\u00a0A DataFrame is a distributed collection of data organized into named columns. 
It is conceptually equivalent to a table in a relational database.<\/li>\n<li><strong>DataSet &#8211;<\/strong>\u00a0It is an extension of the DataFrame API that combines the type-safe, object-oriented programming interface of the RDD API with the performance benefits of the Catalyst query optimizer and the off-heap storage mechanism of the DataFrame API.<\/li>\n<\/ul>\n<h3>3.3. Data Formats<\/h3>\n<ul>\n<li><strong>RDD &#8211;<\/strong>\u00a0It can easily and efficiently process both structured and unstructured data. But unlike DataFrames and Datasets, an RDD does not infer the schema of the ingested data and requires the user to specify it.<\/li>\n<li><strong>DataFrame &#8211;<\/strong>\u00a0It works only on structured and semi-structured data. It organizes the data into named columns. DataFrames allow Spark to manage the schema.<\/li>\n<li><strong>DataSet &#8211;<\/strong> It also efficiently processes structured and unstructured data. It represents data as JVM objects, either a single row or a collection of row objects, which encoders map to a tabular form.<\/li>\n<\/ul>\n<h3>3.4. Data Sources API<\/h3>\n<ul>\n<li><strong>RDD &#8211;<\/strong> The data source API allows an RDD to come from any data source, e.g. a text file or a database via JDBC, and to easily handle data with no predefined structure.<\/li>\n<li><strong>DataFrame &#8211;<\/strong> The data source API allows data processing in different formats (AVRO, CSV, JSON) and storage systems (<strong><a href=\"http:\/\/data-flair.training\/blogs\/apache-hadoop-hdfs-introduction-tutorial\/\">HDFS<\/a><\/strong>, <strong><a href=\"http:\/\/data-flair.training\/blogs\/apache-hive-tutorial-introductory-guide\/\">HIVE <\/a><\/strong>tables, MySQL). It can read from and write to the data sources mentioned above.<\/li>\n<li><strong>DataSet &#8211;<\/strong> The Dataset API of Spark also supports data from different sources.<\/li>\n<\/ul>\n<h3>3.5. 
Immutability and Interoperability<\/h3>\n<ul>\n<li><strong>RDD &#8211;<\/strong> An RDD contains a partitioned collection of records. The basic unit of parallelism in an RDD is the partition. Each partition is one logical division of data which is immutable and created through some transformation on existing partitions. Immutability helps to achieve consistency in computations. We can move from an RDD to a DataFrame (if the RDD is in tabular format) with the <strong>toDF()<\/strong> method, or do the reverse with the <strong>.rdd<\/strong> method. Learn various<strong><a href=\"http:\/\/data-flair.training\/blogs\/rdd-transformations-actions-apis-apache-spark\/\"> RDD Transformations and Actions APIs with examples<\/a><\/strong>.<\/li>\n<li><strong>DataFrame &#8211;\u00a0<\/strong>After transforming data into a DataFrame, one\u00a0cannot regenerate the domain object. For example, if you generate testDF from testRDD, then you won\u2019t be able to recover the original RDD of the test class.<\/li>\n<li><strong>DataSet &#8211;<\/strong> It overcomes this DataFrame limitation: the original typed RDD can be regenerated from a Dataset.<br \/>\nDatasets also allow you to convert your existing RDDs and DataFrames into Datasets.<\/li>\n<\/ul>\n<h3>3.6. Compile-time type safety<\/h3>\n<ul>\n<li><strong>RDD &#8211;<\/strong> RDD provides a familiar object-oriented programming style with compile-time type safety.<\/li>\n<li><strong>DataFrame &#8211;\u00a0<\/strong>If you try to access a column that does not exist in the table, the DataFrame API does not raise a compile-time error. It detects the attribute error only at runtime.<\/li>\n<li><strong>DataSet &#8211;\u00a0<\/strong> It provides compile-time type safety.<\/li>\n<\/ul>\n<p><strong>Learn:<\/strong> <a href=\"https:\/\/data-flair.training\/blogs\/apache-spark-vs-hadoop-mapreduce\/\"><strong>Apache Spark vs. Hadoop MapReduce<\/strong><\/a><\/p>\n<h3>3.7. 
Optimization<\/h3>\n<ul>\n<li><strong>RDD &#8211;<\/strong> No inbuilt optimization engine is available for RDDs. When working with structured data, RDDs cannot take advantage of Spark\u2019s advanced optimizers, such as the Catalyst optimizer and the Tungsten execution engine. Developers\u00a0optimize each RDD on the basis of its attributes.<\/li>\n<li><strong>DataFrame &#8211;<\/strong> Optimization takes place using the Catalyst optimizer. DataFrames use the Catalyst tree transformation framework in four phases: a) Analyzing a logical plan to resolve references. b) Logical plan optimization. c) Physical planning. d) Code generation to compile parts of the query to Java bytecode. A brief overview of the optimization phases is given in the figure below:<\/li>\n<\/ul>\n<div id=\"attachment_2613\" style=\"width: 1610px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/05\/Spark-SQL-Optimization.jpg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-2613\" class=\"wp-image-2613 size-full\" src=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/05\/Spark-SQL-Optimization.jpg\" alt=\"Spark-SQL-Optimization\" width=\"1600\" height=\"840\" srcset=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/05\/Spark-SQL-Optimization.jpg 1600w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/05\/Spark-SQL-Optimization-150x79.jpg 150w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/05\/Spark-SQL-Optimization-300x158.jpg 300w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/05\/Spark-SQL-Optimization-768x403.jpg 768w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/05\/Spark-SQL-Optimization-1024x538.jpg 1024w\" sizes=\"auto, (max-width: 1600px) 100vw, 1600px\" \/><\/a><p id=\"caption-attachment-2613\" 
class=\"wp-caption-text\">Spark-SQL-Optimization<\/p><\/div>\n<ul>\n<li><strong>Dataset &#8211;<\/strong>\u00a0It uses the same Catalyst optimizer as DataFrames to optimize the query plan.<\/li>\n<\/ul>\n<h3>3.8. Serialization<\/h3>\n<ul>\n<li><strong>RDD &#8211;<\/strong>\u00a0Whenever <strong><a href=\"http:\/\/data-flair.training\/blogs\/important-apache-spark-terminologies-and-concepts-you-must-know\/\">Spark<\/a> <\/strong>needs to distribute the data within the cluster or write the data to disk, it does so using Java serialization. Serializing individual Java and Scala objects is expensive and requires sending both data and structure between nodes.<\/li>\n<li><strong>DataFrame &#8211;<\/strong>\u00a0A Spark DataFrame can serialize the data into off-heap storage (<a href=\"http:\/\/data-flair.training\/blogs\/apache-spark-in-memory-computing\/\"><strong>in memory<\/strong><\/a>) in binary format and then perform many transformations directly on this off-heap memory, because Spark understands the schema. There is no need to use Java serialization to encode the data. It provides a Tungsten physical execution backend which explicitly manages memory and dynamically generates bytecode for expression evaluation.<\/li>\n<li><strong>DataSet &#8211;<\/strong>\u00a0When it comes to serializing data, the Dataset API in Spark has the concept of an encoder, which handles conversion between JVM objects and the tabular representation. It stores the tabular representation using Spark\u2019s internal Tungsten binary format. A Dataset allows operations to be performed on serialized data, which improves memory use. It allows on-demand access to an individual attribute without deserializing the entire object.<\/li>\n<\/ul>\n<h3>3.9. 
Garbage Collection<\/h3>\n<ul>\n<li><strong>RDD &#8211;<\/strong>\u00a0There is garbage collection overhead that results from creating and destroying individual objects.<\/li>\n<li><strong>DataFrame &#8211;<\/strong>\u00a0It avoids the garbage collection costs of constructing individual objects for each row in the dataset.<\/li>\n<li><strong>DataSet &#8211;\u00a0<\/strong>There is also no need for the garbage collector to destroy objects, because serialization<br \/>\ntakes place through Tungsten, which uses off-heap data serialization.<\/li>\n<\/ul>\n<p><strong>Learn: <a href=\"https:\/\/data-flair.training\/blogs\/apache-spark-terminologies-concepts-you-must-know\/\">Apache Spark Terminologies and Concepts You Must Know<\/a><\/strong><\/p>\n<h3>3.10. Efficiency\/Memory use<\/h3>\n<ul>\n<li><strong>RDD &#8211;<\/strong>\u00a0Efficiency decreases because serialization is performed individually on each Java and Scala object, which takes a lot of time.<\/li>\n<li><strong>DataFrame &#8211;<\/strong>\u00a0Using off-heap memory for serialization reduces the overhead. It generates bytecode dynamically so that many operations can be performed on the serialized data. Small operations need no deserialization.<\/li>\n<li><strong>DataSet &#8211;<\/strong>\u00a0It allows operations to be performed on serialized data, which improves memory use. Thus it allows on-demand access to an individual attribute without deserializing the entire object.<\/li>\n<\/ul>\n<h3>3.11. Lazy Evaluation<\/h3>\n<ul>\n<li><strong>RDD &#8211;<\/strong>\u00a0Spark evaluates RDDs\u00a0lazily. They do not compute their result right away. Instead, they just remember the transformations applied to some base data set. Spark computes transformations only when an action needs a result to be sent to the driver program. 
Refer to this guide if you are new to the <strong><a href=\"http:\/\/data-flair.training\/blogs\/lazy-evaluation-in-apache-spark-guide\/\">Lazy Evaluation feature of Spark<\/a>.<\/strong><\/li>\n<\/ul>\n<div id=\"attachment_3066\" style=\"width: 810px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/06\/apache-spark-lazy-evaluation.gif\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-3066\" class=\"wp-image-3066 size-full\" src=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/06\/apache-spark-lazy-evaluation.gif\" alt=\"Apache Spark Lazy Evaluation Feature.\" width=\"800\" height=\"450\" \/><\/a><p id=\"caption-attachment-3066\" class=\"wp-caption-text\">Apache Spark Lazy Evaluation Feature.<\/p><\/div>\n<ul>\n<li><strong>DataFrame &#8211;<\/strong>\u00a0Spark evaluates DataFrames lazily, which means computation happens only when an action is invoked (such as displaying a result or saving output).<\/li>\n<li><strong>DataSet &#8211;<\/strong>\u00a0It is also evaluated lazily, like RDDs and DataFrames.<\/li>\n<\/ul>\n<h3>3.12. Programming Language Support<\/h3>\n<ul>\n<li><strong>RDD &#8211;<\/strong>\u00a0RDD APIs are available in <strong>Java, Scala, Python,<\/strong> and<strong><a href=\"http:\/\/data-flair.training\/blogs\/r-programming-tutorial\/\"> R<\/a><\/strong> languages. Hence, this feature provides flexibility to the developers.<\/li>\n<li><strong>DataFrame &#8211;<\/strong>\u00a0It also has APIs in different languages such as Java, Python, Scala, and R.<\/li>\n<li><strong>DataSet &#8211;<\/strong>\u00a0The Dataset API is currently only available in Scala and Java. 
As of Spark version 2.1.1, the Dataset API does not support Python or R.<\/li>\n<\/ul>\n<p>Get the <strong><a href=\"http:\/\/data-flair.training\/blogs\/best-scala-books-list\/\">Best Books of Scala<\/a><\/strong> and<strong><a href=\"http:\/\/data-flair.training\/blogs\/r-best-books-to-learn-r\/\"> R<\/a><\/strong> to become a master.<\/p>\n<h3>3.13. Schema Projection<\/h3>\n<ul>\n<li><strong>RDD &#8211;<\/strong>\u00a0With the RDD API, schema projection is explicit. Hence, we need to define the schema manually.<\/li>\n<li><strong>DataFrame &#8211;<\/strong>\u00a0The schema can be auto-discovered from the files and exposed as tables through the <strong><a href=\"http:\/\/data-flair.training\/blogs\/apache-hive-metastore\/\">Hive metastore<\/a><\/strong>. This makes it possible to connect standard SQL clients to the engine and to explore a dataset without defining the schema of its files.<\/li>\n<li><strong>DataSet &#8211;\u00a0<\/strong>It also auto-discovers the schema of the files, because it uses the <strong><a href=\"http:\/\/data-flair.training\/blogs\/spark-sql-tutorial\/\">Spark SQL<\/a> <\/strong>engine.<\/li>\n<\/ul>\n<h3>3.14. Aggregation<\/h3>\n<ul>\n<li><strong>RDD &#8211;<\/strong>\u00a0The RDD API is slower at performing simple grouping and aggregation operations.<\/li>\n<li><strong>DataFrame &#8211;<\/strong>\u00a0The DataFrame API is very easy to use. It is faster for exploratory analysis and for creating aggregated statistics on large data sets.<\/li>\n<li><strong>DataSet &#8211;<\/strong>\u00a0With Datasets, it is faster to perform aggregation operations on large data sets.<\/li>\n<\/ul>\n<p><strong>Learn:<a href=\"https:\/\/data-flair.training\/blogs\/scala-spark-shell-commands\/\"> Spark Shell Commands to Interact with Spark-Scala<\/a><\/strong><\/p>\n<h3>3.15. 
Usage Area<\/h3>\n<p><strong>RDD-<\/strong><\/p>\n<ul>\n<li>Use RDDs when you want low-level transformations and actions on your data set.<\/li>\n<li>Use RDDs for unstructured data, such as media streams or streams of text.<\/li>\n<li>Use RDDs when you want to manipulate your data with functional programming constructs rather than domain-specific expressions.<\/li>\n<li>Use RDDs when you do not care about imposing a schema, such as a columnar format, while processing or accessing data attributes by name or column.<\/li>\n<\/ul>\n<p><strong>DataFrame and DataSet-<\/strong><\/p>\n<ul>\n<li>Use the DataFrame or Dataset API when you need a high level of abstraction.<\/li>\n<li>Use DataFrames or Datasets when you need domain-specific APIs.<\/li>\n<li>Use either Datasets or DataFrames for high-level expressions, for example filters, maps, aggregation, sums, <strong><a href=\"https:\/\/data-flair.training\/blogs\/spark-sql-tutorial\/\">SQL<\/a><\/strong> queries, and columnar access.<\/li>\n<li>In addition, use Datasets if you want a higher degree of type safety at compile time.<\/li>\n<\/ul>\n<h2>4. Conclusion<\/h2>\n<p>The comparison between RDD, DataFrame, and Dataset makes it clear when to use RDDs and when to use DataFrames or Datasets.<br \/>\nRDDs offer low-level functionality and control. DataFrames and Datasets allow a custom view and structure. They offer high-level domain-specific operations, save space, and execute at high speed. 
Select whichever of the DataFrame, Dataset, or RDD APIs meets your needs and experiment with Spark.<br \/>\nIf you liked this post about RDD vs DataFrame vs Dataset, let us know by leaving a comment.<br \/>\n<strong>See also &#8211;\u00a0<\/strong><\/p>\n<ul>\n<li><strong><a href=\"http:\/\/data-flair.training\/blogs\/apache-spark-features\/\">Sparkling Features of Apache Spark<\/a><\/strong><\/li>\n<li><strong><a href=\"http:\/\/data-flair.training\/blogs\/limitations-of-apache-spark-overcome-spark-drawbacks\/\">Limitations of Apache Spark<\/a><\/strong><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>1. Objective This Spark tutorial will provide you the detailed feature wise comparison between Apache Spark RDD vs DataFrame vs DataSet. We will cover the brief introduction of Spark APIs i.e. RDD, DataFrame and&#46;&#46;&#46;<\/p>\n","protected":false},"author":6,"featured_media":42335,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[10],"tags":[898,11349],"class_list":["post-2576","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-spark","tag-apache-spark-apis","tag-rdd-vs-dataframe-vs-dataset"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v28.0 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Apache Spark RDD vs DataFrame vs DataSet - DataFlair<\/title>\n<meta name=\"description\" content=\"Learn comparison between 3 data abstraction in Apache spark RDD vs DataFrame vs dataset performance &amp; usage area of Spark RDD API,DataFrame API,DataSet API.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/data-flair.training\/blogs\/apache-spark-rdd-vs-dataframe-vs-dataset\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" 
content=\"article\" \/>\n<meta property=\"og:title\" content=\"Apache Spark RDD vs DataFrame vs DataSet - DataFlair\" \/>\n<meta property=\"og:description\" content=\"Learn comparison between 3 data abstraction in Apache spark RDD vs DataFrame vs dataset performance &amp; usage area of Spark RDD API,DataFrame API,DataSet API.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/data-flair.training\/blogs\/apache-spark-rdd-vs-dataframe-vs-dataset\/\" \/>\n<meta property=\"og:site_name\" content=\"DataFlair\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DataFlairWS\/\" \/>\n<meta property=\"article:published_time\" content=\"2017-05-10T12:49:16+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2019-02-19T06:31:28+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/05\/Apache-Spark-RDD-vs-DataFrame-vs-DataSet-1.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1200\" \/>\n\t<meta property=\"og:image:height\" content=\"628\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"DataFlair Team\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@DataFlairWS\" \/>\n<meta name=\"twitter:site\" content=\"@DataFlairWS\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"DataFlair Team\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"9 minutes\" \/>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Apache Spark RDD vs DataFrame vs DataSet - DataFlair","description":"Learn comparison between 3 data abstraction in Apache spark RDD vs DataFrame vs dataset performance & usage area of Spark RDD API,DataFrame API,DataSet API.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/data-flair.training\/blogs\/apache-spark-rdd-vs-dataframe-vs-dataset\/","og_locale":"en_US","og_type":"article","og_title":"Apache Spark RDD vs DataFrame vs DataSet - DataFlair","og_description":"Learn comparison between 3 data abstraction in Apache spark RDD vs DataFrame vs dataset performance & usage area of Spark RDD API,DataFrame API,DataSet API.","og_url":"https:\/\/data-flair.training\/blogs\/apache-spark-rdd-vs-dataframe-vs-dataset\/","og_site_name":"DataFlair","article_publisher":"https:\/\/www.facebook.com\/DataFlairWS\/","article_published_time":"2017-05-10T12:49:16+00:00","article_modified_time":"2019-02-19T06:31:28+00:00","og_image":[{"width":1200,"height":628,"url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/05\/Apache-Spark-RDD-vs-DataFrame-vs-DataSet-1.jpg","type":"image\/jpeg"}],"author":"DataFlair Team","twitter_card":"summary_large_image","twitter_creator":"@DataFlairWS","twitter_site":"@DataFlairWS","twitter_misc":{"Written by":"DataFlair Team","Est. 
reading time":"9 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/data-flair.training\/blogs\/apache-spark-rdd-vs-dataframe-vs-dataset\/#article","isPartOf":{"@id":"https:\/\/data-flair.training\/blogs\/apache-spark-rdd-vs-dataframe-vs-dataset\/"},"author":{"name":"DataFlair Team","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/person\/2c58ecb4f73a39f0ef993f1ddfcd7b89"},"headline":"Apache Spark RDD vs DataFrame vs DataSet","datePublished":"2017-05-10T12:49:16+00:00","dateModified":"2019-02-19T06:31:28+00:00","mainEntityOfPage":{"@id":"https:\/\/data-flair.training\/blogs\/apache-spark-rdd-vs-dataframe-vs-dataset\/"},"wordCount":1830,"commentCount":21,"publisher":{"@id":"https:\/\/data-flair.training\/blogs\/#organization"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/apache-spark-rdd-vs-dataframe-vs-dataset\/#primaryimage"},"thumbnailUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/05\/Apache-Spark-RDD-vs-DataFrame-vs-DataSet-1.jpg","keywords":["Apache Spark APIs","Rdd vs Dataframe vs Dataset"],"articleSection":["Apache Spark Tutorials"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/data-flair.training\/blogs\/apache-spark-rdd-vs-dataframe-vs-dataset\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/data-flair.training\/blogs\/apache-spark-rdd-vs-dataframe-vs-dataset\/","url":"https:\/\/data-flair.training\/blogs\/apache-spark-rdd-vs-dataframe-vs-dataset\/","name":"Apache Spark RDD vs DataFrame vs DataSet - 
DataFlair","isPartOf":{"@id":"https:\/\/data-flair.training\/blogs\/#website"},"primaryImageOfPage":{"@id":"https:\/\/data-flair.training\/blogs\/apache-spark-rdd-vs-dataframe-vs-dataset\/#primaryimage"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/apache-spark-rdd-vs-dataframe-vs-dataset\/#primaryimage"},"thumbnailUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/05\/Apache-Spark-RDD-vs-DataFrame-vs-DataSet-1.jpg","datePublished":"2017-05-10T12:49:16+00:00","dateModified":"2019-02-19T06:31:28+00:00","description":"Learn comparison between 3 data abstraction in Apache spark RDD vs DataFrame vs dataset performance & usage area of Spark RDD API,DataFrame API,DataSet API.","breadcrumb":{"@id":"https:\/\/data-flair.training\/blogs\/apache-spark-rdd-vs-dataframe-vs-dataset\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/data-flair.training\/blogs\/apache-spark-rdd-vs-dataframe-vs-dataset\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/data-flair.training\/blogs\/apache-spark-rdd-vs-dataframe-vs-dataset\/#primaryimage","url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/05\/Apache-Spark-RDD-vs-DataFrame-vs-DataSet-1.jpg","contentUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/05\/Apache-Spark-RDD-vs-DataFrame-vs-DataSet-1.jpg","width":1200,"height":628,"caption":"Apache Spark RDD vs DataFrame vs DataSet"},{"@type":"BreadcrumbList","@id":"https:\/\/data-flair.training\/blogs\/apache-spark-rdd-vs-dataframe-vs-dataset\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Blog Home","item":"https:\/\/data-flair.training\/blogs\/"},{"@type":"ListItem","position":2,"name":"Apache Spark Tutorials","item":"https:\/\/data-flair.training\/blogs\/category\/spark\/"},{"@type":"ListItem","position":3,"name":"Apache Spark RDD vs DataFrame vs 
DataSet"}]},{"@type":"WebSite","@id":"https:\/\/data-flair.training\/blogs\/#website","url":"https:\/\/data-flair.training\/blogs\/","name":"DataFlair","description":"Learn Today. Lead Tomorrow.","publisher":{"@id":"https:\/\/data-flair.training\/blogs\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/data-flair.training\/blogs\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/data-flair.training\/blogs\/#organization","name":"DataFlair","url":"https:\/\/data-flair.training\/blogs\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/logo\/image\/","url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2016\/07\/Data-Flair.png","contentUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2016\/07\/Data-Flair.png","width":106,"height":48,"caption":"DataFlair"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DataFlairWS\/","https:\/\/x.com\/DataFlairWS","https:\/\/www.linkedin.com\/company\/dataflair-web-services-pvt-ltd\/","https:\/\/www.youtube.com\/user\/DataFlairWS"]},{"@type":"Person","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/person\/2c58ecb4f73a39f0ef993f1ddfcd7b89","name":"DataFlair Team","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/1ce4a0e3e542444fc73bbebf83e89e8b73e2d95ccb1fcee64da9945f078b97c5?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/1ce4a0e3e542444fc73bbebf83e89e8b73e2d95ccb1fcee64da9945f078b97c5?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/1ce4a0e3e542444fc73bbebf83e89e8b73e2d95ccb1fcee64da9945f078b97c5?s=96&d=mm&r=g","caption":"DataFlair Team"},"description":"The 
DataFlair Team provides industry-driven content on programming, Java, Python, C++, DSA, AI, ML, data Science, Android, Flutter, MERN, Web Development, and technology. Our expert educators focus on delivering value-packed, easy-to-follow resources for tech enthusiasts and professionals.","url":"https:\/\/data-flair.training\/blogs\/author\/dfteam2\/"}]}},"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/2576","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/users\/6"}],"replies":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/comments?post=2576"}],"version-history":[{"count":6,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/2576\/revisions"}],"predecessor-version":[{"id":50201,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/2576\/revisions\/50201"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/media\/42335"}],"wp:attachment":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/media?parent=2576"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/categories?post=2576"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/tags?post=2576"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}