

{"id":5626,"date":"2018-01-17T10:56:25","date_gmt":"2018-01-17T10:56:25","guid":{"rendered":"https:\/\/data-flair.training\/blogs\/?p=5626"},"modified":"2018-09-17T16:28:18","modified_gmt":"2018-09-17T10:58:18","slug":"sparkr-dataframe","status":"publish","type":"post","link":"https:\/\/data-flair.training\/blogs\/sparkr-dataframe\/","title":{"rendered":"SparkR DataFrame and DataFrame Operations"},"content":{"rendered":"<h2>1. Objective<\/h2>\n<p>In this article, we will learn the whole concept of SparkR DataFrame. Further, we will also learn <strong><a href=\"https:\/\/data-flair.training\/blogs\/sparkr\/\">SparkR <\/a><\/strong>DataFrame Operations and how to run SQL queries from SparkR.<\/p>\n<div id=\"attachment_6004\" style=\"width: 1210px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Spark-DataFrame-Operations-in-SparkR-01-1.jpg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-6004\" class=\"wp-image-6004 size-full\" src=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Spark-DataFrame-Operations-in-SparkR-01-1.jpg\" alt=\"SparkR DataFrame operations\" width=\"1200\" height=\"628\" srcset=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Spark-DataFrame-Operations-in-SparkR-01-1.jpg 1200w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Spark-DataFrame-Operations-in-SparkR-01-1-150x79.jpg 150w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Spark-DataFrame-Operations-in-SparkR-01-1-300x157.jpg 300w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Spark-DataFrame-Operations-in-SparkR-01-1-768x402.jpg 768w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Spark-DataFrame-Operations-in-SparkR-01-1-1024x536.jpg 1024w\" sizes=\"auto, (max-width: 1200px) 100vw, 1200px\" 
\/><\/a><p id=\"caption-attachment-6004\" class=\"wp-caption-text\">SparkR DataFrame operations<\/p><\/div>\n<p><strong><a href=\"https:\/\/data-flair.training\/blogs\/apache-spark-online-quiz-part-1\/\">You must test your Spark Learning so far\u00a0<\/a><\/strong><\/p>\n<h2>2. SparkR DataFrame<\/h2>\n<p><span style=\"font-weight: 400\">A SparkDataFrame is a distributed collection of data organized into named columns. Conceptually, it is equivalent to a table in a relational database or a data frame in <strong><a href=\"https:\/\/data-flair.training\/blogs\/r-programming-tutorial\/\">R<\/a><\/strong>. We can construct a <a href=\"https:\/\/data-flair.training\/blogs\/apache-spark-sql-dataframe-tutorial\/\"><strong>DataFrame<\/strong><\/a> from a wide array of sources, for example structured data files, tables in Hive, external databases, or existing local R data frames.<\/span><\/p>\n<h2>3. SparkR DataFrame Operations<\/h2>\n<p><span style=\"font-weight: 400\">SparkDataFrames support many functions for structured data processing.<\/span> Let\u2019s discuss some basic examples:<\/p>\n<h3>i. 
Selecting rows, columns<\/h3>\n<p># Create the SparkDataFrame<br \/>\ndf &lt;- as.DataFrame(faithful)<br \/>\n# Get basic information about the SparkDataFrame<br \/>\ndf<br \/>\n## SparkDataFrame[eruptions:double, waiting:double]<br \/>\n# Select only the &quot;eruptions&quot; column<br \/>\nhead(select(df, df$eruptions))<br \/>\n## \u00a0eruptions<br \/>\n##1 \u00a0\u00a0\u00a0\u00a03.600<br \/>\n##2 \u00a0\u00a0\u00a0\u00a01.800<br \/>\n##3 \u00a0\u00a0\u00a0\u00a03.333<br \/>\n# You can also pass in column names as strings<br \/>\nhead(select(df, &quot;eruptions&quot;))<br \/>\n# Filter the SparkDataFrame to only retain rows with wait times shorter than 50 mins<br \/>\nhead(filter(df, df$waiting &lt; 50))<br \/>\n## \u00a0eruptions waiting<br \/>\n##1 \u00a0\u00a0\u00a0\u00a01.750 \u00a0\u00a0\u00a0\u00a0\u00a047<br \/>\n##2 \u00a0\u00a0\u00a0\u00a01.750 \u00a0\u00a0\u00a0\u00a0\u00a047<br \/>\n##3 \u00a0\u00a0\u00a0\u00a01.867 \u00a0\u00a0\u00a0\u00a0\u00a048<\/p>\n<p><strong><a href=\"https:\/\/data-flair.training\/blogs\/spark-paired-rdd\/\">Have a look at Spark Paired RDD<\/a><\/strong><\/p>\n<h3>ii. Grouping, Aggregation<\/h3>\n<p><span style=\"font-weight: 400\">SparkR DataFrames support a number of commonly used functions to aggregate data after grouping. 
<\/span><br \/>\n<span style=\"font-weight: 400\">For example, the code below computes a histogram of the waiting time in the faithful dataset.<\/span><span style=\"font-weight: 400\"><br \/>\n<\/span><span style=\"font-weight: 400\"><br \/>\n<\/span># We use the `n` operator to count the number of times each waiting time appears<br \/>\nhead(summarize(groupBy(df, df$waiting), count = n(df$waiting)))<br \/>\n## \u00a0waiting count<br \/>\n##1 \u00a0\u00a0\u00a0\u00a0\u00a070 \u00a0\u00a0\u00a0\u00a04<br \/>\n##2 \u00a0\u00a0\u00a0\u00a0\u00a067 \u00a0\u00a0\u00a0\u00a01<br \/>\n##3 \u00a0\u00a0\u00a0\u00a0\u00a069 \u00a0\u00a0\u00a0\u00a02<br \/>\n# We can also sort the output from the aggregation to get the most common waiting times<br \/>\nwaiting_counts &lt;- summarize(groupBy(df, df$waiting), count = n(df$waiting))<br \/>\nhead(arrange(waiting_counts, desc(waiting_counts$count)))<br \/>\n## \u00a0\u00a0waiting count<br \/>\n##1 \u00a0\u00a0\u00a0\u00a0\u00a078 \u00a0\u00a0\u00a015<br \/>\n##2 \u00a0\u00a0\u00a0\u00a0\u00a083 \u00a0\u00a0\u00a014<br \/>\n##3 \u00a0\u00a0\u00a0\u00a0\u00a081 \u00a0\u00a0\u00a013<\/p>\n<p><strong><a href=\"https:\/\/data-flair.training\/blogs\/spark-executor\/\">Do you know about Spark Executor<\/a><\/strong><\/p>\n<h3>iii. Operating on Columns<\/h3>\n<p><span style=\"font-weight: 400\">SparkR also provides a number of functions that can be applied directly to columns for data processing and during aggregation. 
The following example shows the use of basic arithmetic functions.<\/span><span style=\"font-weight: 400\"><br \/>\n<\/span><span style=\"font-weight: 400\"><br \/>\n<\/span># Convert waiting time from minutes to seconds.<br \/>\n# Note that we can assign this to a new column in the same SparkDataFrame<br \/>\ndf$waiting_secs &lt;- df$waiting * 60<br \/>\nhead(df)<br \/>\n## \u00a0eruptions waiting waiting_secs<br \/>\n##1 \u00a0\u00a0\u00a0\u00a03.600 \u00a0\u00a0\u00a0\u00a0\u00a079 \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a04740<br \/>\n##2 \u00a0\u00a0\u00a0\u00a01.800 \u00a0\u00a0\u00a0\u00a0\u00a054 \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a03240<br \/>\n##3 \u00a0\u00a0\u00a0\u00a03.333 \u00a0\u00a0\u00a0\u00a0\u00a074 \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a04440<\/p>\n<h3>iv. Applying User-Defined Function<\/h3>\n<p><span style=\"font-weight: 400\">SparkR supports several kinds of user-defined functions:<\/span><span style=\"font-weight: 400\"><br \/>\n<\/span><span style=\"font-weight: 400\"><br \/>\n<\/span><strong>a. Run a given function on a large dataset using dapply or dapplyCollect<\/strong><\/p>\n<ul>\n<li><strong>dapply<\/strong><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400\">dapply applies a function to each partition of a SparkDataFrame.<\/span> The function must take a single parameter, to which the data.frame corresponding to each partition is passed. The output of the function should be a data.frame. The schema specifies the row format of the resulting SparkDataFrame. 
It must match the data types of the returned value.<\/p>\n<p><strong><a href=\"https:\/\/data-flair.training\/blogs\/learn-apache-spark-sparkcontext\/\">Have a look at SparkContext<br \/>\n<\/a><\/strong># Convert waiting time from minutes to seconds.<br \/>\n# Note that we can apply a UDF to a DataFrame.<br \/>\nschema &lt;- structType(structField(&quot;eruptions&quot;, &quot;double&quot;), structField(&quot;waiting&quot;, &quot;double&quot;),<br \/>\nstructField(&quot;waiting_secs&quot;, &quot;double&quot;))<br \/>\ndf1 &lt;- dapply(df, function(x) { x &lt;- cbind(x, x$waiting * 60) }, schema)<br \/>\nhead(collect(df1))<br \/>\n## \u00a0eruptions waiting waiting_secs<br \/>\n##1 \u00a0\u00a0\u00a0\u00a03.600 \u00a0\u00a0\u00a0\u00a0\u00a079 \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a04740<br \/>\n##2 \u00a0\u00a0\u00a0\u00a01.800 \u00a0\u00a0\u00a0\u00a0\u00a054 \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a03240<br \/>\n##3 \u00a0\u00a0\u00a0\u00a03.333 \u00a0\u00a0\u00a0\u00a0\u00a074 \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a04440<br \/>\n##4 \u00a0\u00a0\u00a0\u00a02.283 \u00a0\u00a0\u00a0\u00a0\u00a062 \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a03720<br \/>\n##5 \u00a0\u00a0\u00a0\u00a04.533 \u00a0\u00a0\u00a0\u00a0\u00a085 \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a05100<br \/>\n##6 \u00a0\u00a0\u00a0\u00a02.883 \u00a0\u00a0\u00a0\u00a0\u00a055 \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a03300<\/p>\n<ul>\n<li><strong>dapplyCollect<\/strong><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400\">Like dapply, dapplyCollect applies a function to each partition of a SparkDataFrame, and it also collects the result back to the driver. The output of the function should be a data.frame. However, no schema needs to be passed. <\/span><br \/>\n<span style=\"font-weight: 400\">Note: dapplyCollect can fail if the output of the UDF run on all the partitions cannot be pulled to the driver and fit in driver memory.<\/span><\/p>\n<p><strong><a href=\"https:\/\/data-flair.training\/blogs\/spark-machine-learning-with-r\/\">Do you know about Spark Machine Learning with R<br \/>\n<\/a><\/strong><span style=\"font-weight: 400\"><br \/>\n<\/span># Convert waiting time from minutes to seconds.<br \/>\n# Note that we can apply a UDF to a DataFrame and return an R data.frame<br \/>\nldf &lt;- dapplyCollect(<br \/>\ndf,<br \/>\nfunction(x) {<br \/>\nx &lt;- cbind(x, &quot;waiting_secs&quot; = x$waiting * 60)<br \/>\n})<br \/>\nhead(ldf, 3)<br \/>\n## \u00a0eruptions waiting waiting_secs<br \/>\n##1 \u00a0\u00a0\u00a0\u00a03.600 \u00a0\u00a0\u00a0\u00a0\u00a079 \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a04740<br \/>\n##2 \u00a0\u00a0\u00a0\u00a01.800 \u00a0\u00a0\u00a0\u00a0\u00a054 \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a03240<br \/>\n##3 \u00a0\u00a0\u00a0\u00a03.333 \u00a0\u00a0\u00a0\u00a0\u00a074 \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a04440<\/p>\n<p><strong>b. Run a given function on a large dataset grouping by input column(s) and using gapply or gapplyCollect<\/strong><\/p>\n<ul>\n<li><strong>gapply<\/strong><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400\">gapply applies a function to each group of a SparkDataFrame. The function should have only two parameters: the grouping key and the R data.frame corresponding to that key. The groups are chosen from the SparkDataFrame\u2019s column(s). The output of the function should be a data.frame. The schema specifies the row format of the resulting SparkDataFrame. It must represent the R function\u2019s output schema in terms of Spark data types. 
The column names of the returned data.frame are set by the user.<\/span><\/p>\n<p><strong><a href=\"https:\/\/data-flair.training\/blogs\/spark-interview-questions\/\">Best preparation for upcoming Spark Interview<br \/>\n<\/a><\/strong><span style=\"font-weight: 400\"><br \/>\n<\/span># Determine six waiting times with the largest eruption time in minutes.<br \/>\nschema &lt;- structType(structField(&quot;waiting&quot;, &quot;double&quot;), structField(&quot;max_eruption&quot;, &quot;double&quot;))<br \/>\nresult &lt;- gapply(<br \/>\ndf,<br \/>\n&quot;waiting&quot;,<br \/>\nfunction(key, x) {<br \/>\ny &lt;- data.frame(key, max(x$eruptions))<br \/>\n},<br \/>\nschema)<br \/>\nhead(collect(arrange(result, &quot;max_eruption&quot;, decreasing = TRUE)))<br \/>\n## \u00a0\u00a0\u00a0waiting \u00a0\u00a0max_eruption<br \/>\n##1 \u00a0\u00a0\u00a0\u00a0\u00a064 \u00a0\u00a0\u00a0\u00a0\u00a0\u00a05.100<br \/>\n##2 \u00a0\u00a0\u00a0\u00a0\u00a069 \u00a0\u00a0\u00a0\u00a0\u00a0\u00a05.067<br \/>\n##3 \u00a0\u00a0\u00a0\u00a0\u00a071 \u00a0\u00a0\u00a0\u00a0\u00a0\u00a05.033<br \/>\n##4 \u00a0\u00a0\u00a0\u00a0\u00a087 \u00a0\u00a0\u00a0\u00a0\u00a0\u00a05.000<br \/>\n##5 \u00a0\u00a0\u00a0\u00a0\u00a063 \u00a0\u00a0\u00a0\u00a0\u00a0\u00a04.933<br \/>\n##6 \u00a0\u00a0\u00a0\u00a0\u00a089 \u00a0\u00a0\u00a0\u00a0\u00a0\u00a04.900<\/p>\n<ul>\n<li><strong>gapplyCollect<\/strong><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400\">Like gapply, gapplyCollect applies a function to each group of a SparkDataFrame, and it also collects the result back to an R data.frame. Here, too, the output of the function should be a data.frame. However, no schema needs to be passed. <\/span><br \/>\n<b>Note:<\/b><span style=\"font-weight: 400\"> gapplyCollect can fail if the output of the UDF run on all the partitions cannot be pulled to the driver and fit in driver memory.<\/span><\/p>\n<p><strong><a href=\"https:\/\/data-flair.training\/blogs\/spark-stage\/\">Let&#8217;s revise the Spark Stage<br \/>\n<\/a><\/strong><span style=\"font-weight: 400\"><br \/>\n<\/span># Determine six waiting times with the largest eruption time in minutes.<br \/>\nresult &lt;- gapplyCollect(<br \/>\ndf,<br \/>\n&quot;waiting&quot;,<br \/>\nfunction(key, x) {<br \/>\ny &lt;- data.frame(key, max(x$eruptions))<br \/>\ncolnames(y) &lt;- c(&quot;waiting&quot;, &quot;max_eruption&quot;)<br \/>\ny<br \/>\n})<br \/>\nhead(result[order(result$max_eruption, decreasing = TRUE), ])<br \/>\n## \u00a0\u00a0\u00a0waiting \u00a0\u00a0max_eruption<br \/>\n##1 \u00a0\u00a0\u00a0\u00a0\u00a064 \u00a0\u00a0\u00a0\u00a0\u00a0\u00a05.100<br \/>\n##2 \u00a0\u00a0\u00a0\u00a0\u00a069 \u00a0\u00a0\u00a0\u00a0\u00a0\u00a05.067<br \/>\n##3 \u00a0\u00a0\u00a0\u00a0\u00a071 \u00a0\u00a0\u00a0\u00a0\u00a0\u00a05.033<br \/>\n##4 \u00a0\u00a0\u00a0\u00a0\u00a087 \u00a0\u00a0\u00a0\u00a0\u00a0\u00a05.000<br \/>\n##5 \u00a0\u00a0\u00a0\u00a0\u00a063 \u00a0\u00a0\u00a0\u00a0\u00a0\u00a04.933<br \/>\n##6 \u00a0\u00a0\u00a0\u00a0\u00a089 \u00a0\u00a0\u00a0\u00a0\u00a0\u00a04.900<\/p>\n<p><strong>c.\u00a0 Run local R functions distributed using spark.lapply<\/strong><br \/>\n<strong>spark.lapply<\/strong><br \/>\n<span style=\"font-weight: 400\">Similar to lapply in native R, spark.lapply runs a function over a list of elements and distributes the computations with Spark. It applies a function to the elements of a list in a manner similar to doParallel or lapply. The results of all the computations should fit on a single machine.<\/span><\/p>\n<p><strong><a href=\"https:\/\/data-flair.training\/blogs\/graphx-api-spark\/\">You must know about Spark GraphX API<br \/>\n<\/a><\/strong><span style=\"font-weight: 400\"><br \/>\n<\/span># Perform distributed training of multiple models with spark.lapply. 
Here, we pass<br \/>\n# a read-only list of arguments which specifies the family the generalized linear model should be.<br \/>\nfamilies &lt;- c(&quot;gaussian&quot;, &quot;poisson&quot;)<br \/>\ntrain &lt;- function(family) {<br \/>\nmodel &lt;- glm(Sepal.Length ~ Sepal.Width + Species, iris, family = family)<br \/>\nsummary(model)<br \/>\n}<br \/>\n# Return a list of model summaries<br \/>\nmodel.summaries &lt;- spark.lapply(families, train)<br \/>\n# Print the summary of each model<br \/>\nprint(model.summaries)<\/p>\n<h3>v. Running SQL Queries from SparkR<\/h3>\n<p><span style=\"font-weight: 400\">A SparkDataFrame can also be registered as a temporary view in Spark <a href=\"https:\/\/data-flair.training\/blogs\/sql-tutorial\/\"><strong>SQL<\/strong><\/a>, which allows us to run SQL queries over its data. The sql function enables applications to run SQL queries programmatically and returns the result as a SparkDataFrame.<\/span><span style=\"font-weight: 400\"><br \/>\n<\/span><span style=\"font-weight: 400\"><br \/>\n<\/span># Load a JSON file<br \/>\npeople &lt;- read.df(&quot;.\/examples\/src\/main\/resources\/people.json&quot;, &quot;json&quot;)<br \/>\n# Register this SparkDataFrame as a temporary view.<br \/>\ncreateOrReplaceTempView(people, &quot;people&quot;)<br \/>\n# SQL statements can be run by using the sql method<br \/>\nteenagers &lt;- sql(&quot;SELECT name FROM people WHERE age &gt;= 13 AND age &lt;= 19&quot;)<br \/>\nhead(teenagers)<br \/>\n## \u00a0\u00a0\u00a0name<br \/>\n##1 Justin<\/p>\n<p><strong><a href=\"https:\/\/data-flair.training\/blogs\/batch-processing-vs-real-time-processing\/\">Let&#8217;s check the comparison of Spark\u00a0Batch Processing and Real-time Processing<\/a><\/strong><\/p>\n<p>So, this was all about the SparkR DataFrame tutorial. We hope you liked our explanation.<\/p>\n<h2>4. Conclusion &#8211; SparkR DataFrame<\/h2>\n<p>As a result, we have seen all the SparkR DataFrame Operations. 
Also, we have seen several examples to understand the topic well. Although, if any query occurs, feel free to ask in the comment section.<\/p>\n<p><strong>See also &#8211;\u00a0<\/strong><\/p>\n<p><strong><a href=\"https:\/\/data-flair.training\/blogs\/spark-quiz-questions-part-2\/\">Spark Quiz<\/a><\/strong><\/p>\n<p><strong><a href=\"https:\/\/en.wikipedia.org\/wiki\/Apache_Spark\">Reference for Spark<\/a><\/strong><\/p>\n","protected":false},"excerpt":{"rendered":"<p>1. Objective In this article, we will learn the whole concept of SparkR DataFrame. Further, we will also learn SparkR DataFrame Operations and how to run SQL queries from SparkR. You must test your&#46;&#46;&#46;<\/p>\n","protected":false},"author":6,"featured_media":34402,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[10],"tags":[3537,13167],"class_list":["post-5626","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-spark","tag-dataframe-operations-in-sparkr","tag-sparkr-dataframe-operations"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.8 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>SparkR DataFrame and DataFrame Operations - DataFlair<\/title>\n<meta name=\"description\" content=\"In this article, we will learn the whole concept of SparkR DataFrame. Further, we will also learn SparkR DataFrame Operations.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/data-flair.training\/blogs\/sparkr-dataframe\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"SparkR DataFrame and DataFrame Operations - DataFlair\" \/>\n<meta property=\"og:description\" content=\"In this article, we will learn the whole concept of SparkR DataFrame. 
Further, we will also learn SparkR DataFrame Operations.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/data-flair.training\/blogs\/sparkr-dataframe\/\" \/>\n<meta property=\"og:site_name\" content=\"DataFlair\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DataFlairWS\/\" \/>\n<meta property=\"article:published_time\" content=\"2018-01-17T10:56:25+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2018-09-17T10:58:18+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Spark-DataFrame-Operations-in-SparkR-01-1-1.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1200\" \/>\n\t<meta property=\"og:image:height\" content=\"628\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"DataFlair Team\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@DataFlairWS\" \/>\n<meta name=\"twitter:site\" content=\"@DataFlairWS\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"DataFlair Team\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"6 minutes\" \/>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"SparkR DataFrame and DataFrame Operations - DataFlair","description":"In this article, we will learn the whole concept of SparkR DataFrame. 
Further, we will also learn SparkR DataFrame Operations.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/data-flair.training\/blogs\/sparkr-dataframe\/","og_locale":"en_US","og_type":"article","og_title":"SparkR DataFrame and DataFrame Operations - DataFlair","og_description":"In this article, we will learn the whole concept of SparkR DataFrame. Further, we will also learn SparkR DataFrame Operations.","og_url":"https:\/\/data-flair.training\/blogs\/sparkr-dataframe\/","og_site_name":"DataFlair","article_publisher":"https:\/\/www.facebook.com\/DataFlairWS\/","article_published_time":"2018-01-17T10:56:25+00:00","article_modified_time":"2018-09-17T10:58:18+00:00","og_image":[{"width":1200,"height":628,"url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Spark-DataFrame-Operations-in-SparkR-01-1-1.jpg","type":"image\/jpeg"}],"author":"DataFlair Team","twitter_card":"summary_large_image","twitter_creator":"@DataFlairWS","twitter_site":"@DataFlairWS","twitter_misc":{"Written by":"DataFlair Team","Est. 
reading time":"6 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/data-flair.training\/blogs\/sparkr-dataframe\/#article","isPartOf":{"@id":"https:\/\/data-flair.training\/blogs\/sparkr-dataframe\/"},"author":{"name":"DataFlair Team","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/person\/2c58ecb4f73a39f0ef993f1ddfcd7b89"},"headline":"SparkR DataFrame and DataFrame Operations","datePublished":"2018-01-17T10:56:25+00:00","dateModified":"2018-09-17T10:58:18+00:00","mainEntityOfPage":{"@id":"https:\/\/data-flair.training\/blogs\/sparkr-dataframe\/"},"wordCount":1275,"commentCount":0,"publisher":{"@id":"https:\/\/data-flair.training\/blogs\/#organization"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/sparkr-dataframe\/#primaryimage"},"thumbnailUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Spark-DataFrame-Operations-in-SparkR-01-1-1.jpg","keywords":["dataframe operations in sparkr","sparkR dataframe operations"],"articleSection":["Apache Spark Tutorials"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/data-flair.training\/blogs\/sparkr-dataframe\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/data-flair.training\/blogs\/sparkr-dataframe\/","url":"https:\/\/data-flair.training\/blogs\/sparkr-dataframe\/","name":"SparkR DataFrame and DataFrame Operations - DataFlair","isPartOf":{"@id":"https:\/\/data-flair.training\/blogs\/#website"},"primaryImageOfPage":{"@id":"https:\/\/data-flair.training\/blogs\/sparkr-dataframe\/#primaryimage"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/sparkr-dataframe\/#primaryimage"},"thumbnailUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Spark-DataFrame-Operations-in-SparkR-01-1-1.jpg","datePublished":"2018-01-17T10:56:25+00:00","dateModified":"2018-09-17T10:58:18+00:00","description":"In this article, we will learn the whole concept 
of SparkR DataFrame. Further, we will also learn SparkR DataFrame Operations.","breadcrumb":{"@id":"https:\/\/data-flair.training\/blogs\/sparkr-dataframe\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/data-flair.training\/blogs\/sparkr-dataframe\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/data-flair.training\/blogs\/sparkr-dataframe\/#primaryimage","url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Spark-DataFrame-Operations-in-SparkR-01-1-1.jpg","contentUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Spark-DataFrame-Operations-in-SparkR-01-1-1.jpg","width":1200,"height":628,"caption":"SparkR DataFrame and DataFrame Operations"},{"@type":"BreadcrumbList","@id":"https:\/\/data-flair.training\/blogs\/sparkr-dataframe\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Blog Home","item":"https:\/\/data-flair.training\/blogs\/"},{"@type":"ListItem","position":2,"name":"Apache Spark Tutorials","item":"https:\/\/data-flair.training\/blogs\/category\/spark\/"},{"@type":"ListItem","position":3,"name":"SparkR DataFrame and DataFrame Operations"}]},{"@type":"WebSite","@id":"https:\/\/data-flair.training\/blogs\/#website","url":"https:\/\/data-flair.training\/blogs\/","name":"DataFlair","description":"Learn Today. 
Lead Tomorrow.","publisher":{"@id":"https:\/\/data-flair.training\/blogs\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/data-flair.training\/blogs\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/data-flair.training\/blogs\/#organization","name":"DataFlair","url":"https:\/\/data-flair.training\/blogs\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/logo\/image\/","url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2016\/07\/Data-Flair.png","contentUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2016\/07\/Data-Flair.png","width":106,"height":48,"caption":"DataFlair"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DataFlairWS\/","https:\/\/x.com\/DataFlairWS","https:\/\/www.linkedin.com\/company\/dataflair-web-services-pvt-ltd\/","https:\/\/www.youtube.com\/user\/DataFlairWS"]},{"@type":"Person","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/person\/2c58ecb4f73a39f0ef993f1ddfcd7b89","name":"DataFlair Team","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/1ce4a0e3e542444fc73bbebf83e89e8b73e2d95ccb1fcee64da9945f078b97c5?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/1ce4a0e3e542444fc73bbebf83e89e8b73e2d95ccb1fcee64da9945f078b97c5?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/1ce4a0e3e542444fc73bbebf83e89e8b73e2d95ccb1fcee64da9945f078b97c5?s=96&d=mm&r=g","caption":"DataFlair Team"},"description":"The DataFlair Team provides industry-driven content on programming, Java, Python, C++, DSA, AI, ML, data Science, Android, Flutter, MERN, Web Development, and technology. 
Our expert educators focus on delivering value-packed, easy-to-follow resources for tech enthusiasts and professionals.","url":"https:\/\/data-flair.training\/blogs\/author\/dfteam2\/"}]}},"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/5626","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/users\/6"}],"replies":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/comments?post=5626"}],"version-history":[{"count":6,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/5626\/revisions"}],"predecessor-version":[{"id":34403,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/5626\/revisions\/34403"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/media\/34402"}],"wp:attachment":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/media?parent=5626"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/categories?post=5626"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/tags?post=5626"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}