

{"id":5596,"date":"2018-01-16T10:19:12","date_gmt":"2018-01-16T10:19:12","guid":{"rendered":"https:\/\/data-flair.training\/blogs\/?p=5596"},"modified":"2018-11-16T13:55:24","modified_gmt":"2018-11-16T08:25:24","slug":"sparkr","status":"publish","type":"post","link":"https:\/\/data-flair.training\/blogs\/sparkr\/","title":{"rendered":"SparkR in Apache Spark | Operations &amp; Algorithms in Spark and R"},"content":{"rendered":"<h2><b>1. Objective<\/b><\/h2>\n<p>Today, in this article, we will learn the whole concept of SparkR. First, we will learn the process of creating SparkR DataFrames. Moreover, we will also learn some SparkR DataFrame operations. Further, we will learn MLlib algorithms exposed by Spark and R. 
In short, SparkR is an <strong><a href=\"https:\/\/data-flair.training\/blogs\/r-packages-tutorial\/\">R package<\/a><\/strong> that provides a lightweight frontend for using Apache Spark from R.<\/p>\n<div id=\"attachment_42323\" style=\"width: 1210px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/SparkR-in-Apache-Spark-01.jpg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-42323\" class=\"size-full wp-image-42323\" src=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/SparkR-in-Apache-Spark-01.jpg\" alt=\"SparkR in Apache Spark \" width=\"1200\" height=\"628\" srcset=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/SparkR-in-Apache-Spark-01.jpg 1200w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/SparkR-in-Apache-Spark-01-150x79.jpg 150w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/SparkR-in-Apache-Spark-01-300x157.jpg 300w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/SparkR-in-Apache-Spark-01-768x402.jpg 768w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/SparkR-in-Apache-Spark-01-1024x536.jpg 1024w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/SparkR-in-Apache-Spark-01-520x272.jpg 520w\" sizes=\"auto, (max-width: 1200px) 100vw, 1200px\" \/><\/a><p id=\"caption-attachment-42323\" class=\"wp-caption-text\">SparkR in Apache Spark<\/p><\/div>\n<h2><b>2. What is SparkR?<\/b><\/h2>\n<p>SparkR is an R package that provides a lightweight frontend for using <a href=\"https:\/\/data-flair.training\/blogs\/apache-spark-for-beginners\/\">Apache Spark<\/a> from R. 
Since Spark 1.4.x, it has offered a distributed DataFrame implementation that supports operations such as selection, filtering, and aggregation. Using MLlib, it also supports distributed machine learning.<\/p>\n<h2><b>3. SparkDataFrame in SparkR<\/b><\/h2>\n<p>A <a href=\"https:\/\/data-flair.training\/blogs\/apache-spark-sql-dataframe-tutorial\/\">Spark DataFrame<\/a> is a distributed collection of data organized into named columns. Conceptually, it is equivalent to a table in a relational database or a data frame in R. We can construct a SparkDataFrame from a wide array of sources, for example structured data files, <a href=\"https:\/\/data-flair.training\/blogs\/hive-internal-tables-vs-external-tables-comparison\/\">tables in Hive<\/a>, external databases, or existing local R data frames.<\/p>\n<h2><b>4. Starting Up: SparkSession<\/b><\/h2>\n<p>SparkSession is the entry point into SparkR; it connects your R program to a Spark cluster. We can create a SparkSession by calling sparkR.session, passing in options such as the application name and any Spark packages the application depends on. We then work with SparkDataFrames through the SparkSession.<br \/>\nWhen working from the sparkR shell, the SparkSession is already created for us, so we do not need to call sparkR.session.<br \/>\nsparkR.session()<\/p>\n<h2><b>5. 
Creating SparkDataFrames in SparkR<\/b><\/h2>\n<div id=\"attachment_5620\" style=\"width: 810px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Creating-SparkDataFrames-in-SparkR-01.jpg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-5620\" class=\"wp-image-5620 size-full\" src=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Creating-SparkDataFrames-in-SparkR-01.jpg\" alt=\"SparkR\" width=\"800\" height=\"800\" srcset=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Creating-SparkDataFrames-in-SparkR-01.jpg 800w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Creating-SparkDataFrames-in-SparkR-01-150x150.jpg 150w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Creating-SparkDataFrames-in-SparkR-01-300x300.jpg 300w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Creating-SparkDataFrames-in-SparkR-01-768x768.jpg 768w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Creating-SparkDataFrames-in-SparkR-01-100x100.jpg 100w\" sizes=\"auto, (max-width: 800px) 100vw, 800px\" \/><\/a><p id=\"caption-attachment-5620\" class=\"wp-caption-text\">Creating SparkDataFrames in Spark and R<\/p><\/div>\n<p>With a SparkSession, applications can create SparkDataFrames from a local R data frame, from a Hive table, or from other data sources.<\/p>\n<h3><b>i. From local data frames<\/b><\/h3>\n<p>The simplest way to create a SparkDataFrame is to convert a local R data frame. To do so, we can use as.DataFrame or createDataFrame. 
Both take the local R data frame as their argument.<br \/>\nThe following example creates a SparkDataFrame from the faithful dataset that ships with R.<br \/>\nFor Example<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">df &lt;- as.DataFrame(faithful)\r\n# Displays the first part of the SparkDataFrame\r\nhead(df)\r\n## \u00a0eruptions waiting\r\n##1 \u00a0\u00a0\u00a0\u00a03.600 \u00a0\u00a0\u00a0\u00a0\u00a079\r\n##2 \u00a0\u00a0\u00a0\u00a01.800 \u00a0\u00a0\u00a0\u00a0\u00a054\r\n##3 \u00a0\u00a0\u00a0\u00a03.333 \u00a0\u00a0\u00a0\u00a0\u00a074<\/pre>\n<h3><b>ii. From Data Sources<\/b><\/h3>\n<p>Through the SparkDataFrame interface, SparkR supports operating on a variety of data sources. The general method for creating SparkDataFrames from data sources is read.df. This method takes the path of the file to load and the type of data source; the currently active SparkSession is used automatically. SparkR supports reading JSON, CSV, and Parquet files natively.<br \/>\nConnectors for other file formats, such as Avro, are available as Spark packages. We can add these packages either by specifying --packages with the spark-submit or sparkR commands, or by passing the sparkPackages parameter when initializing a SparkSession from an interactive R shell or from RStudio.<br \/>\nsparkR.session(sparkPackages = \"com.databricks:spark-avro_2.11:3.0.0\")<br \/>\nThe following example shows how to use data sources with a JSON input file. Note that the file used here is not a typical JSON file. 
Instead, each line in the file must contain a separate, valid JSON object.<br \/>\n<b>For Example<\/b><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">people &lt;- read.df(\".\/examples\/src\/main\/resources\/people.json\", \"json\")\r\nhead(people)\r\n## \u00a0age \u00a0\u00a0\u00a0name\r\n##1 \u00a0NA Michael\r\n##2 \u00a030 \u00a0\u00a0\u00a0Andy\r\n##3 \u00a019 \u00a0Justin\r\n# SparkR automatically infers the schema from the JSON file\r\nprintSchema(people)\r\n# root\r\n# \u00a0|-- age: long (nullable = true)\r\n# \u00a0|-- name: string (nullable = true)\r\n# Similarly, multiple files can be read with read.json\r\npeople &lt;- read.json(c(\".\/examples\/src\/main\/resources\/people.json\", \".\/examples\/src\/main\/resources\/people2.json\"))\r\n# The data sources API also natively supports CSV input files (csvPath holds the path to a CSV file)\r\ndf &lt;- read.df(csvPath, \"csv\", header = \"true\", inferSchema = \"true\", na.strings = \"NA\")<\/pre>\n<p>The data sources API can also be used to save SparkDataFrames in multiple file formats.<br \/>\nFor example, we can save the SparkDataFrame from the previous example to a Parquet file using write.df.<br \/>\nwrite.df(people, path = \"people.parquet\", source = \"parquet\", mode = \"overwrite\")<\/p>\n<h3>iii. SparkDataFrames from Hive tables<\/h3>\n<p>We can also create SparkDataFrames from Hive tables. To do this, we need to create a SparkSession with Hive support, which can access tables in the Hive MetaStore. Note that Spark must have been built with Hive support. SparkR attempts to create a SparkSession with Hive support enabled by default 
(enableHiveSupport = TRUE).<br \/>\nsparkR.session()<br \/>\nFor Example<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">sql(\"CREATE TABLE IF NOT EXISTS src (key INT, value STRING)\")\r\nsql(\"LOAD DATA LOCAL INPATH 'examples\/src\/main\/resources\/kv1.txt' INTO TABLE src\")\r\n# Queries can be expressed in HiveQL.\r\nresults &lt;- sql(\"FROM src SELECT key, value\")\r\n# results is now a SparkDataFrame\r\nhead(results)\r\n## \u00a0key \u00a0\u00a0value\r\n## 1 238 val_238\r\n## 2 \u00a086 \u00a0val_86\r\n## 3 311 val_311<\/pre>\n<h2><b>6. SparkDataFrame Operations<\/b><\/h2>\n<div id=\"attachment_5621\" style=\"width: 810px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/SparkDataFrame-Operations-in-SparkR-01.jpg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-5621\" class=\"wp-image-5621 size-full\" src=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/SparkDataFrame-Operations-in-SparkR-01.jpg\" alt=\"SparkR\" width=\"800\" height=\"800\" srcset=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/SparkDataFrame-Operations-in-SparkR-01.jpg 800w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/SparkDataFrame-Operations-in-SparkR-01-150x150.jpg 150w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/SparkDataFrame-Operations-in-SparkR-01-300x300.jpg 300w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/SparkDataFrame-Operations-in-SparkR-01-768x768.jpg 768w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/SparkDataFrame-Operations-in-SparkR-01-100x100.jpg 100w\" sizes=\"auto, (max-width: 800px) 100vw, 800px\" \/><\/a><p id=\"caption-attachment-5621\" class=\"wp-caption-text\">SparkDataFrame Operations<\/p><\/div>\n<p>For structured data processing, 
SparkDataFrames support a number of functions. Some basic examples follow.<\/p>\n<h3><b>i. Selecting rows, columns<\/b><\/h3>\n<p>For Example<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\"># Create the SparkDataFrame\r\ndf &lt;- as.DataFrame(faithful)\r\n# Get basic information about the SparkDataFrame\r\ndf\r\n## SparkDataFrame[eruptions:double, waiting:double]\r\n# Select only the \"eruptions\" column\r\nhead(select(df, df$eruptions))\r\n## \u00a0eruptions\r\n##1 \u00a0\u00a0\u00a0\u00a03.600\r\n##2 \u00a0\u00a0\u00a0\u00a01.800\r\n##3 \u00a0\u00a0\u00a0\u00a03.333\r\n# You can also pass in column names as strings\r\nhead(select(df, \"eruptions\"))\r\n# Filter the SparkDataFrame to only retain rows with wait times shorter than 50 mins\r\nhead(filter(df, df$waiting &lt; 50))\r\n## \u00a0eruptions waiting\r\n##1 \u00a0\u00a0\u00a0\u00a01.750 \u00a0\u00a0\u00a0\u00a0\u00a047\r\n##2 \u00a0\u00a0\u00a0\u00a01.750 \u00a0\u00a0\u00a0\u00a0\u00a047\r\n##3 \u00a0\u00a0\u00a0\u00a01.867 \u00a0\u00a0\u00a0\u00a0\u00a048<\/pre>\n<h3><b>ii. Grouping, Aggregation<\/b><\/h3>\n<p>SparkDataFrames support a number of commonly used functions to aggregate data after grouping. 
For example, we can compute a histogram of the waiting time in the faithful dataset as shown below.<br \/>\n<b>For Example<\/b><br \/>\nWe use the `n` operator to count the number of times each waiting time appears:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">head(summarize(groupBy(df, df$waiting), count = n(df$waiting)))\r\n## \u00a0waiting count\r\n##1 \u00a0\u00a0\u00a0\u00a0\u00a070 \u00a0\u00a0\u00a0\u00a04\r\n##2 \u00a0\u00a0\u00a0\u00a0\u00a067 \u00a0\u00a0\u00a0\u00a01\r\n##3 \u00a0\u00a0\u00a0\u00a0\u00a069 \u00a0\u00a0\u00a0\u00a02<\/pre>\n<p>We can also sort the output from the aggregation to get the most common waiting times:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">waiting_counts &lt;- summarize(groupBy(df, df$waiting), count = n(df$waiting))\r\nhead(arrange(waiting_counts, desc(waiting_counts$count)))\r\n## \u00a0\u00a0waiting count\r\n##1 \u00a0\u00a0\u00a0\u00a0\u00a078 \u00a0\u00a0\u00a015\r\n##2 \u00a0\u00a0\u00a0\u00a0\u00a083 \u00a0\u00a0\u00a014\r\n##3 \u00a0\u00a0\u00a0\u00a0\u00a081 \u00a0\u00a0\u00a013<\/pre>\n<h3><b>iii. Operating on Columns<\/b><\/h3>\n<p>SparkR also provides a number of functions that can be applied directly to columns for data processing and during aggregation. 
The example below shows the use of basic arithmetic functions.<br \/>\n<b>For Example<\/b><br \/>\nConvert waiting time from minutes to seconds. Note that we can assign the result to a new column in the same SparkDataFrame:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">df$waiting_secs &lt;- df$waiting * 60\r\nhead(df)\r\n## \u00a0eruptions waiting waiting_secs\r\n##1 \u00a0\u00a0\u00a0\u00a03.600 \u00a0\u00a0\u00a0\u00a0\u00a079 \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a04740\r\n##2 \u00a0\u00a0\u00a0\u00a01.800 \u00a0\u00a0\u00a0\u00a0\u00a054 \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a03240\r\n##3 \u00a0\u00a0\u00a0\u00a03.333 \u00a0\u00a0\u00a0\u00a0\u00a074 \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a04440<\/pre>\n<h2><b>7. Machine Learning Algorithms in Spark and R<\/b><\/h2>\n<div id=\"attachment_5622\" style=\"width: 810px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Machine-Learning-Algorithms-in-SparkR-01.jpg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-5622\" class=\"wp-image-5622 size-full\" src=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Machine-Learning-Algorithms-in-SparkR-01.jpg\" alt=\"SparkR\" width=\"800\" height=\"800\" srcset=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Machine-Learning-Algorithms-in-SparkR-01.jpg 800w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Machine-Learning-Algorithms-in-SparkR-01-150x150.jpg 150w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Machine-Learning-Algorithms-in-SparkR-01-300x300.jpg 300w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Machine-Learning-Algorithms-in-SparkR-01-768x768.jpg 768w, 
https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/Machine-Learning-Algorithms-in-SparkR-01-100x100.jpg 100w\" sizes=\"auto, (max-width: 800px) 100vw, 800px\" \/><\/a><p id=\"caption-attachment-5622\" class=\"wp-caption-text\">Machine Learning Algorithms in SparkR<\/p><\/div>\n<p>SparkR currently supports the following machine learning algorithms.<\/p>\n<h3><b>i. Classification<\/b><\/h3>\n<p>spark.logit: Logistic Regression<br \/>\nspark.mlp: Multilayer Perceptron (MLP)<br \/>\nspark.naiveBayes: Naive Bayes<br \/>\nspark.svmLinear: Linear Support Vector Machine<\/p>\n<h3><b>ii. Regression<\/b><\/h3>\n<p>spark.survreg: Accelerated Failure Time (AFT) Survival Model<br \/>\nspark.glm or glm: Generalized Linear Model (GLM)<br \/>\nspark.isoreg: Isotonic Regression<\/p>\n<h3><b>iii. Tree<\/b><\/h3>\n<p>spark.gbt: Gradient Boosted Trees for Regression and Classification<br \/>\nspark.randomForest: Random Forest for Regression and Classification<\/p>\n<h3><b>iv. Clustering<\/b><\/h3>\n<p>spark.bisectingKmeans: Bisecting k-means<br \/>\nspark.gaussianMixture: Gaussian Mixture Model (GMM)<br \/>\nspark.kmeans: K-Means<br \/>\nspark.lda: Latent Dirichlet Allocation (LDA)<\/p>\n<h3><b>v. Collaborative Filtering<\/b><\/h3>\n<p>spark.als: Alternating Least Squares (ALS)<\/p>\n<h3><b>vi. Frequent Pattern Mining<\/b><\/h3>\n<p>spark.fpGrowth: FP-growth<\/p>\n<h3><b>vii. Statistics<\/b><\/h3>\n<p>spark.kstest: Kolmogorov-Smirnov Test<\/p>\n<p>Under the hood, SparkR uses MLlib to train the model. For model fitting, it supports a subset of the available R formula operators, including \u2018~\u2019, \u2018.\u2019, \u2018:\u2019, \u2018+\u2019, and \u2018-\u2019.<\/p>\n<h3><b>viii. 
Model persistence<\/b><\/h3>\n<p>The following example shows how to save and load an MLlib model with SparkR.<br \/>\n<b>For Example<\/b><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">training &lt;- read.df(\"data\/mllib\/sample_multiclass_classification_data.txt\", source = \"libsvm\")<\/pre>\n<p>Fit a generalized linear model of family &#8220;gaussian&#8221; with spark.glm:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">df_list &lt;- randomSplit(training, c(7,3), 2)\r\ngaussianDF &lt;- df_list[[1]]\r\ngaussianTestDF &lt;- df_list[[2]]\r\ngaussianGLM &lt;- spark.glm(gaussianDF, label ~ features, family = \"gaussian\")\r\n# Save and then load a fitted MLlib model\r\nmodelPath &lt;- tempfile(pattern = \"ml\", fileext = \".tmp\")\r\nwrite.ml(gaussianGLM, modelPath)\r\ngaussianGLM2 &lt;- read.ml(modelPath)\r\n# Check model summary\r\nsummary(gaussianGLM2)\r\n# Check model prediction\r\ngaussianPredictions &lt;- predict(gaussianGLM2, gaussianTestDF)\r\nhead(gaussianPredictions)\r\nunlink(modelPath)<\/pre>\n<h2><b>8. Conclusion<\/b><\/h2>\n<p>In this article, we have seen how to use Apache Spark from R through SparkR. We covered how to create SparkDataFrames, the basic SparkDataFrame operations, and the MLlib algorithms that SparkR exposes.<br \/>\nFinally, if you have any questions, feel free to ask in the comment section.<\/p>\n<p><strong><a href=\"https:\/\/en.wikipedia.org\/wiki\/Apache_Spark\">Reference for Spark<\/a><\/strong><\/p>\n","protected":false},"excerpt":{"rendered":"<p>1. Objective Today, in this article, we will learn the whole concept of SparkR. First, we will learn the process of creating SparkR DataFrames. Moreover, we will also learn some SparkR DataFrame operations. 
Further,&#46;&#46;&#46;<\/p>\n","protected":false},"author":6,"featured_media":42323,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[10],"tags":[7111,16605,16597,13165,13168,16604],"class_list":["post-5596","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-spark","tag-introduction-to-sparkr","tag-mllib-in-spark-r","tag-spark-and-r","tag-sparkr","tag-sparkr-in-apache-spark","tag-sparkr-tutorial"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.4 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>SparkR in Apache Spark | Operations &amp; Algorithms in Spark and R - DataFlair<\/title>\n<meta name=\"description\" content=\"SparkR Introduction to learn what is SparkR in terms of Spark, how to create sparkr DataFrames, operations on sparkr, machine learning algorithms in sparkr.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/data-flair.training\/blogs\/sparkr\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"SparkR in Apache Spark | Operations &amp; Algorithms in Spark and R - DataFlair\" \/>\n<meta property=\"og:description\" content=\"SparkR Introduction to learn what is SparkR in terms of Spark, how to create sparkr DataFrames, operations on sparkr, machine learning algorithms in sparkr.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/data-flair.training\/blogs\/sparkr\/\" \/>\n<meta property=\"og:site_name\" content=\"DataFlair\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DataFlairWS\/\" \/>\n<meta property=\"article:published_time\" content=\"2018-01-16T10:19:12+00:00\" \/>\n<meta property=\"article:modified_time\" 
content=\"2018-11-16T08:25:24+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/SparkR-in-Apache-Spark-01.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1200\" \/>\n\t<meta property=\"og:image:height\" content=\"628\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"DataFlair Team\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@DataFlairWS\" \/>\n<meta name=\"twitter:site\" content=\"@DataFlairWS\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"DataFlair Team\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"7 minutes\" \/>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"SparkR in Apache Spark | Operations &amp; Algorithms in Spark and R - DataFlair","description":"SparkR Introduction to learn what is SparkR in terms of Spark, how to create sparkr DataFrames, operations on sparkr, machine learning algorithms in sparkr.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/data-flair.training\/blogs\/sparkr\/","og_locale":"en_US","og_type":"article","og_title":"SparkR in Apache Spark | Operations &amp; Algorithms in Spark and R - DataFlair","og_description":"SparkR Introduction to learn what is SparkR in terms of Spark, how to create sparkr DataFrames, operations on sparkr, machine learning algorithms in 
sparkr.","og_url":"https:\/\/data-flair.training\/blogs\/sparkr\/","og_site_name":"DataFlair","article_publisher":"https:\/\/www.facebook.com\/DataFlairWS\/","article_published_time":"2018-01-16T10:19:12+00:00","article_modified_time":"2018-11-16T08:25:24+00:00","og_image":[{"width":1200,"height":628,"url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/SparkR-in-Apache-Spark-01.jpg","type":"image\/jpeg"}],"author":"DataFlair Team","twitter_card":"summary_large_image","twitter_creator":"@DataFlairWS","twitter_site":"@DataFlairWS","twitter_misc":{"Written by":"DataFlair Team","Est. reading time":"7 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/data-flair.training\/blogs\/sparkr\/#article","isPartOf":{"@id":"https:\/\/data-flair.training\/blogs\/sparkr\/"},"author":{"name":"DataFlair Team","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/person\/2c58ecb4f73a39f0ef993f1ddfcd7b89"},"headline":"SparkR in Apache Spark | Operations &amp; Algorithms in Spark and R","datePublished":"2018-01-16T10:19:12+00:00","dateModified":"2018-11-16T08:25:24+00:00","mainEntityOfPage":{"@id":"https:\/\/data-flair.training\/blogs\/sparkr\/"},"wordCount":1126,"commentCount":1,"publisher":{"@id":"https:\/\/data-flair.training\/blogs\/#organization"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/sparkr\/#primaryimage"},"thumbnailUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/SparkR-in-Apache-Spark-01.jpg","keywords":["Introduction to SparkR","MLlib in Spark R","Spark and R","SparkR","SparkR in Apache Spark","SparkR Tutorial"],"articleSection":["Apache Spark 
Tutorials"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/data-flair.training\/blogs\/sparkr\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/data-flair.training\/blogs\/sparkr\/","url":"https:\/\/data-flair.training\/blogs\/sparkr\/","name":"SparkR in Apache Spark | Operations &amp; Algorithms in Spark and R - DataFlair","isPartOf":{"@id":"https:\/\/data-flair.training\/blogs\/#website"},"primaryImageOfPage":{"@id":"https:\/\/data-flair.training\/blogs\/sparkr\/#primaryimage"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/sparkr\/#primaryimage"},"thumbnailUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/SparkR-in-Apache-Spark-01.jpg","datePublished":"2018-01-16T10:19:12+00:00","dateModified":"2018-11-16T08:25:24+00:00","description":"SparkR Introduction to learn what is SparkR in terms of Spark, how to create sparkr DataFrames, operations on sparkr, machine learning algorithms in sparkr.","breadcrumb":{"@id":"https:\/\/data-flair.training\/blogs\/sparkr\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/data-flair.training\/blogs\/sparkr\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/data-flair.training\/blogs\/sparkr\/#primaryimage","url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/SparkR-in-Apache-Spark-01.jpg","contentUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2018\/01\/SparkR-in-Apache-Spark-01.jpg","width":1200,"height":628,"caption":"SparkR in Apache Spark"},{"@type":"BreadcrumbList","@id":"https:\/\/data-flair.training\/blogs\/sparkr\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Blog Home","item":"https:\/\/data-flair.training\/blogs\/"},{"@type":"ListItem","position":2,"name":"Apache Spark 
Tutorials","item":"https:\/\/data-flair.training\/blogs\/category\/spark\/"},{"@type":"ListItem","position":3,"name":"SparkR in Apache Spark | Operations &amp; Algorithms in Spark and R"}]},{"@type":"WebSite","@id":"https:\/\/data-flair.training\/blogs\/#website","url":"https:\/\/data-flair.training\/blogs\/","name":"DataFlair","description":"Learn Today. Lead Tomorrow.","publisher":{"@id":"https:\/\/data-flair.training\/blogs\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/data-flair.training\/blogs\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/data-flair.training\/blogs\/#organization","name":"DataFlair","url":"https:\/\/data-flair.training\/blogs\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/logo\/image\/","url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2016\/07\/Data-Flair.png","contentUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2016\/07\/Data-Flair.png","width":106,"height":48,"caption":"DataFlair"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DataFlairWS\/","https:\/\/x.com\/DataFlairWS","https:\/\/www.linkedin.com\/company\/dataflair-web-services-pvt-ltd\/","https:\/\/www.youtube.com\/user\/DataFlairWS"]},{"@type":"Person","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/person\/2c58ecb4f73a39f0ef993f1ddfcd7b89","name":"DataFlair 
Team","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/1ce4a0e3e542444fc73bbebf83e89e8b73e2d95ccb1fcee64da9945f078b97c5?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/1ce4a0e3e542444fc73bbebf83e89e8b73e2d95ccb1fcee64da9945f078b97c5?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/1ce4a0e3e542444fc73bbebf83e89e8b73e2d95ccb1fcee64da9945f078b97c5?s=96&d=mm&r=g","caption":"DataFlair Team"},"description":"The DataFlair Team provides industry-driven content on programming, Java, Python, C++, DSA, AI, ML, data Science, Android, Flutter, MERN, Web Development, and technology. Our expert educators focus on delivering value-packed, easy-to-follow resources for tech enthusiasts and professionals.","url":"https:\/\/data-flair.training\/blogs\/author\/dfteam2\/"}]}},"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/5596","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/users\/6"}],"replies":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/comments?post=5596"}],"version-history":[{"count":6,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/5596\/revisions"}],"predecessor-version":[{"id":42325,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/5596\/revisions\/42325"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/media\/42323"}],"wp:attachment":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/media?parent=5596"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/categories?post=5596"},{"ta
xonomy":"post_tag","embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/tags?post=5596"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}