Ways to Start SparkR – Using SparkSession or RStudio
1. Objective
A SparkDataFrame is a distributed collection of data organized into named columns. Basically, there are two ways to start SparkR: by using SparkSession or by using RStudio. In this article, we will learn both processes in detail. Before covering the ways to start SparkR, we will also take a brief look at the SparkDataFrame.
2. SparkDataFrame
A SparkDataFrame is a distributed collection of data organized into named columns. Basically, it is the same as a table in a relational database or a data frame in R, but with richer optimizations under the hood.
In addition, we can construct SparkDataFrames from a wide array of sources: structured data files, tables in Hive, external databases, or existing local R data frames.
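For instance, a minimal sketch of building a SparkDataFrame from a local R data frame might look like this (it assumes a SparkR session is already running; faithful is a built-in R dataset used purely for illustration):
# Convert the built-in 'faithful' R data frame into a SparkDataFrame
df <- as.DataFrame(faithful)
# Inspect the first rows and the inferred schema
head(df)
printSchema(df)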
We can start SparkR using either of two methods: SparkSession or RStudio.
3. Ways to start SparkR
There are two ways to start SparkR; let us discuss them in detail.
a. Using SparkSession
Basically, SparkSession is the entry point into SparkR: it connects your R program to a Spark cluster. We create a SparkSession by calling sparkR.session and passing in options such as the application name, any Spark packages the application depends on, and so on. Afterwards, we can work with SparkDataFrames via the SparkSession. Note that if we are working from the SparkR shell, the SparkSession is already created for us, so we do not need to call sparkR.session.
sparkR.session()
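As an illustrative sketch, passing options to sparkR.session might look like the following (the application name and the spark-avro package coordinate are placeholders, not requirements):
# Start a session with an application name and an example package dependency
sparkR.session(
  appName = "MySparkRApp",  # placeholder application name
  sparkPackages = "com.databricks:spark-avro_2.11:3.0.0"  # example package
)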
b. Using RStudio
Apart from the previous method, we can also start SparkR from RStudio. From RStudio, the R shell, Rscript, or other R IDEs, we can connect our R program to a Spark cluster. To start, we need to make sure that SPARK_HOME is set in the environment.
Moreover, we can check SPARK_HOME with Sys.getenv, load the SparkR package, and call sparkR.session as shown below. This will check for the Spark installation; if it is not found, Spark will be downloaded and cached automatically. As an alternative, we can also run install.spark manually.
Furthermore, when calling sparkR.session, we can also specify certain Spark driver properties. Normally, these Application Properties and Runtime Environment properties cannot be set programmatically, since the driver JVM process will have already started; in this case, SparkR takes care of them for us. To set them, we pass them as we would other configuration properties, in the sparkConfig argument to sparkR.session().
# Set SPARK_HOME if it is not already defined in the environment
if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
  Sys.setenv(SPARK_HOME = "/home/spark")
}
# Load SparkR from the Spark installation and start a session
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))
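Once the session is up, a quick sanity check might look like this (a minimal sketch; sparkR.version reports the Spark version in use, and sparkR.session.stop ends the session when you are done):
# Confirm the session works by printing the Spark version
sparkR.version()
# Stop the session when finished
sparkR.session.stop()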
c. Spark Driver Properties
In addition, we can set the following Spark driver properties by using sparkConfig with sparkR.session from RStudio (see the sketch after the table).
Property Name | Property Group | spark-submit Equivalent
----------------------------- | ----------------------- | ------------------------
spark.master | Application Properties | --master
spark.yarn.keytab | Application Properties | --keytab
spark.yarn.principal | Application Properties | --principal
spark.driver.memory | Application Properties | --driver-memory
spark.driver.extraClassPath | Runtime Environment | --driver-class-path
spark.driver.extraJavaOptions | Runtime Environment | --driver-java-options
spark.driver.extraLibraryPath | Runtime Environment | --driver-library-path
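As an illustrative sketch, setting a couple of these properties through sparkConfig might look like this (the memory size and the JVM option are placeholder values):
# Pass driver properties via sparkConfig when starting the session
sparkR.session(master = "local[*]", sparkConfig = list(
  spark.driver.memory = "4g",  # placeholder driver memory
  spark.driver.extraJavaOptions = "-XX:+UseG1GC"  # placeholder JVM option
))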
4. Conclusion
As a result, we have seen both ways to start SparkR: using SparkSession and using RStudio. We have tried to cover all the insights regarding the same. Still, if any query occurs, feel free to ask in the comment section.