
Ways to start SparkR – Using Sparksession or RStudio

1. Objective

A SparkDataFrame is a distributed collection of data organized into named columns. There are two ways in which we can start SparkR: by using SparkSession or by using RStudio. In this article, we will learn both processes in detail. Before covering the ways to start SparkR, we will also take a brief look at the SparkDataFrame.


2. SparkDataFrame

A SparkDataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R, but with richer optimizations under the hood.
In addition, we can construct SparkDataFrames from a wide array of sources: structured data files, tables in Hive, external databases, or existing local R data frames, as the sketch below shows.

We can start SparkR using two methods: either SparkSession or RStudio.

3. Ways to start SparkR

There are two main ways to start SparkR; let’s discuss them in detail.

a. Using SparkSession

SparkSession is the entry point into SparkR; it connects your R program to a Spark cluster. We can easily create a SparkSession by calling sparkR.session and passing in options such as the application name, any Spark packages the application depends on, and so on. Afterwards, we can work with SparkDataFrames via the SparkSession. Note that if we are working from the sparkR shell, the SparkSession is already created for us, and we do not need to call sparkR.session.

sparkR.session()
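As a sketch of passing options, the call below sets an application name and an extra Spark package; both values here are illustrative assumptions, not required settings.

# Start a session with an application name and an extra Spark package
# (the package coordinate below is only an example)
sparkR.session(appName = "my-sparkr-app",
               sparkPackages = "com.databricks:spark-avro_2.11:4.0.0")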

b. Using RStudio

Apart from SparkSession, we can also start SparkR from RStudio. More generally, we can connect our R program to a Spark cluster from RStudio, the R shell, Rscript, or other R IDEs. To start, make sure SPARK_HOME is set in the environment (we can check it with Sys.getenv), load the SparkR package, and call sparkR.session as shown below. It will check for the Spark installation and, if it is not found, Spark will be downloaded and cached automatically. As an alternative, we can also run install.spark manually.

Furthermore, when calling sparkR.session, we can also specify certain Spark driver properties. Normally, these Application Properties and Runtime Environment properties cannot be set programmatically, because the driver JVM process would already have been started; in this case, SparkR takes care of this for us. To set them, pass them as you would other configuration properties in the sparkConfig argument to sparkR.session().

# Set SPARK_HOME if it is not already set in the environment
if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
  Sys.setenv(SPARK_HOME = "/home/spark")
}
# Load the SparkR package bundled with the Spark installation
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
# Start a local session with 2 GB of driver memory
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))
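If no Spark installation is found, a minimal sketch of the manual alternative mentioned above is to run install.spark yourself; by default it downloads a Spark distribution and caches it locally.

# Manually download and cache a Spark distribution for SparkR
library(SparkR)
install.spark()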

c. Spark Driver Properties

In addition, the following Spark driver properties can be set in sparkConfig with sparkR.session from RStudio.

Property Name                   Property group           spark-submit equivalent
spark.master                    Application Properties   --master
spark.yarn.keytab               Application Properties   --keytab
spark.yarn.principal            Application Properties   --principal
spark.driver.memory             Application Properties   --driver-memory
spark.driver.extraClassPath     Runtime Environment      --driver-class-path
spark.driver.extraJavaOptions   Runtime Environment      --driver-java-options
spark.driver.extraLibraryPath   Runtime Environment      --driver-library-path
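As a sketch, the call below sets one Application property and one Runtime Environment property through sparkConfig; the specific values are illustrative assumptions.

# Pass driver properties through sparkConfig; these map to the
# spark-submit flags --driver-memory and --driver-java-options
sparkR.session(sparkConfig = list(
  spark.driver.memory = "4g",
  spark.driver.extraJavaOptions = "-XX:+UseG1GC"
))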

4. Conclusion

As a result, we have seen the ways to start SparkR: by SparkSession and by RStudio. We have tried to cover all the insights regarding the same. Still, if any query occurs, feel free to ask in the comment section.
