How to create RDD in Apache Spark?

    • #5294
      DataFlair Team
      Spectator

      How to create RDD in Apache Spark?

    • #5296
      DataFlair Team
      Spectator

      There are three ways to create an RDD:
      (1) by parallelizing a collection in the driver program,
      (2) by loading an external dataset, or
      (3) by transforming an already existing RDD.


      Create an RDD by parallelizing a collection:

      Parallelized collections are created by calling SparkContext's parallelize() method on an existing collection in the driver program.

      val data = Array(1, 2, 3, 4, 5)
      val rdd = sc.parallelize(data)

      OR 

      val myList = sc.parallelize((1 to 1000).toList, 5) // 5 is the number of partitions
      (If we do not specify the number of partitions, Spark uses the default from spark.default.parallelism.)
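
      As a quick check, an RDD's getNumPartitions method reports how many partitions were created. A minimal sketch, assuming a live SparkContext named sc as in spark-shell:

      println(myList.getNumPartitions) // prints 5
      println(sc.parallelize(1 to 1000).getNumPartitions) // default: spark.default.parallelism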

      Create an RDD by loading an external dataset:

      In Spark, a distributed dataset can be formed from any data source supported by Hadoop, including the local file system, HDFS, Cassandra, HBase, etc. Here the data is loaded from an external dataset. To create a text-file RDD, we can use SparkContext's textFile method, which takes the URL of the file and reads it as a collection of lines. The URL can be a local path on the machine or an hdfs://, s3n://, etc. URI. Alternatively, use SparkSession.read to access an instance of DataFrameReader, which supports many file formats:
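
      For example, a minimal sketch of the SparkContext.textFile route (the path below is just a placeholder):

      val lines = sc.textFile("hdfs://path/of/text/file")
      println(lines.count()) // number of lines in the file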

      i) csv (String path)

      import org.apache.spark.sql.SparkSession

      object DataFormat {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder.appName("AvgAnsTime").master("local").getOrCreate()
          // each line of the CSV becomes a Row in the resulting RDD[Row]
          val dataRDD = spark.read.csv("path/of/csv/file").rdd
        }
      }
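
      If the CSV file has a header row, DataFrameReader's option method can tell Spark to use it for column names (a small variant of the snippet above; the path is still a placeholder):

      val dataRDD = spark.read.option("header", "true").csv("path/of/csv/file").rdd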

      ii) json (String path)

      val dataRDD = spark.read.json("path/of/json/file").rdd

      iii) textFile (String path)

      val dataRDD = spark.read.textFile("path/of/text/file").rdd


      Creating an RDD from an existing RDD:

      A transformation produces a new RDD from an existing one (RDDs themselves are immutable), so applying a transformation is the way to create an RDD from an already existing RDD.

      val words = spark.sparkContext.parallelize(Seq("the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"))
      // pair each word with its first character
      val wordPair = words.map(w => (w.charAt(0), w))
      wordPair.foreach(println)
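
      The same pattern works with any transformation. For instance, a minimal sketch deriving a filtered RDD from the words RDD above:

      val shortWords = words.filter(_.length <= 3) // keep words of at most 3 letters
      shortWords.collect().foreach(println) // the, fox, the, dog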

      For a detailed description of creating RDDs, read How to create RDD in Apache Spark.
