September 20, 2018 at 2:47 pm · #5294 · DataFlair Team
How to create RDD in Apache Spark?
September 20, 2018 at 2:47 pm · #5296 · DataFlair Team
There are three ways to create an RDD:
(1) By parallelizing a collection in the driver program
(2) By loading an external dataset
(3) By transforming an already existing RDD
Create RDD by parallelizing collections:
Parallelized collections are created by calling the parallelize() method on an existing collection in the driver program.

val data = Array(1, 2, 3, 4, 5)
val rdd1 = sc.parallelize(data)
OR

val myList = sc.parallelize(1 to 1000, 5)

where 5 is the number of partitions. (Note that List(1 to 1000) would create a list containing a single Range element; pass the range itself, or (1 to 1000).toList.) If we do not specify the number of partitions, Spark uses the cluster's default parallelism (spark.default.parallelism).
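The parallelize() call above can be sketched as a complete, runnable program. This is a minimal sketch assuming Spark is on the classpath and running in local mode; the object name ParallelizeSketch is illustrative.

```scala
// Minimal sketch: parallelize a Scala collection into an RDD (local mode).
import org.apache.spark.sql.SparkSession

object ParallelizeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("ParallelizeSketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // (1 to 1000) is already a collection; no List(...) wrapper needed
    val nums = sc.parallelize(1 to 1000, 5)
    println(nums.getNumPartitions) // 5, as requested
    println(nums.sum())            // 500500.0

    spark.stop()
  }
}
```

Asking for 5 partitions here controls how many tasks operate on the data in parallel; getNumPartitions confirms the request was honored.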
Create RDD by loading an external dataset:
In Spark, a distributed dataset can be formed from any data source supported by Hadoop, including the local file system, HDFS, Cassandra, HBase, etc. Here the data is loaded from an external dataset. To create a text-file RDD, we can use SparkContext's textFile method. It takes the URL of the file and reads it as a collection of lines. The URL can be a local path on the machine or an hdfs://, s3n://, etc. path. Alternatively, use SparkSession.read to obtain a DataFrameReader, which supports many file formats:
i) csv(String path)

import org.apache.spark.sql.SparkSession

object DataFormat {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("AvgAnsTime").master("local").getOrCreate()
    val dataRDD = spark.read.csv("path/of/csv/file").rdd
  }
}
ii) json (String path)
val dataRDD = spark.read.json("path/of/json/file").rdd
iii) textFile (String path)
val dataRDD = spark.read.textFile("path/of/text/file").rdd
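The textFile path can also be shown end to end. This is a minimal sketch assuming Spark is on the classpath and local mode; it writes a small temporary file so the example is self-contained, and the object name TextFileSketch is illustrative.

```scala
// End-to-end sketch: write a small temp file, then load it as an RDD of lines.
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object TextFileSketch {
  def main(args: Array[String]): Unit = {
    // Create a two-line sample file to stand in for an external dataset
    val path = Files.createTempFile("sample", ".txt")
    Files.write(path, "first line\nsecond line\n".getBytes("UTF-8"))

    val spark = SparkSession.builder
      .appName("TextFileSketch")
      .master("local[*]")
      .getOrCreate()

    // textFile reads the file as an RDD[String], one element per line
    val lines = spark.sparkContext.textFile(path.toString)
    println(lines.count()) // 2

    spark.stop()
  }
}
```

In production the path would be an hdfs:// or s3 URL instead of a local temp file; the loading code is identical.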
Creating RDD from an existing RDD:
A transformation does not mutate its input; it produces a new RDD from an existing one, so transformations are the way to create an RDD from an already existing RDD.

val words = spark.sparkContext.parallelize(Seq("the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"))
val wordPair = words.map(w => (w.charAt(0), w))
wordPair.foreach(println)
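The same idea extends to chains of transformations: each step returns a fresh RDD and leaves its input untouched. A minimal sketch, assuming Spark on the classpath and local mode; the object name TransformSketch and the filter/map choices are illustrative.

```scala
// Sketch: derive new RDDs from an existing one via chained transformations.
import org.apache.spark.sql.SparkSession

object TransformSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("TransformSketch")
      .master("local[*]")
      .getOrCreate()

    val words = spark.sparkContext.parallelize(
      Seq("the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"))

    // Each transformation returns a *new* RDD; `words` itself is unchanged
    val longWords = words.filter(_.length > 3)
    val upper     = longWords.map(_.toUpperCase)
    println(upper.collect().mkString(", ")) // QUICK, BROWN, JUMPS, OVER, LAZY

    spark.stop()
  }
}
```

Because transformations are lazy, nothing is computed until the collect() action runs; Spark then builds the result by replaying the lineage words → filter → map.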
For a detailed description of creating RDDs, read How to create RDD in Apache Spark.