How can we create RDD in Apache Spark?
September 20, 2018 at 9:45 pm #6403
DataFlair Team (Spectator)
List the ways of creating RDDs in Spark.
Describe how RDDs are created in Apache Spark.
September 20, 2018 at 9:45 pm #6404
DataFlair Team (Spectator)
Resilient Distributed Dataset (RDD) is Spark's core abstraction.
It is an immutable (read-only) distributed collection of objects.
Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
RDDs may contain any type of Python, Java, or Scala objects, including user-defined classes.

There are three ways to create an RDD in Apache Spark:
1. By parallelizing a collection of objects
2. By loading an external dataset
3. From existing Apache Spark RDDs

1. Using a parallelized collection
RDDs are generally created by parallelizing an existing collection, i.e. by taking an existing collection in the program and passing it to SparkContext's parallelize() method.

scala> val data = Array(1, 2, 3, 4, 5)
scala> val dataRDD = sc.parallelize(data)
scala> dataRDD.count
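parallelize() also accepts an optional second argument that sets the number of partitions the collection is split into. A minimal sketch, where the partition count of 4 is only an illustrative value:

scala> val dataRDD4 = sc.parallelize(data, 4)  // split the collection into 4 partitions (illustrative)
scala> dataRDD4.getNumPartitions               // check how many partitions the RDD has

The Spark documentation suggests 2-4 partitions per CPU in the cluster as a reasonable starting point.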
2. External Datasets
In Spark, a distributed dataset can be formed from any data source supported by Hadoop.
val dataRDD = spark.read.textFile("F:/BigData/DataFlair/Spark/Posts.xml").rdd
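The same result can also be reached through the classic RDD API, without going through the DataFrame reader. A minimal sketch using SparkContext's textFile(); the HDFS path below is hypothetical:

scala> val linesRDD = sc.textFile("hdfs://namenode:9000/user/dataflair/Posts.xml")  // hypothetical path
scala> linesRDD.take(5).foreach(println)  // peek at the first few lines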
3. Creating an RDD from an existing RDD
Transformation is the way to create an RDD from an already existing RDD. A transformation acts as a function that takes an RDD as input and produces another RDD as output. The input RDD does not get changed, because RDDs are immutable.
Some of the transformations applied on an RDD are filter, map, and flatMap. For example:

val dataRDD = spark.read.textFile("F:/Mritunjay/BigData/DataFlair/Spark/Posts.xml").rdd
val resultRDD = dataRDD.filter(line => line.trim().startsWith("<row"))
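To illustrate the other two transformations named above, here is a minimal sketch of map and flatMap applied to resultRDD from the snippet above; the lambdas are illustrative only:

val lengthsRDD = resultRDD.map(line => line.length)         // map: exactly one output element per input
val wordsRDD = resultRDD.flatMap(line => line.split(" "))   // flatMap: zero or more output elements per input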
Go to the link to study the above-mentioned methods in detail: How to Create RDDs in Apache Spark?