Resilient Distributed Dataset (RDD) is Spark's core abstraction.
It is an immutable (read-only) distributed collection of objects.
The data in each RDD is divided into logical partitions,
which may be computed on different nodes of the cluster.
RDDs may contain any type of Python, Java, or Scala objects, including user-defined classes.
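For example, an RDD can hold instances of a user-defined class. A minimal sketch in the Scala shell (the Person case class is purely illustrative, and it uses the parallelize() method described below):
scala> case class Person(name: String, age: Int)
scala> val peopleRDD = sc.parallelize(Seq(Person("Alice", 30), Person("Bob", 25)))
scala> peopleRDD.map(_.name).collect()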
There are three ways to create an RDD in Apache Spark:
1. By parallelizing (distributing) an existing collection of objects
2. By loading an external dataset
3. From existing Apache Spark RDDs
1. Using a parallelized collection
RDDs are generally created by parallelizing an existing collection
i.e. by taking an existing collection in the program and passing
it to SparkContext’s parallelize() method.
scala> val data = Array(1, 2, 3, 4, 5)
scala> val dataRDD = sc.parallelize(data)
scala> dataRDD.count
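parallelize() also accepts an optional second argument specifying the number of partitions; as a minimal sketch (the partition count of 4 is chosen arbitrarily for illustration):
scala> val dataRDD4 = sc.parallelize(data, 4)
scala> dataRDD4.getNumPartitions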
2. External Datasets
In Spark, a distributed dataset can be formed from any data source supported by Hadoop.
val dataRDD = spark.read.textFile("F:/BigData/DataFlair/Spark/Posts.xml").rdd
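The same can be done with SparkContext's textFile() method, which accepts any Hadoop-supported URI; the paths below are placeholders used only for illustration:
val localRDD = sc.textFile("file:///path/to/data.txt")              // local file system
val hdfsRDD = sc.textFile("hdfs://namenode:9000/path/to/data.txt")  // HDFS
localRDD.count()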
3. Creating an RDD from an existing RDD
A transformation is the way to create an RDD from an already existing RDD.
A transformation acts as a function that takes in an RDD and produces another, resultant RDD.
The input RDD is not changed, because RDDs are immutable.
Some of the transformations applied on an RDD are filter, map, and flatMap, as shown in the sketches below.
val dataRDD = spark.read.textFile("F:/Mritunjay/BigData/DataFlair/Spark/Posts.xml").rdd
val resultRDD = dataRDD.filter(line => line.trim().startsWith("<row"))
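Likewise, map and flatMap each produce a new RDD from an existing one; a minimal sketch continuing from the resultRDD above:
val lengthRDD = resultRDD.map(line => line.length)         // map: exactly one output element per input element
val wordRDD = resultRDD.flatMap(line => line.split(" "))   // flatMap: zero or more output elements per input element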
To study the above-mentioned methods in detail, follow the link: How to Create RDDs in Apache Spark?