September 20, 2018 at 3:45 pm #5610 DataFlair Team Spectator
What is an RDD (Resilient Distributed Dataset) in Spark?
How do RDDs make Spark a feature-rich framework?
September 20, 2018 at 3:47 pm #5616 DataFlair Team Spectator
RDD – Resilient Distributed Dataset
Resilient: If data is lost, it is recreated automatically (fault tolerant).
Distributed: Data is stored and processed in a distributed manner across the cluster.
Dataset: Data can come from different data stores.
Abstraction is a technique for managing the complexity of computer systems. The RDD is the basic abstraction in Apache Spark.
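To make the "recreated automatically" point a bit more concrete: Spark keeps track of how each RDD was derived from its parents (its lineage) and can replay those steps to rebuild a lost partition. The snippet below is only a small sketch, assuming a local in-process context; the app name and the variable names are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode with 2 threads; in spark-shell, getOrCreate
// simply returns the context that already exists.
val conf = new SparkConf().setAppName("rdd-lineage-sketch").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

val numbers = sc.parallelize(1 to 10)
val evens = numbers.filter(_ % 2 == 0)   // derived from numbers
val doubled = evens.map(_ * 2)           // derived from evens

// toDebugString describes the chain of parent RDDs behind `doubled`
val lineage = doubled.toDebugString
println(lineage)
```

The printed text lists the parent RDDs of `doubled`, which is the information Spark uses for recomputation.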
Read the data in the form of an RDD
|
A transformation generates another RDD
|
An action is performed
|
The final result is generated
RDDs are the fundamental unit of Spark and allow a dataset to be processed in parallel. An RDD is immutable, recomputable, and fault tolerant. In Spark programming, we perform operations only on RDDs.
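The read, transform, action flow above can be sketched roughly as follows. This is a minimal illustration, assuming a local in-process context; the app name and variable names are made up. Note that the transformations never change `data` itself, which is what "immutable" means here.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode with 2 threads; in spark-shell, getOrCreate
// reuses the existing context.
val conf = new SparkConf().setAppName("rdd-pipeline-sketch").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

// 1. Read the data in the form of an RDD
val data = sc.parallelize(Array(1, 2, 3, 4, 5))

// 2. Transformations: each one returns a new RDD and is evaluated lazily
val squares = data.map(n => n * n)
val evenSquares = squares.filter(_ % 2 == 0)

// 3. Actions: these trigger the computation and return results to the driver
val howMany = evenSquares.count()
val total = squares.reduce(_ + _)
val original = data.collect()   // the source RDD is unchanged

println(s"even squares: $howMany, sum of squares: $total, original: ${original.mkString(",")}")
```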
Partitions:
These are the fragments (building blocks) of an RDD, and they are what allows Spark to execute in parallel. The data is stored in a distributed manner, and each partition of the RDD points to its own share of that data. Partitions are distributed across the cluster and are a logical division of the data. One task processes one partition at a time. All input, intermediate, and output data is represented as partitions; the data of an RDD is simply the collection of its partitions.
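Partitions can be inspected from the driver. The sketch below assumes a local in-process context and asks for 4 partitions explicitly; the names are illustrative only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode with 2 threads; in spark-shell, getOrCreate
// reuses the existing context.
val conf = new SparkConf().setAppName("rdd-partitions-sketch").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

// Ask for 4 partitions when creating the RDD
val rdd = sc.parallelize(1 to 8, numSlices = 4)
val numPartitions = rdd.getNumPartitions

// glom() turns each partition into an array, so the split becomes visible
val perPartition = rdd.glom().collect()

// One task handles one partition: compute a sum per partition, tagged with its index
val partitionSums = rdd
  .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.sum)))
  .collect()

perPartition.zipWithIndex.foreach { case (elems, idx) =>
  println(s"partition $idx: ${elems.mkString(", ")}")
}
```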
RDDs can be memory resident. Depending on the requirement, we can cache an RDD, i.e. keep it in memory.
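A rough sketch of caching, assuming a local in-process context (names are illustrative). For RDDs, `cache()` is shorthand for `persist(StorageLevel.MEMORY_ONLY)`; the data is actually materialized in memory when the first action runs.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Assumption: local mode with 2 threads; in spark-shell, getOrCreate
// reuses the existing context.
val conf = new SparkConf().setAppName("rdd-cache-sketch").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

val squares = sc.parallelize(1 to 1000).map(n => n.toLong * n)

// Mark the RDD to be kept in memory
squares.cache()
val level = squares.getStorageLevel

// The first action computes the RDD and fills the cache;
// the second action can reuse the cached partitions
val sumOfSquares = squares.reduce(_ + _)
val howMany = squares.count()

// Release the cached data once it is no longer needed
squares.unpersist()
```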
Types of RDD operation:
Transformation:
Creates a new RDD from an existing one, e.g. map, filter, join, etc.
Action:
Returns the result or writes it to storage, e.g. count, reduce, collect, etc.
Two ways to create an RDD
Parallelizing a collection in the driver program
// sc is the SparkContext that spark-shell provides
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
distData -> name of the RDD
Loading/reading external datasets
val distFile = sc.textFile("path of file")
distFile -> name of the RDD
For more details visit:
RDD in Spark