What is RDD in Apache Spark?

      What is RDD – Resilient Distributed Dataset – in Spark?
      How does RDD make Spark a feature-rich framework?

      RDD – Resilient Distributed Dataset
      Resilient: if data is lost, it is recreated automatically (fault tolerant)
      Distributed: data is stored and processed across the nodes of a cluster
      Dataset: the data can come from different data stores.

      Abstraction is a technique for managing the complexity of computer systems.

      RDD is the basic abstraction in Apache Spark.

      Read the data in the form of an RDD
      |
      Transformation operations generate another RDD
      |
      An action operation
      |
      generates the final result

      RDDs are the fundamental unit of Spark and allow parallel processing of a dataset. An RDD is immutable, recomputable, and fault tolerant. In Spark programming, we perform operations only on RDDs.
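      As a minimal sketch of this flow in Scala (assuming a SparkContext named sc, as in the creation examples further below, and a hypothetical input file data.txt):

      val lines = sc.textFile("data.txt")  // read: an RDD of lines
      val lengths = lines.map(_.length)    // transformation: produces a new RDD
      val total = lengths.reduce(_ + _)    // action: computes the final result
      println(total)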

      Partitions:

      Partitions are the fragments (building blocks) of an RDD that allow Spark to execute in parallel. The data is stored in a distributed manner, and each partition of the RDD points to a piece of it. Partitions are distributed across the cluster and are a logical division of the data. One task processes one partition at a time. All input, intermediate, and output data is represented as partitions; the data of an RDD is just a collection of partitions.
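      A small sketch (again assuming sc) that makes the partitioning visible; the partition count and element range here are arbitrary choices for illustration:

      val rdd = sc.parallelize(1 to 100, numSlices = 4)  // request 4 partitions
      println(rdd.getNumPartitions)                      // prints 4
      // Up to 4 tasks can process this RDD in parallel, one per partition.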

      RDDs can be memory resident: based on the requirement, we can cache an RDD, i.e. keep it in memory.
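      For instance, a sketch of caching (reusing the placeholder path from the example below):

      val distFile = sc.textFile("path of file")
      distFile.cache()  // mark the RDD to be kept in memory (persist with MEMORY_ONLY)
      distFile.count()  // the first action computes the RDD and caches its partitions
      distFile.count()  // later actions reuse the in-memory copy instead of re-reading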

      Types of RDD operations:
      • Transformation:
      Creates a new RDD from an existing one, e.g. map, filter, join, etc.

      • Action:
      Returns a result to the driver or writes it to storage, e.g. count, reduce, collect, etc.
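      A short sketch contrasting the two (assuming sc); note that transformations are lazy and nothing executes until an action is called:

      val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))
      val evens = nums.filter(_ % 2 == 0)  // transformation: lazy, builds a new RDD
      val doubled = evens.map(_ * 2)       // transformation: still nothing executed
      println(doubled.count())             // action: runs the job, prints 2
      doubled.collect().foreach(println)   // action: brings 4 and 8 to the driver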

      Two ways to create an RDD
      Parallelizing a collection in the driver program:
      val data = Array(1, 2, 3, 4, 5)
      val distData = sc.parallelize(data)
      distData -> name of the RDD

      Loading/reading an external dataset:
      val distFile = sc.textFile("path of file")
      distFile -> name of the RDD

      For more details visit:
      RDD in Spark
