What is RDD in Apache Spark?

      What is RDD – Resilient Distributed Dataset – in Spark?
      How does RDD make Spark a feature-rich framework?

      RDD – Resilient Distributed Dataset
      Resilient: if data is lost, it is recreated automatically (fault tolerant)
      Distributed: data is stored and processed across the nodes of a cluster
      Dataset: the data can come from different data stores.

      Abstraction is a technique for managing the complexity of computer systems.

      RDD is the basic abstraction in Apache Spark.

      Read the data in the form of an RDD
      |
      Transformation operations generate another RDD
      |
      An action operation
      |
      generates the final result

      RDDs are the fundamental unit of Spark and allow parallel processing of a dataset. An RDD is immutable, recomputable, and fault tolerant. In Spark programming, we perform operations only on RDDs.
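      As a minimal sketch of this flow in Scala (assuming a SparkContext named sc, as in the creation examples further below, and a hypothetical input file data.txt):

      val lines = sc.textFile("data.txt")  // read: an RDD of lines
      val lengths = lines.map(_.length)    // transformation: produces a new RDD
      val total = lengths.reduce(_ + _)    // action: computes the final result
      println(total)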

      Partitions:

      Partitions are the fragments (building blocks) of an RDD that allow Spark to execute in parallel. The data is stored in a distributed manner, and each partition of the RDD points to a piece of it. Partitions are distributed across the cluster and are a logical division of the data. One task processes one partition at a time. All input, intermediate, and output data is represented as partitions; the data of an RDD is just a collection of partitions.
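      A small sketch (again assuming sc) that makes the partitioning visible; the partition count and element range here are arbitrary choices for illustration:

      val rdd = sc.parallelize(1 to 100, numSlices = 4)  // request 4 partitions
      println(rdd.getNumPartitions)                      // prints 4
      // Up to 4 tasks can process this RDD in parallel, one per partition.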

      RDDs can be memory resident: based on the requirement, we can cache an RDD, i.e. keep it in memory.
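      For instance, a sketch of caching (reusing the placeholder path from the example below):

      val distFile = sc.textFile("path of file")
      distFile.cache()  // mark the RDD to be kept in memory (persist with MEMORY_ONLY)
      distFile.count()  // the first action computes the RDD and caches its partitions
      distFile.count()  // later actions reuse the in-memory copy instead of re-reading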

      Types of RDD operations:
      • Transformation:
      Creates a new RDD from an existing one, e.g. map, filter, join, etc.

      • Action:
      Returns a result to the driver or writes it to storage, e.g. count, reduce, collect, etc.
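      A short sketch contrasting the two (assuming sc); note that transformations are lazy and nothing executes until an action is called:

      val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))
      val evens = nums.filter(_ % 2 == 0)  // transformation: lazy, builds a new RDD
      val doubled = evens.map(_ * 2)       // transformation: still nothing executed
      println(doubled.count())             // action: runs the job, prints 2
      doubled.collect().foreach(println)   // action: brings 4 and 8 to the driver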

      Two ways to create an RDD
      Parallelizing a collection in the driver program:
      val data = Array(1, 2, 3, 4, 5)
      val distData = sc.parallelize(data)
      distData -> name of the RDD

      Loading/reading an external dataset:
      val distFile = sc.textFile("path of file")
      distFile -> name of the RDD

      For more details visit:
      RDD in Spark
