What is Resilient Distributed Dataset (RDD) in Apache Spark


  • Author
    • #5005
      DataFlair Team

      Explain what an RDD is, how it provides abstraction in Spark, and how it makes Spark operator-rich.

    • #5006
      DataFlair Team

      An RDD (Resilient Distributed Dataset) in Apache Spark represents a set of records as an immutable, distributed collection of objects. It can be thought of as a large collection of data, or as an array of references to partitioned objects. Each RDD is logically divided into partitions that are spread across many servers, so that they can be computed on different nodes of the cluster. RDDs are fault tolerant: if a partition is lost because of a failure, Spark recomputes it from the operations that produced it. The data can be loaded from external sources such as JSON files, CSV files, text files, or a database via JDBC, and it does not need to have any specific data structure.
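The recovery idea above can be illustrated with a small toy model in plain Python. This is only a sketch of the concept, not Spark code: `apply_lineage`, `source_partitions`, and `lineage` are hypothetical names, and the real scheduler is far more involved. It assumes the source data is still available in stable storage, so a lost partition can be rebuilt by replaying the recorded steps over just that partition.

```python
def apply_lineage(partition, lineage):
    """Replay every recorded transformation, in order, over one partition."""
    records = list(partition)
    for step in lineage:
        records = step(records)
    return records


# Hypothetical source data, already split into three partitions in stable storage.
source_partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# The "lineage": the ordered transformations that produce the derived dataset.
lineage = [
    lambda rs: [r * 10 for r in rs],            # multiply every record by 10
    lambda rs: [r for r in rs if r % 20 == 0],  # keep multiples of 20
]

# Compute every partition once.
computed = {i: apply_lineage(p, lineage) for i, p in enumerate(source_partitions)}

# Simulate a node failure that loses partition 1.
del computed[1]

# Recover by recomputing only the lost partition from its source and the lineage.
computed[1] = apply_lineage(source_partitions[1], lineage)
```

Only the lost partition is recomputed; the other two are left untouched, which is why lineage-based recovery does not require replicating the derived data.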

      RDDs are lazily evaluated: a transformation is not computed when it is declared, only when an action needs its result, which avoids unnecessary work. An RDD is a read-only, partitioned collection of data. RDDs can be created through deterministic operations on data in stable storage or on other RDDs, for example by parallelizing an existing collection in your application or by referencing a dataset in an external storage system. RDDs are also cacheable. Iterative computations such as logistic regression, k-means clustering, and PageRank run many jobs over the same data, and caching lets those jobs reuse or share that data instead of rebuilding it each time.

      To learn more about the RDD follow: RDD Tutorial

      To learn how to create RDD and perform various operations follow: RDD Quick Start Guide

    • #5007
      DataFlair Team

      The RDD is the fundamental data structure of Apache Spark and provides its core abstraction. It is an immutable collection of objects that is computed on different nodes of the cluster. It is resilient and lazily evaluated, and it is also statically typed.

      RDDs support two kinds of operations.
      1. Transformations: A transformation applies a function to an RDD and creates a new RDD. It does not modify the original RDD, and the new RDD keeps a pointer to its parent RDD. When a transformation is called, Spark does not execute it immediately; instead it records a lineage (the sequence of transformations that have to be applied to that RDD, including where the data has to be read from).
      2. Actions: An action is an operation that triggers the computation and returns a value.
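The difference between the two kinds of operations can be sketched with a toy class in plain Python. `ToyRDD` is a hypothetical stand-in written for illustration, not Spark's implementation. It assumes a single machine and no partitioning, and only shows that transformations return a new object that points to its parent, while actions walk the lineage and actually run the functions.

```python
class ToyRDD:
    """Toy stand-in for an RDD: immutable, lazy, and aware of its parent."""

    def __init__(self, source=None, parent=None, op=None, name="source"):
        self._source = tuple(source) if source is not None else None
        self.parent = parent  # pointer to the parent ToyRDD (None for the root)
        self._op = op         # function applied to the parent's iterator
        self.name = name

    # Transformations: build a new ToyRDD and run nothing.
    def map(self, f):
        return ToyRDD(parent=self, op=lambda it: (f(x) for x in it), name="map")

    def filter(self, pred):
        return ToyRDD(parent=self, op=lambda it: (x for x in it if pred(x)), name="filter")

    def lineage(self):
        """Names of all steps from the source down to this ToyRDD."""
        node, steps = self, []
        while node is not None:
            steps.append(node.name)
            node = node.parent
        return list(reversed(steps))

    def _iterate(self):
        if self.parent is None:
            return iter(self._source)
        return self._op(self.parent._iterate())

    # Actions: trigger the computation and return a value.
    def collect(self):
        return list(self._iterate())

    def count(self):
        return sum(1 for _ in self._iterate())


calls = []  # records every time double() actually runs


def double(x):
    calls.append(x)
    return x * 2


numbers = ToyRDD(source=[1, 2, 3, 4, 5])
doubled_multiples_of_4 = numbers.map(double).filter(lambda x: x % 4 == 0)

calls_before_action = len(calls)           # still 0: nothing has executed yet
result = doubled_multiples_of_4.collect()  # the action runs the whole lineage
```

After `collect()`, `result` is `[4, 8]`, `doubled_multiples_of_4.lineage()` is `["source", "map", "filter"]`, and `numbers` still holds its original five records.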

      RDDs are divided into smaller logical chunks of data called partitions. When an action is executed, one task is launched per partition, so the number of partitions directly determines the degree of parallelism. Spark decides the number of partitions automatically, but we can control it with the repartition or coalesce transformations. The partitions are distributed across the nodes in the cluster.
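As a rough illustration of "one task per partition", here is a plain-Python sketch that uses a thread pool in place of a cluster. The helper names `split_into_partitions` and `sum_action` are hypothetical, and the contiguous, near-equal split is an assumption made for the example; it is not how Spark chooses partition boundaries.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Split records into num_partitions contiguous chunks of near-equal size."""
    size, extra = divmod(len(records), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(records[start:end])
        start = end
    return partitions


def sum_action(partitions):
    """Run one task per partition, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial_sums = list(pool.map(sum, partitions))
    return partial_sums, sum(partial_sums)


records = list(range(1, 11))

# Four partitions: four tasks can run side by side.
four_partitions = split_into_partitions(records, 4)
partials_4, total_4 = sum_action(four_partitions)

# Two partitions: same answer, but only two tasks.
two_partitions = split_into_partitions(records, 2)
partials_2, total_2 = sum_action(two_partitions)
```

Both runs produce the same total; only the number of tasks changes. In Spark itself, the number of partitions of an RDD is changed with `repartition` or `coalesce`, as mentioned above.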

      There are three ways to create an RDD in Spark: loading data from an external file or stable storage, transforming an existing RDD, and parallelizing a collection in the driver program.

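      One of the most important characteristics of RDDs is caching. We can mark an RDD for caching by calling rdd.cache(). The first time an action computes the RDD, each partition is kept in the memory of the node that holds it, and later actions reuse it instead of recomputing it. This can improve performance considerably when the same RDD is used more than once.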
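The effect of caching can be shown with a toy model in plain Python. `LazyDataset` is a hypothetical class written for this example and is not the Spark API. It assumes everything fits in local memory, and it only counts how often an expensive function runs across two actions, with and without caching.

```python
class LazyDataset:
    """Toy lazy dataset: recomputed on every action unless cache() was called."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument function that rebuilds the records
        self._cached = None
        self._cache_requested = False

    def cache(self):
        # Only marks the dataset; nothing is computed or stored yet.
        self._cache_requested = True
        return self

    def _records(self):
        if self._cached is not None:
            return self._cached
        records = list(self._compute())
        if self._cache_requested:
            self._cached = records  # keep the result of the first computation
        return records

    # Actions
    def count(self):
        return len(self._records())

    def total(self):
        return sum(self._records())


evaluations = {"n": 0}  # how many times the expensive function has run


def expensive_square(x):
    evaluations["n"] += 1
    return x * x


def build():
    return (expensive_square(x) for x in range(5))


# Without caching: each action recomputes all five records.
uncached = LazyDataset(build)
uncached.count()
uncached.total()
evaluations_without_cache = evaluations["n"]

# With caching: the first action computes, the second one reuses the result.
evaluations["n"] = 0
cached = LazyDataset(build).cache()
evaluations_after_cache_call = evaluations["n"]
cached.count()
cached.total()
evaluations_with_cache = evaluations["n"]
```

In this toy run the expensive function executes 10 times without caching and 5 times with it, and calling `cache()` by itself executes nothing.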
