Explain the RDD properties.

    • #5432
      DataFlair Team
      Spectator

      Explain the RDD properties.

    • #5435
      DataFlair Team
      Spectator

      • An RDD (Resilient Distributed Dataset) is the basic abstraction in Apache Spark.
      • An RDD is an immutable, partitioned collection of elements distributed across the cluster that can be operated on in parallel.
      • Each RDD is characterized by five main properties:

        The first three properties describe the lineage of the RDD.

        1. A list of partitions
        2. A list of dependencies on other (parent) RDDs
        3. A function to compute each partition

        The remaining two properties are optional and are used for optimization during execution.

        4. Optional preferred locations for computing each partition (e.g. the block locations of an HDFS file); this is what enables data locality
        5. Optional partitioner for key/value RDDs (e.g. hash partitioning), which determines how records are routed to partitions when data is shuffled
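
The five properties can be pictured as a small interface. The sketch below is a pure-Python toy model, not Spark's actual classes: the names `ToyRDD`, `ListRDD`, and every method on them are hypothetical and chosen only to mirror the list above.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Iterator, List, Optional


class ToyRDD(ABC):
    """Toy stand-in for an RDD (hypothetical; not Spark's real class)."""

    # --- Lineage properties (1-3) ---
    @abstractmethod
    def partitions(self) -> List[int]:
        """1. The list of partitions (represented here by their indices)."""

    def dependencies(self) -> List["ToyRDD"]:
        """2. The parent RDDs this RDD is derived from (none by default)."""
        return []

    @abstractmethod
    def compute(self, split: int) -> Iterator[Any]:
        """3. Produce the elements of a single partition."""

    # --- Optimization properties (4-5), both optional ---
    def preferred_locations(self, split: int) -> List[str]:
        """4. Hosts on which computing this partition would be data-local."""
        return []

    def partitioner(self) -> Optional[Callable[[Any], int]]:
        """5. How keys map to partitions, if that is known."""
        return None

    def collect(self) -> List[Any]:
        """Evaluate all partitions in order (sequentially, for illustration)."""
        return [x for p in self.partitions() for x in self.compute(p)]


class ListRDD(ToyRDD):
    """Source RDD over an in-memory list cut into roughly equal slices."""

    def __init__(self, data: List[Any], num_partitions: int) -> None:
        self._data = data
        self._n = num_partitions

    def partitions(self) -> List[int]:
        return list(range(self._n))

    def compute(self, split: int) -> Iterator[Any]:
        size = len(self._data)
        start = split * size // self._n
        end = (split + 1) * size // self._n
        return iter(self._data[start:end])


rdd = ListRDD([1, 2, 3, 4, 5, 6], num_partitions=3)
print(rdd.partitions())                       # [0, 1, 2]
print(list(rdd.compute(1)))                   # [3, 4]
print(rdd.dependencies(), rdd.partitioner())  # [] None
```

Only properties 1 and 3 are abstract here, because a source RDD has no parents and both optimization properties are optional.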

        Examples:
        #HadoopRDD:

      • HadoopRDD provides core functionality for reading data stored in Hadoop-compatible storage (HDFS, HBase, Amazon S3, ...) using the older MapReduce API (org.apache.hadoop.mapred).
      • Properties of HadoopRDD:


      1. List of partitions: one per HDFS block (more precisely, one per input split, which normally corresponds to a block)
      2. List of dependencies on parent RDDs: none
      3. Function to compute each partition: read the corresponding HDFS block
      4. Optional preferred locations: the locations of the HDFS block
      5. Optional partitioner: none
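
As a rough illustration of these five answers, here is a pure-Python toy that imitates a HadoopRDD over a file already cut into blocks. It does not touch Hadoop or Spark: `Block`, `ToyHadoopRDD`, and the host names are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Block:
    """Hypothetical stand-in for one HDFS block of a file."""
    lines: List[str]   # records stored in the block
    hosts: List[str]   # nodes that hold a replica of the block


class ToyHadoopRDD:
    """Toy model of a HadoopRDD (not the real Spark class)."""

    def __init__(self, blocks: List[Block]) -> None:
        self._blocks = blocks

    def partitions(self) -> List[int]:
        # 1. One partition per block.
        return list(range(len(self._blocks)))

    def dependencies(self) -> list:
        # 2. A source RDD has no parents.
        return []

    def compute(self, split: int) -> Iterator[str]:
        # 3. Computing a partition means reading its block.
        return iter(self._blocks[split].lines)

    def preferred_locations(self, split: int) -> List[str]:
        # 4. Prefer the nodes that already store the block.
        return self._blocks[split].hosts

    def partitioner(self) -> None:
        # 5. Raw file data carries no key-based partitioning.
        return None


blocks = [
    Block(lines=["a", "b"], hosts=["node1", "node2"]),
    Block(lines=["c"], hosts=["node2", "node3"]),
]
hadoop_rdd = ToyHadoopRDD(blocks)
print(hadoop_rdd.partitions())            # [0, 1]
print(list(hadoop_rdd.compute(0)))        # ['a', 'b']
print(hadoop_rdd.preferred_locations(1))  # ['node2', 'node3']
```

A scheduler that knows the preferred locations can try to run the task for partition 1 on node2 or node3 and avoid moving the block over the network.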

      #FilteredRDD:

      • Properties of FilteredRDD:


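      1. List of partitions: the same number of partitions as the parent RDD
      2. List of dependencies on parent RDDs: a one-to-one dependency on the parent (each partition depends only on the matching parent partition)
      3. Function to compute each partition: compute the parent partition, then filter it
      4. Optional preferred locations: none of its own (ask the parent)
      5. Optional partitioner: none of its own (filtering does not change keys, so the parent's partitioner, if any, still applies)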
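
The same idea can be sketched for a filtered RDD. This is again a pure-Python toy with hypothetical names (`ToyParentRDD`, `ToyFilteredRDD`), not Spark code. As far as I can tell, recent Spark releases implement `filter` through a more general internal RDD class rather than a dedicated FilteredRDD, but the five properties work out the same way. Passing the parent's partitioner through is an assumption of this sketch; it is `None` here because the toy parent has none.

```python
from typing import Any, Callable, Iterator, List, Tuple


class ToyParentRDD:
    """Hypothetical parent: in-memory data already split into partitions."""

    def __init__(self, chunks: List[List[Any]], hosts: List[List[str]]) -> None:
        self._chunks = chunks
        self._hosts = hosts

    def partitions(self) -> List[int]:
        return list(range(len(self._chunks)))

    def dependencies(self) -> list:
        return []

    def compute(self, split: int) -> Iterator[Any]:
        return iter(self._chunks[split])

    def preferred_locations(self, split: int) -> List[str]:
        return self._hosts[split]

    def partitioner(self) -> None:
        return None


class ToyFilteredRDD:
    """Toy model of a filtered RDD (not the real Spark class)."""

    def __init__(self, parent: ToyParentRDD, predicate: Callable[[Any], bool]) -> None:
        self._parent = parent
        self._predicate = predicate

    def partitions(self) -> List[int]:
        # 1. Same partitions as the parent.
        return self._parent.partitions()

    def dependencies(self) -> List[Tuple[str, ToyParentRDD]]:
        # 2. One-to-one: partition i depends only on parent partition i.
        return [("one-to-one", self._parent)]

    def compute(self, split: int) -> Iterator[Any]:
        # 3. Compute the parent partition, then filter it lazily.
        return (x for x in self._parent.compute(split) if self._predicate(x))

    def preferred_locations(self, split: int) -> List[str]:
        # 4. Nothing of its own: ask the parent.
        return self._parent.preferred_locations(split)

    def partitioner(self):
        # 5. Nothing of its own: pass through whatever the parent has.
        return self._parent.partitioner()


parent = ToyParentRDD([[1, 2, 3], [4, 5, 6]], [["node1"], ["node2"]])
evens = ToyFilteredRDD(parent, lambda x: x % 2 == 0)
print(evens.partitions())            # [0, 1]
print(list(evens.compute(0)))        # [2]
print(list(evens.compute(1)))        # [4, 6]
print(evens.preferred_locations(1))  # ['node2']
```

Because each output partition needs only the matching parent partition, no data has to move between nodes to apply the filter.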

      For the features of RDDs, see "RDD Features in Spark".
      For a detailed tour of RDDs, see "RDD in Spark".
