Explain Traits, properties and features of RDDs in Apache Spark

This topic has 1 reply, 1 voice, and was last updated 7 years, 10 months ago by DataFlair Team.

Author

Posts
- September 20, 2018 at 12:49 pm #4937
  
  DataFlair Team
  Spectator
  
  Need a brief explanation of RDDs in Apache Spark. Why are RDDs used to process data? What are the major features/characteristics of an RDD (Resilient Distributed Dataset)?
- September 20, 2018 at 1:10 pm #4968
  DataFlair Team
  Spectator
  Properties/Traits of RDD:
  - Immutable (read-only, cannot be changed or modified): The data is safe to share across processes. An RDD can be created or retrieved at any time, which makes caching, sharing and replication easy. Immutability is also a way to reach consistency in computations.
  - Partitioned: A partition is the basic unit of parallelism in an RDD. Each partition is a logical division of the data/records.
  - Coarse-grained operations: An operation is applied to all elements of the dataset at once, through operations such as map, filter or group by, rather than to individual records.
  - Actions/Transformations: All computations on RDDs are either transformations, which lazily define a new RDD from an existing one, or actions, which trigger the computation and return a result.
  - Fault tolerant: As "Resilient" in the name indicates, an RDD can recover lost data. Lost partitions are recomputed from the lineage graph, which records the coarse-grained transformations used to build the RDD instead of the data itself, so the overhead is low.
  - Cacheable: An RDD can be held in storage (memory/disk) after it is computed, so that it can be retrieved more quickly on the next request.
  - Persistence: You can choose which storage is used, either in memory or on disk.
  You can also refer to the blog post below for a more detailed description: Features of RDD.
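
To make the traits above a bit more concrete, here is a rough pure-Python sketch. It is NOT Spark's API: `ToyRDD`, `lose_partition` and `lineage` are hypothetical names made up for this illustration, and everything runs in a single process. The method names `parallelize`, `map`, `filter`, `persist` and `collect` are only borrowed from the real RDD API to show the same ideas: partitioning, immutability, lazy coarse-grained transformations, caching, and recomputing a lost partition from lineage.

```python
from typing import Callable, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not part of Spark)."""

    def __init__(self, compute: Callable[[int], List], num_partitions: int,
                 parent: Optional["ToyRDD"] = None, op: str = "source"):
        self._compute = compute            # recipe to (re)build one partition
        self.num_partitions = num_partitions
        self.parent = parent               # lineage link to the parent ToyRDD
        self.op = op
        self._cache = None                 # becomes a dict after persist()

    @staticmethod
    def parallelize(data, num_partitions: int) -> "ToyRDD":
        data = list(data)
        # Partitioned: split the records into logical chunks.
        chunks = [data[i::num_partitions] for i in range(num_partitions)]
        return ToyRDD(lambda i: list(chunks[i]), num_partitions)

    # Transformations: lazy and coarse-grained. Each one returns a NEW
    # ToyRDD and never modifies the existing one (immutability).
    def map(self, f) -> "ToyRDD":
        return ToyRDD(lambda i: [f(x) for x in self.partition(i)],
                      self.num_partitions, self, "map")

    def filter(self, pred) -> "ToyRDD":
        return ToyRDD(lambda i: [x for x in self.partition(i) if pred(x)],
                      self.num_partitions, self, "filter")

    def persist(self) -> "ToyRDD":
        # Cacheable: keep partitions once an action has computed them.
        if self._cache is None:
            self._cache = {}
        return self

    def partition(self, i: int) -> List:
        if self._cache is None:
            return self._compute(i)
        if i not in self._cache:
            # Missing (or lost) partition: rebuild it from the lineage.
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i: int) -> None:
        """Simulate losing one cached partition."""
        if self._cache is not None:
            self._cache.pop(i, None)

    # Action: triggers the actual computation and returns a result.
    def collect(self) -> List:
        out = []
        for i in range(self.num_partitions):
            out.extend(self.partition(i))
        return out

    def lineage(self) -> str:
        node, ops = self, []
        while node is not None:
            ops.append(node.op)
            node = node.parent
        return " <- ".join(ops)


numbers = ToyRDD.parallelize(range(1, 11), num_partitions=3)
evens_squared = (numbers.map(lambda x: x * x)
                        .filter(lambda x: x % 2 == 0)
                        .persist())

first = sorted(evens_squared.collect())    # action: work happens here
evens_squared.lose_partition(0)            # pretend a partition was lost
second = sorted(evens_squared.collect())   # rebuilt from lineage

print(first)
print(second == first)
print(evens_squared.lineage())
```

Nothing is computed until `collect()` is called, and `numbers` is left untouched by the `map` and `filter` calls. Real Spark does the same kind of thing across a cluster, with the storage location chosen through a storage level rather than a plain dictionary.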