Explain Traits, properties and features of RDDs in Apache Spark
September 20, 2018 at 12:49 pm | #4937 | DataFlair Team (Spectator)
I need a brief explanation of RDDs in Apache Spark. Why are RDDs used to process data? What are the major features and characteristics of an RDD (Resilient Distributed Dataset)?
September 20, 2018 at 1:10 pm | #4968 | DataFlair Team (Spectator)
Properties/traits of an RDD:
- Immutable (read-only): An RDD cannot be changed once it is created; a transformation produces a new RDD instead. This makes the data safe to share across processes, and because an RDD can be created or recomputed at any time, caching, sharing, and replication are easy. It is also a way to keep computations consistent.
- Partitioned: An RDD is split into partitions, and the partition is the basic unit of parallelism. Each partition is a logical division of the data/records.
- Coarse-grained operations: An operation such as map, filter, or groupBy is applied to all elements of the dataset rather than to individual records.
- Actions/Transformations: All computations on RDDs are either transformations or actions. Transformations are lazy and return a new RDD; actions trigger the computation and return a result.
- Fault tolerant: "Resilient" in the name refers to the ability to recover lost data. Lost partitions are recomputed from the lineage graph, the record of transformations that produced the RDD. Because only these coarse-grained transformations are logged rather than the data itself, the overhead is low.
- Cacheable: An RDD can be held in storage (memory or disk) so that it can be retrieved more quickly the next time it is requested.
- Persistence: You can choose the storage level for a persisted RDD: in memory, on disk, or both.
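To make the traits above concrete, here is a minimal pure-Python sketch of the ideas. It is a toy model, not Spark's API: the class `ToyRDD` and its method `lose_cache` are hypothetical names invented for illustration, and everything runs in a single process. The method names `map`, `filter`, `cache`, and `collect` are chosen to mirror the RDD operations of the same names in Spark.

```python
from typing import Callable, List, Optional

Partition = List[int]


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical; not Spark's implementation)."""

    def __init__(self,
                 source: Optional[List[Partition]] = None,
                 parent: Optional["ToyRDD"] = None,
                 fn: Optional[Callable[[Partition], Partition]] = None):
        # Immutable: source partitions are copied into tuples and never modified.
        self._source = None if source is None else tuple(tuple(p) for p in source)
        self._parent = parent      # lineage: the dataset this one was derived from
        self._fn = fn              # coarse-grained function applied to each partition
        self._want_cache = False
        self._cached = None
        self.computations = 0      # how many times this dataset was computed

    # Transformations: lazy, they only record lineage and return a new ToyRDD.
    def map(self, f: Callable[[int], int]) -> "ToyRDD":
        return ToyRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def filter(self, pred: Callable[[int], bool]) -> "ToyRDD":
        return ToyRDD(parent=self, fn=lambda part: [x for x in part if pred(x)])

    def cache(self) -> "ToyRDD":
        # Mark the dataset as cacheable; data is stored on the first action.
        self._want_cache = True
        return self

    def lose_cache(self) -> None:
        # Simulate losing the cached partitions (for example, a failed worker).
        self._cached = None

    def _compute(self) -> List[Partition]:
        if self._cached is not None:
            return self._cached
        self.computations += 1
        if self._parent is None:
            parts = [list(p) for p in self._source]
        else:
            # Fault tolerance: rebuild from the parent by replaying the lineage.
            parts = [self._fn(p) for p in self._parent._compute()]
        if self._want_cache:
            self._cached = parts
        return parts

    # Action: triggers the computation and returns a plain result.
    def collect(self) -> List[int]:
        return [x for part in self._compute() for x in part]


numbers = ToyRDD(source=[[1, 2, 3], [4, 5, 6]])   # two partitions
evens_squared = (numbers.filter(lambda x: x % 2 == 0)
                        .map(lambda x: x * x)
                        .cache())

first = evens_squared.collect()    # computed now: [4, 16, 36]
second = evens_squared.collect()   # served from the cache
evens_squared.lose_cache()
third = evens_squared.collect()    # recomputed from lineage, same result
```

In this sketch `numbers` is never altered by the filter and map steps, nothing is computed until `collect()` is called, and the result after the simulated cache loss matches the earlier ones because it is rebuilt from the recorded transformations. Real Spark applies the same ideas across a cluster, per partition.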
You can also refer to the following blog post for a more detailed description: Features of RDD.