RDDs vs DataFrames
September 20, 2018 at 12:24 pm #4832 DataFlair Team (Spectator)
What is the difference between RDDs and DataFrames?
September 20, 2018 at 12:26 pm #4841 DataFlair Team (Spectator)
DataFrame: A DataFrame is a distributed collection of data organized into rows and named columns. It is conceptually equivalent to a table in a relational database, but with richer optimization. It is a data abstraction with a domain-specific language (DSL) that applies to structured and semi-structured data. Like a data frame in R, it has a matrix-like, two-dimensional structure whose columns may be of different types (for example numeric, boolean, or string): each column contains the values of one variable, and each row contains one set of values, one for each column. In that sense it combines features of lists and matrices.
For more details about DataFrame, please refer to: DataFrame in Spark
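To make the "named, typed columns and rows" idea concrete, here is a minimal sketch in plain Python. It is a toy, single-machine model and not Spark's API: `ToyFrame`, `column`, and `where` are hypothetical names made up for this illustration, and nothing here is distributed or optimized.

```python
# Toy, single-machine illustration only. ToyFrame is a hypothetical name,
# not part of Spark; real DataFrames are distributed and query-optimized.
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyFrame:
    schema: tuple  # ((column_name, column_type), ...)
    rows: tuple    # each row is a tuple with one value per column

    def __post_init__(self):
        # Enforce the schema: one value per column, of that column's type.
        for row in self.rows:
            if len(row) != len(self.schema):
                raise ValueError(f"row {row!r} does not match the schema width")
            for value, (name, col_type) in zip(row, self.schema):
                if not isinstance(value, col_type):
                    raise TypeError(f"column {name!r} expects {col_type.__name__}")

    def _index(self, name):
        # Position of a column, looked up by its name.
        return [col_name for col_name, _ in self.schema].index(name)

    def column(self, name):
        # A column holds the values of one variable across all rows.
        i = self._index(name)
        return [row[i] for row in self.rows]

    def where(self, name, predicate):
        # Filtering returns a new frame with the same schema.
        i = self._index(name)
        return ToyFrame(self.schema, tuple(r for r in self.rows if predicate(r[i])))


people = ToyFrame(
    schema=(("name", str), ("age", int), ("active", bool)),
    rows=(("Ana", 34, True), ("Raj", 27, False), ("Li", 41, True)),
)
over_30 = people.where("age", lambda age: age > 30)
print(over_30.column("name"))  # ['Ana', 'Li']
```

The point of the sketch is that the schema travels with the data, so operations can refer to columns by name instead of digging into opaque objects.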
RDD: An RDD (Resilient Distributed Dataset) is an immutable, distributed collection of records (objects). It can be thought of as a large collection of data, or as an array of references to partitioned objects. The data in an RDD is logically partitioned across many servers so that the partitions can be computed on different nodes of the cluster. RDDs are fault tolerant, i.e. lost data is automatically recomputed in the case of failure. The data can be loaded by the user from external sources such as a JSON file, CSV file, text file, or a database via JDBC, and does not need to have any specific structure.
For more details about RDD, please refer to: RDD in Spark
For a detailed comparison of RDD and DataFrame, see: RDD vs DataFrame vs DataSet
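The partitioning and recompute-on-failure behaviour described for RDDs can be sketched in plain Python as well. Again this is only a toy model on one machine, not Spark's API: `ToyRDD`, `from_list`, and `lose_partition` are hypothetical names for this illustration. The assumption is that each dataset remembers how to rebuild any of its partitions from its parent, so a lost partition can simply be computed again.

```python
# Toy, single-machine illustration only. ToyRDD is a hypothetical name,
# not part of Spark; it mimics partitions and recomputation from lineage.
class ToyRDD:
    def __init__(self, compute_partition, num_partitions):
        self._compute = compute_partition  # lineage: how to rebuild partition i
        self.num_partitions = num_partitions
        self._materialized = {}            # partitions currently held in memory

    @classmethod
    def from_list(cls, data, num_partitions):
        data = tuple(data)                       # source records never change
        size = -(-len(data) // num_partitions)   # ceiling division
        return cls(lambda i: data[i * size:(i + 1) * size], num_partitions)

    def map(self, fn):
        # A transformation builds a NEW dataset; the parent is left untouched.
        parent = self
        return ToyRDD(
            lambda i: tuple(fn(x) for x in parent.partition(i)),
            self.num_partitions,
        )

    def partition(self, i):
        # Compute the partition on demand, recomputing it if it was lost.
        if i not in self._materialized:
            self._materialized[i] = self._compute(i)
        return self._materialized[i]

    def lose_partition(self, i):
        # Simulate a node failure that drops one materialized partition.
        self._materialized.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


numbers = ToyRDD.from_list([1, 2, 3, 4, 5, 6], num_partitions=3)
squares = numbers.map(lambda x: x * x)
print(squares.collect())     # [1, 4, 9, 16, 25, 36]

squares.lose_partition(1)    # pretend the node holding partition 1 failed
print(squares.partition(1))  # (9, 16), rebuilt from the parent partition
```

Note that unlike the DataFrame sketch there is no schema here: the records are arbitrary Python objects, which matches the "no specific data structure" point above.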