RDDs vs DataFrames

Viewing 1 reply thread
  • Author
    Posts
    • #4832
      DataFlair Team
      Moderator

      What is the difference between RDDs and DataFrames?

    • #4841
      DataFlair Team
      Moderator

      DataFrame: A DataFrame stores data in tabular form. It is a distributed collection of data organized into rows and named columns, conceptually equivalent to a table in a relational database but with richer optimization. It is a data abstraction and a domain-specific language (DSL) applicable to structured and semi-structured data. Much like a data frame in R, it has a two-dimensional, matrix-like structure whose columns may be of different types (for example numeric, logical, or character): each column contains the values of one variable, and each row contains one value for each column. In that sense it combines features of lists and matrices.
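
      To make the "named columns of different types" idea concrete, here is a minimal sketch in plain Python. It is not Spark's API: the class ToyDataFrame and its methods (schema, where, select, rows) are hypothetical names made up for this illustration, and nothing here is distributed or optimized.

```python
from typing import Any, Callable, Dict, List


class ToyDataFrame:
    """Hypothetical toy table: equal-length columns addressed by name."""

    def __init__(self, columns: Dict[str, List[Any]]):
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same number of rows")
        # Copy the input so later changes to it do not leak into the table.
        self._columns = {name: list(values) for name, values in columns.items()}

    @property
    def schema(self) -> Dict[str, str]:
        # Column name -> type name, inferred from the first value of each column.
        return {
            name: type(values[0]).__name__ if values else "unknown"
            for name, values in self._columns.items()
        }

    def where(self, column: str, predicate: Callable[[Any], bool]) -> "ToyDataFrame":
        # Keep only the rows whose value in `column` satisfies the predicate.
        keep = [i for i, value in enumerate(self._columns[column]) if predicate(value)]
        return ToyDataFrame(
            {name: [values[i] for i in keep] for name, values in self._columns.items()}
        )

    def select(self, *names: str) -> "ToyDataFrame":
        # Project onto the requested columns, referring to them by name.
        return ToyDataFrame({name: self._columns[name] for name in names})

    def rows(self) -> List[Dict[str, Any]]:
        # Row view: one dict per row, with one value for each column.
        names = list(self._columns)
        return [dict(zip(names, row)) for row in zip(*self._columns.values())]


people = ToyDataFrame(
    {"name": ["Ann", "Bob", "Cid"], "age": [34, 19, 52], "member": [True, False, True]}
)
adults = people.where("age", lambda age: age >= 21).select("name", "age")
print(people.schema)
print(adults.rows())
```

      Because the operations refer to columns by name and the table knows each column's type, an engine has enough information to plan the work before running it, which is the kind of optimization a DataFrame can offer.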

      For more details about DataFrame, please refer to: DataFrame in Spark

      RDD: An RDD (Resilient Distributed Dataset) represents a set of records as an immutable collection of objects that is processed with distributed computing. It can be thought of as a large collection of data, or as an array of references to partitioned objects. Each dataset in an RDD is logically partitioned across many servers so that the partitions can be computed on different nodes of the cluster. RDDs are fault tolerant, i.e. lost data is recovered by recomputing it in the case of failure. The dataset can be loaded from an external source by the user, for example a JSON file, a CSV file, a text file, or a database via JDBC, and it does not need any specific data structure.
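
      The partitioning and recompute-on-failure behaviour described above can be sketched in plain Python. This is only a toy model, not Spark's API: the class ToyRDD and its methods (from_list, map, partition, lose_partition, collect) are hypothetical names for illustration, and everything runs in a single process.

```python
from typing import Callable, Dict, Iterable, List


class ToyRDD:
    """Hypothetical toy RDD: partitions plus the recipe (lineage) to rebuild them."""

    def __init__(self, compute_partition: Callable[[int], List[int]], num_partitions: int):
        self._compute = compute_partition   # lineage: how to (re)build partition i
        self.num_partitions = num_partitions
        self._cache: Dict[int, List[int]] = {}  # partitions currently held "on a node"

    @classmethod
    def from_list(cls, data: Iterable[int], num_partitions: int) -> "ToyRDD":
        frozen = tuple(data)  # the source data never changes
        # Partition i takes every num_partitions-th record, starting at index i.
        return cls(lambda i: list(frozen[i::num_partitions]), num_partitions)

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        # A transformation returns a new ToyRDD; the parent is left untouched.
        parent = self
        return ToyRDD(lambda i: [fn(x) for x in parent.partition(i)], self.num_partitions)

    def partition(self, i: int) -> List[int]:
        # Build the partition from its lineage if it is not materialized.
        if i not in self._cache:
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i: int) -> None:
        # Simulate a node failure that drops one materialized partition.
        self._cache.pop(i, None)

    def collect(self) -> List[int]:
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


numbers = ToyRDD.from_list(range(1, 7), num_partitions=3)
squares = numbers.map(lambda x: x * x)

before = squares.collect()
squares.lose_partition(1)   # partition 1 is gone ...
after = squares.collect()   # ... and is rebuilt from its lineage on demand
print(before == after)
```

      The point of the sketch is that nothing is copied for safety: a lost partition is simply recomputed from the steps that produced it, which is why the collection can stay immutable.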

      For more details about RDD, please refer to: RDD in Spark

      For a detailed comparison of RDD and DataFrame, see: RDD vs DataFrame vs DataSet
