What is DataFrame?

  • #6464
    DataFlair Team

      DataFrame
      > A DataFrame is similar to a table in a relational database.
      > It extends the RDD model: a DataFrame is a wrapper over an RDD with a schema attached, so it can process data more efficiently than a plain RDD, which has no schema.
      > A DataFrame contains an RDD of Row objects, each representing one record.
      > A DataFrame additionally lets you run SQL / HQL queries against the data.
      > Since a DataFrame is built on top of RDDs, we can still run operations such as map(), flatMap(), and collect() on it.
      > In earlier versions of Spark (up to 1.2), the DataFrame was known as SchemaRDD.
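The points above can be sketched in a short Scala example. This is a minimal sketch using the Spark 2.x SparkSession API (the 1.x versions discussed above used SQLContext instead); the data and column names are made-up examples.

```scala
import org.apache.spark.sql.SparkSession

// Build a local SparkSession for the sketch.
val spark = SparkSession.builder().appName("DataFrameIntro").master("local[*]").getOrCreate()
import spark.implicits._

// Create a DataFrame from a local collection: conceptually, an RDD of
// Row objects plus an associated schema.
val df = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")

df.printSchema()                            // the schema a plain RDD does not have
df.map(row => row.getString(0)).collect()   // RDD-style operations still work
```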

      For detailed information on DataFrame, follow: Spark SQL DataFrame Tutorial – An Introduction to DataFrame

  • #6465
    DataFlair Team

      A DataFrame is a combination of an RDD and a schema, and it carries a logical plan. The logical plan goes through three stages: 1) parsed logical plan, 2) analyzed logical plan, 3) optimized logical plan. The optimized logical plan is then converted into a physical plan that executes on the RDD abstraction.
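The plan stages described above can be inspected directly with explain(true). This is a hedged sketch with made-up data and column names:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._
val df = Seq(("Alice", 30), ("Bob", 17)).toDF("name", "age")

// explain(true) prints the parsed, analyzed, and optimized logical plans,
// followed by the physical plan that actually runs on the RDD abstraction.
df.filter($"age" > 18).select($"name").explain(true)
```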

      We can also say that a DataFrame is a single abstraction for representing structured data. DataFrames were introduced in Spark version 1.3.

      At a high level, we can work with a DataFrame in two ways: use the DataFrame API to process the data, or use SQL style (first register the DataFrame as a temporary table, then query it by that table name).
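Both styles can be sketched side by side; the table name and data here are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._
val df = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")

// 1) DataFrame API style:
df.filter($"age" > 26).select($"name").show()

// 2) SQL style: register the DataFrame under a table name, then query it.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 26").show()
```

Both queries go through the same Catalyst optimizer, so they produce the same physical plan.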

      Every row inside a DataFrame is represented as a Row object.
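A quick sketch of working with those Row objects (sample data is made up):

```scala
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._
val df = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")

// collect() returns Array[Row]; each Row holds one record's fields.
val rows: Array[Row] = df.collect()
rows.foreach { case Row(name: String, age: Int) =>
  println(s"$name is $age")
}

// Fields can also be read positionally:
val first = rows(0)
println(first.getString(0) + " / " + first.getInt(1))
```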

      By default, DataFrames use Tungsten memory management and a compact binary serialization format.

      DataFrames are faster than RDDs because of the Catalyst query optimizer and Tungsten memory management.

      Not every problem can be solved with DataFrames; sometimes we have to fall back to RDDs. Typical cases: custom error handling, parsing raw records, or processing multi-line JSON records, which Spark SQL could not handle directly in the versions discussed here. In such cases we use an RDD, apply some transformations, and then convert it to a DataFrame (using the toDF() or createDataFrame() API).
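The fallback pattern can be sketched as follows: parse and clean malformed records at the RDD level, then convert the result back to a DataFrame. The input lines and parsing rule are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Raw input with one malformed record (hypothetical data).
val raw = spark.sparkContext.parallelize(Seq("Alice,30", "Bob,notANumber"))

// RDD-level error handling: silently drop records that fail to parse.
val parsed = raw.flatMap { line =>
  line.split(",") match {
    case Array(name, age) if age.nonEmpty && age.forall(_.isDigit) => Some((name, age.toInt))
    case _ => None
  }
}

// Convert back to a DataFrame once the data is clean.
val df = parsed.toDF("name", "age")   // or spark.createDataFrame(parsed)
df.show()
```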
