Forums › Apache Spark › What is DataFrame?
This topic has 1 reply, 1 voice, and was last updated 5 years, 6 months ago by DataFlair Team.
September 20, 2018 at 10:24 pm · #6464 · DataFlair Team (Spectator)
DataFrame
> A DataFrame is similar to a table in a relational database.
> It extends the RDD model: a DataFrame is a wrapper over an RDD. A basic RDD has no schema, but a DataFrame has a schema associated with it, so it can process data more efficiently than a basic RDD.
> A DataFrame contains an RDD of Row objects, each representing a record.
> A DataFrame provides the additional ability to run SQL/HQL queries.
> Since a DataFrame is built on top of an RDD, operations such as map(), flatMap(), and collect() are still available through it.
> In earlier versions of Spark (up to 1.2), a DataFrame was known as a SchemaRDD.

For detailed information on DataFrame, follow: Spark SQL DataFrame Tutorial – An Introduction to DataFrame
September 20, 2018 at 10:24 pm · #6465 · DataFlair Team (Spectator)
A DataFrame is a combination of an RDD and a schema. A DataFrame carries a logical plan, which passes through three stages: 1) parsed logical plan, 2) analyzed logical plan, 3) optimized logical plan. The optimized logical plan is then converted into a physical plan that executes on the RDD abstraction.
We can also say that a DataFrame is a data structure, a single abstraction, for representing structured data. DataFrames were introduced in Spark 1.3.
At a high level, we can process the data in two ways: either use the DataFrame API, or use SQL style (in SQL style, we first register the DataFrame as a table and then query it by that table name).
Every row inside a DataFrame is represented as a Row object.
By default, DataFrames use Tungsten memory management and Kryo serialization.
DataFrames are faster than RDDs because of the Catalyst optimizer and Tungsten memory management.
Not every problem can be solved with DataFrames; sometimes we have to fall back to RDDs. Such cases include custom error handling, parsing raw records, and processing multi-line JSON records, which we cannot process with Spark SQL directly. In these cases we use an RDD, apply the required transformations, and then convert it to a DataFrame (using the toDF or createDataFrame API).