Forums › Apache Spark › What is DataFrame?
This topic has 1 reply, 1 voice, and was last updated 5 years, 6 months ago by DataFlair Team.
September 20, 2018 at 10:24 pm · #6464 · DataFlair Team (Spectator)
DataFrame
> A DataFrame is similar to a table in a relational database.
> It extends the RDD model: a DataFrame is a wrapper over an RDD. A basic RDD has no schema, but a DataFrame has a schema associated with it, so it can process data more efficiently than a basic RDD.
> A DataFrame contains an RDD of Row objects, each representing a record.
> A DataFrame provides the additional ability to run SQL/HQL queries.
> Since a DataFrame is built on top of an RDD, operations such as map(), flatMap(), and collect() are still available through it.
> In earlier versions of Spark (up to 1.2), a DataFrame was known as a SchemaRDD.

For detailed information on DataFrame, follow: Spark SQL DataFrame Tutorial – An Introduction to DataFrame
September 20, 2018 at 10:24 pm · #6465 · DataFlair Team (Spectator)
A DataFrame is a combination of an RDD and a schema. A DataFrame carries a logical plan, which passes through three stages: 1) parsed logical plan, 2) analyzed logical plan, 3) optimized logical plan. The optimized logical plan is then converted into a physical plan that executes on the RDD abstraction.
We can also say that a DataFrame is a data structure, a single abstraction, for representing structured data. DataFrames were introduced in Spark 1.3.
At a high level, we can process the data in two ways: either use the DataFrame API, or use SQL style (in SQL style, we first register the DataFrame as a table and then query it by that table name).
Every row inside a DataFrame is represented as a Row object.
By default, DataFrames use Tungsten memory management and Kryo serialization.
DataFrames are faster than RDDs because of the Catalyst optimizer and Tungsten memory management.
Not every problem can be solved with DataFrames; sometimes we have to fall back to RDDs. Such cases include custom error handling, parsing raw records, and processing multi-line JSON records, which we cannot process with Spark SQL directly. In these cases we use an RDD, apply the required transformations, and then convert it to a DataFrame (using the toDF or createDataFrame API).