What are the advantages of DataFrame in Apache Spark?

    • #6386
      DataFlair Team
      Moderator

      What are the features of DataFrame in Spark?
      List the characteristics of DataFrame in Apache Spark.

    • #6387
      DataFlair Team
      Moderator

      Introduction
      A DataFrame is a distributed collection of data organized into named columns. It is conceptually similar to a table in a relational database.
      We can construct DataFrames from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs.

      Like RDDs, DataFrames are evaluated lazily. In other words, computation happens only when an action (e.g. displaying a result or saving output) is required.
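The idea of lazy evaluation can be illustrated without a cluster. The following is only a toy sketch in plain Python, not Spark's API: the class `LazyRows` and its methods are hypothetical names chosen to mirror the transformation/action split described above. Transformations just record a step in a plan; the work runs only when the action `collect()` is called.

```python
class LazyRows:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark's API)."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # recorded transformations, not yet applied

    def filter(self, predicate):
        # Transformation: extend the plan, do no work yet.
        return LazyRows(self._rows, self._plan + (("filter", predicate),))

    def select(self, *columns):
        # Transformation: extend the plan, do no work yet.
        return LazyRows(self._rows, self._plan + (("select", columns),))

    def collect(self):
        # Action: apply every recorded step now and return the result.
        rows = list(self._rows)
        for kind, arg in self._plan:
            if kind == "filter":
                rows = [r for r in rows if arg(r)]
            else:
                rows = [{c: r[c] for c in arg} for r in rows]
        return rows


calls = []  # records each time the predicate actually runs


def is_adult(row):
    calls.append(row["name"])
    return row["age"] >= 18


people = LazyRows([{"name": "Ann", "age": 34}, {"name": "Bob", "age": 12}])
adults = people.filter(is_adult).select("name")  # nothing is computed here
calls_before_action = list(calls)                # predicate has not run yet
result = adults.collect()                        # the action triggers the work
```

Spark's real engine goes further and optimizes the recorded plan before running it, which this toy does not attempt.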

      Out of the box, DataFrames support reading data from the most popular formats, including JSON files, Parquet files, and Hive tables. They can also read from local file systems, distributed file systems (HDFS), cloud storage (S3), and external relational database systems through JDBC. In addition, through Spark SQL’s external data sources API, DataFrames can be extended to support any third-party data format or source. Extensions for Avro, CSV, Elasticsearch, and Cassandra began as third-party packages; CSV support has been built into Spark since version 2.0.

      There is much more to know about DataFrames. For details, see: Spark SQL DataFrame Tutorial – An Introduction to DataFrame

    • #6388
      DataFlair Team
      Moderator

      pratapajay (Member)

      DataFrame = framing the data (we frame it like a relational table for better performance).
      A DataFrame is a distributed collection of data organised into rows and columns. Conceptually, it is like a table in a relational database. We can create a DataFrame from different kinds of data: Hive tables, JSON, CSV, other structured files, relational databases, and RDDs (provided we can map the data to a schema).
      We can register a DataFrame as a temporary table/view and run SQL queries against it. Because a DataFrame carries its schema together with the data, Spark can optimise these queries and return results faster.
      Like an RDD, it is evaluated lazily, which allows resources to be used more efficiently.

      For a complete introduction to DataFrames, follow the link: DataFrame
