Forums › Apache Spark › List the advantages of Parquet files in Apache Spark
This topic has 3 replies, 1 voice, and was last updated 6 years ago by DataFlair Team.
September 20, 2018 at 10:17 pm · #6451 · DataFlair Team (Spectator)
What are the benefits of using parquet file-format in Apache Spark?
September 20, 2018 at 10:17 pm · #6452 · DataFlair Team (Spectator)
Parquet is a columnar format supported by many data processing systems. The benefits of columnar storage are:
1. Columnar storage limits I/O operations.
2. Columnar storage can fetch only the specific columns you need to access.
3. Columnar storage consumes less space.
4. Columnar storage gives better summarized data and allows type-specific encoding.
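The difference between row-oriented and columnar layouts can be sketched in a few lines of plain Python. This is only an illustration of the idea behind points 1 and 2, not Parquet's actual on-disk format:

```python
# Row-oriented layout: a list of records. Fetching one field still
# walks every record, so all of the data is touched.
rows = [
    {"name": "a", "age": 30, "city": "NY"},
    {"name": "b", "age": 25, "city": "LA"},
    {"name": "c", "age": 35, "city": "SF"},
]
ages_from_rows = [r["age"] for r in rows]

# Column-oriented layout: each column stored contiguously. A query that
# only needs "age" reads just that one column and skips the rest.
columns = {
    "name": ["a", "b", "c"],
    "age": [30, 25, 35],
    "city": ["NY", "LA", "SF"],
}
ages_from_columns = columns["age"]

assert ages_from_rows == ages_from_columns == [30, 25, 35]
```

In a real Parquet file the same idea plays out on disk: a scan that projects a subset of columns reads only those column chunks, which is where the I/O savings come from.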
September 20, 2018 at 10:17 pm · #6453 · DataFlair Team (Spectator)
Parquet is an open-source file format for Hadoop. Parquet stores nested data structures in a flat columnar format. Compared to the traditional approach, where data is stored row by row, Parquet is more efficient in both storage and performance.
There are several advantages to columnar formats:
1) Organizing by column allows for better compression, as data is more homogeneous. The space savings are very noticeable at the scale of a Hadoop cluster.
2) I/O will be reduced, as we can efficiently scan only a subset of the columns while reading the data. Better compression also reduces the bandwidth required to read the input.
3) As we store data of the same type in each column, we can use encodings better suited to modern processors' pipelines by making instruction branching more predictable.
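The compression point can be made concrete with run-length encoding, one of the encodings Parquet applies to homogeneous column data. The sketch below uses only the Python standard library and is a simplified stand-in for Parquet's real encoder:

```python
from itertools import groupby

def rle_encode(values):
    """Run-length encode a column: (value, count) pairs.
    Homogeneous, repetitive column data compresses very well this way."""
    return [(v, len(list(g))) for v, g in groupby(values)]

def rle_decode(pairs):
    """Expand (value, count) pairs back into the original column."""
    return [v for v, n in pairs for _ in range(n)]

# A country column with long runs of repeated values, as is common
# in sorted or partitioned analytics data.
column = ["US"] * 1000 + ["UK"] * 500 + ["DE"] * 250

encoded = rle_encode(column)
assert encoded == [("US", 1000), ("UK", 500), ("DE", 250)]  # 1750 values -> 3 pairs
assert rle_decode(encoded) == column  # lossless round trip
```

Because every value in a column shares one type, the format can pick the encoding per column (run-length, dictionary, bit-packing), which a row-oriented layout cannot do as effectively.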
September 20, 2018 at 10:17 pm · #6454 · DataFlair Team (Spectator)
Apache Parquet is an open-source column-based storage format for Hadoop. Parquet is widely used in the Hadoop world for analytics workloads by many query engines such as Hive, Impala, and Spark SQL. Parquet is efficient and performant in both storage and processing. If your dataset has many columns, and your use case typically involves working with a subset of those columns rather than entire records, Parquet is optimized for that kind of work.
Parquet offers higher execution speed than other standard file formats like Avro and JSON, and it also consumes less disk space than Avro and JSON.
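The disk-space claim can be illustrated with a small stdlib-only comparison: the same integer column serialized as JSON records versus a packed fixed-width binary column (a rough stand-in for the type-specific encoding a columnar format can apply). This is an illustration of why text formats like JSON take more space, not a benchmark of Parquet itself:

```python
import json
import struct

# 10,000 integer readings, stored two ways.
values = list(range(10_000))

# JSON: every record repeats the field name and writes digits as text.
json_bytes = json.dumps([{"id": v} for v in values]).encode("utf-8")

# Binary column: because every value in the column is an int, we can
# pack it at a fixed 4 bytes per value with no repeated field names.
binary_column = struct.pack(f"<{len(values)}i", *values)

assert len(binary_column) == 4 * len(values)
assert len(binary_column) < len(json_bytes)  # the columnar encoding is much smaller
```

Parquet goes further than this sketch (dictionary encoding, run-length encoding, and general-purpose compression per column chunk), which is why it typically beats both JSON and Avro on disk footprint for wide analytical tables.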