Forums › Apache Spark › List the advantages of Parquet files in Apache Spark
This topic has 3 replies, 1 voice, and was last updated 6 years ago by DataFlair Team.
September 20, 2018 at 10:17 pm · #6451 · DataFlair Team (Spectator)
What are the benefits of using parquet file-format in Apache Spark?
September 20, 2018 at 10:17 pm · #6452 · DataFlair Team (Spectator)
Parquet is a columnar format supported by many data processing systems. The benefits of columnar storage are:
1. Columnar storage limits I/O operations.
2. Columnar storage can fetch only the specific columns you need to access.
3. Columnar storage consumes less space.
4. Columnar storage gives better summarized data and allows type-specific encoding.
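The difference between row-oriented and columnar layouts can be sketched in a few lines of plain Python. This is only an illustration of the idea behind points 1 and 2, not Parquet's actual on-disk format:

```python
# Row-oriented layout: a list of records. Fetching one field still
# walks every record, so all of the data is touched.
rows = [
    {"name": "a", "age": 30, "city": "NY"},
    {"name": "b", "age": 25, "city": "LA"},
    {"name": "c", "age": 35, "city": "SF"},
]
ages_from_rows = [r["age"] for r in rows]

# Column-oriented layout: each column stored contiguously. A query that
# only needs "age" reads just that one column and skips the rest.
columns = {
    "name": ["a", "b", "c"],
    "age": [30, 25, 35],
    "city": ["NY", "LA", "SF"],
}
ages_from_columns = columns["age"]

assert ages_from_rows == ages_from_columns == [30, 25, 35]
```

In a real Parquet file the same idea plays out on disk: a scan that projects a subset of columns reads only those column chunks, which is where the I/O savings come from.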
September 20, 2018 at 10:17 pm · #6453 · DataFlair Team (Spectator)
Parquet is an open-source file format for Hadoop. Parquet stores nested data structures in a flat columnar format. Compared to the traditional approach, where data is stored row by row, Parquet is more efficient in both storage and performance.
There are several advantages to columnar formats:
1) Organizing by column allows for better compression, as data is more homogeneous. The space savings are very noticeable at the scale of a Hadoop cluster.
2) I/O will be reduced, as we can efficiently scan only a subset of the columns while reading the data. Better compression also reduces the bandwidth required to read the input.
3) As we store data of the same type in each column, we can use encodings better suited to modern processors' pipelines by making instruction branching more predictable.
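The compression point can be made concrete with run-length encoding, one of the encodings Parquet applies to homogeneous column data. The sketch below uses only the Python standard library and is a simplified stand-in for Parquet's real encoder:

```python
from itertools import groupby

def rle_encode(values):
    """Run-length encode a column: (value, count) pairs.
    Homogeneous, repetitive column data compresses very well this way."""
    return [(v, len(list(g))) for v, g in groupby(values)]

def rle_decode(pairs):
    """Expand (value, count) pairs back into the original column."""
    return [v for v, n in pairs for _ in range(n)]

# A country column with long runs of repeated values, as is common
# in sorted or partitioned analytics data.
column = ["US"] * 1000 + ["UK"] * 500 + ["DE"] * 250

encoded = rle_encode(column)
assert encoded == [("US", 1000), ("UK", 500), ("DE", 250)]  # 1750 values -> 3 pairs
assert rle_decode(encoded) == column  # lossless round trip
```

Because every value in a column shares one type, the format can pick the encoding per column (run-length, dictionary, bit-packing), which a row-oriented layout cannot do as effectively.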
September 20, 2018 at 10:17 pm · #6454 · DataFlair Team (Spectator)
Apache Parquet is an open-source column-based storage format for Hadoop. Parquet is widely used in the Hadoop world for analytics workloads by many query engines such as Hive, Impala, and Spark SQL. Parquet is efficient and performant in both storage and processing. If your dataset has many columns, and your use case typically involves working with a subset of those columns rather than entire records, Parquet is optimized for that kind of work.
Parquet offers higher execution speed than other standard file formats like Avro and JSON, and it also consumes less disk space than Avro and JSON.
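The disk-space claim can be illustrated with a small stdlib-only comparison: the same integer column serialized as JSON records versus a packed fixed-width binary column (a rough stand-in for the type-specific encoding a columnar format can apply). This is an illustration of why text formats like JSON take more space, not a benchmark of Parquet itself:

```python
import json
import struct

# 10,000 integer readings, stored two ways.
values = list(range(10_000))

# JSON: every record repeats the field name and writes digits as text.
json_bytes = json.dumps([{"id": v} for v in values]).encode("utf-8")

# Binary column: because every value in the column is an int, we can
# pack it at a fixed 4 bytes per value with no repeated field names.
binary_column = struct.pack(f"<{len(values)}i", *values)

assert len(binary_column) == 4 * len(values)
assert len(binary_column) < len(json_bytes)  # the columnar encoding is much smaller
```

Parquet goes further than this sketch (dictionary encoding, run-length encoding, and general-purpose compression per column chunk), which is why it typically beats both JSON and Avro on disk footprint for wide analytical tables.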