- 1. Objective
- 2. Comparison between Apache Hive vs Spark SQL
- 2.1. Introduction
- 2.2. Initial release
- 2.3. Current release
- 2.4. License
- 2.5. Implementation language
- 2.6. Primary database model
- 2.7. Additional database models
- 2.8. Developer
- 2.9. Server operating systems
- 2.10. Data Types
- 2.11. Support of SQL
- 2.12. APIs and other access methods
- 2.13. Programming languages
- 2.14. Partitioning methods
- 2.15. Replication methods
- 2.16. Concurrency
- 2.17. Durability
- 2.18. User concepts
- 2.19. Usage
- 2.20. Limitations
- 3. Conclusion
1. Objective
While Apache Hive and Spark SQL both perform the same task, retrieving data, each approaches it in a different way. Hive is designed as an interface, or convenience layer, for querying data stored in HDFS, whereas Spark SQL is a module for structured data processing inside Spark. So we will compare Apache Hive and Spark SQL on the basis of their features.
This blog focuses on the differences between Spark SQL and Hive. We will also cover the features of each individually, and look at the usage areas of both. In addition, both Hive and Spark SQL come with several limitations. We will discuss all of this in detail to understand the difference between Hive and Spark SQL.
2. Comparison between Apache Hive vs Spark SQL
First, we will give a brief introduction to each. Afterwards, we will compare the two on the basis of various features.
2.1. Introduction
Apache Hive is an open source data warehouse system built on top of Hadoop. It helps with analyzing and querying large datasets stored in Hadoop files. Without it, we would have to write complex Map-Reduce jobs; with Hive, we merely need to submit SQL queries. Hive is mainly targeted towards users who are comfortable with SQL.
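For example, an aggregation that would otherwise require a hand-written Map-Reduce job can be expressed in Hive as a single HiveQL query. This is a minimal sketch; the `page_views` table and its columns are made up for illustration:

```sql
-- Hive compiles this declarative query into Map-Reduce jobs
-- behind the scenes; no Java Map-Reduce code is needed.
SELECT page_url, COUNT(*) AS view_count
FROM page_views
GROUP BY page_url
ORDER BY view_count DESC
LIMIT 10;
```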
In Spark, we use Spark SQL for structured data processing. Through SQL, Spark gains more information about the structure of the data and the computations being performed, and it can use this extra information to apply additional optimizations. Interaction with Spark SQL is possible in several ways, such as through the DataFrame and Dataset APIs.
2.2. Initial release
Apache Hive was first released in 2012.
Spark SQL, on the other hand, was first released in 2014.
2.3. Current release
Hive: version 2.3.1, released on 24 October 2017.
Spark SQL: version 2.1.2, released on 09 October 2017.
2.4. License
Hive is open source, released under Apache License Version 2.
Spark SQL is likewise open source, released under Apache License Version 2.
2.5. Implementation language
Apache Hive is implemented in Java.
Spark SQL is implemented in Scala, and can be used from Java, Python, and R as well.
2.6. Primary database model
Hive's primary database model is a relational DBMS.
Spark SQL's primary database model is also a relational DBMS.
2.7. Additional database models
Hive supports an additional database model: a key-value store.
Like Hive, Spark SQL also supports a key-value store as an additional database model.
2.8. Developer
Hive was originally developed by Facebook, but was later donated to the Apache Software Foundation, which has maintained it since.
Spark SQL is developed by the Apache Software Foundation.
2.9. Server operating systems
Hive supports all operating systems with a Java VM.
Spark SQL supports several operating systems, for example Linux, OS X, and Windows.
2.10. Data Types
Hive has predefined data types, for example float or date.
Like Hive, Spark SQL also has predefined data types, for example float or date.
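As a sketch, here is a table declaration using some of these predefined types; the table and column names are hypothetical, but the same DDL is accepted by both Hive and Spark SQL:

```sql
CREATE TABLE employees (
  id        INT,      -- integer type
  name      STRING,   -- variable-length text
  salary    FLOAT,    -- predefined floating-point type
  hire_date DATE      -- predefined date type
);
```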
2.11. Support of SQL
Hive offers SQL-like DML and DDL statements.
Like Apache Hive, Spark SQL also offers SQL-like DML and DDL statements.
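A minimal sketch of this shared DML/DDL surface, using a made-up `sales` table (the `INSERT ... VALUES` form requires Hive 0.14 or later):

```sql
-- DDL: define a table.
CREATE TABLE IF NOT EXISTS sales (item STRING, amount DOUBLE);

-- DML: insert and query data.
INSERT INTO sales VALUES ('book', 12.5), ('pen', 1.2);
SELECT item, SUM(amount) AS total FROM sales GROUP BY item;
```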
2.12. APIs and other access methods
Apache Hive supports JDBC, ODBC, and Thrift.
Spark SQL supports only JDBC and ODBC.
2.13. Programming languages
We can use several programming languages with Hive, for example C++, Java, PHP, and Python. Spark SQL, as noted above, can be used from Scala, Java, Python, and R.
2.14. Partitioning methods
Hive uses a data sharding method for storing data on different nodes.
Spark SQL relies on Spark Core, which partitions data across the nodes of a cluster.
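In Hive, this distribution of data surfaces as table partitions: each partition is stored as its own directory in HDFS, and queries that filter on the partition column skip the partitions they do not need. A sketch with a hypothetical `logs` table:

```sql
-- Partition a logs table by date; each date gets its own
-- HDFS directory, which can live on different nodes.
CREATE TABLE logs (msg STRING)
PARTITIONED BY (log_date DATE);

-- Filtering on the partition column prunes unneeded partitions.
SELECT msg FROM logs WHERE log_date = '2017-10-24';
```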
2.15. Replication methods
In Hive, there is a selectable replication factor for redundantly storing data on multiple nodes.
Spark SQL, however, has no replication factor of its own for redundantly storing data on multiple nodes.
2.16. Concurrency
Hive supports concurrent manipulation of data.
Spark SQL also supports concurrent manipulation of data.
Let’s look at a few more differences between Apache Hive and Spark SQL.
2.17. Durability
Hive supports making data persistent.
Like Hive, Spark SQL also supports making data persistent.
2.18. User concepts
In Hive, there are access rights for users, groups, and roles.
In Spark SQL, there are no access rights for users.
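In Hive, these rights are expressed with SQL-style GRANT statements, assuming SQL-standard-based authorization is enabled in the Hive configuration. A sketch with hypothetical role, table, and user names:

```sql
-- Create a role, grant it SELECT on a table,
-- then grant the role to a user.
CREATE ROLE analysts;
GRANT SELECT ON TABLE page_views TO ROLE analysts;
GRANT ROLE analysts TO USER alice;
```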
2.19. Usage
Usage of Apache Hive:
- Schema flexibility and evolution.
- Tables in Apache Hive can be partitioned and bucketed.
- As JDBC/ODBC drivers are available in Hive, we can use them.
Usage of Spark SQL:
- It performs SQL queries.
- Through Spark SQL, we can read data from an existing Hive installation.
- When Spark SQL is run from another programming language, we get the result as a Dataset/DataFrame.
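The partition-and-bucket point above can be sketched in HiveQL; the table and column names are illustrative, and the bucket count of 32 is an arbitrary example:

```sql
-- Combine partitioning (one directory per date) with bucketing
-- (rows hashed by user_id into a fixed number of files).
CREATE TABLE user_events (user_id BIGINT, event STRING)
PARTITIONED BY (event_date DATE)
CLUSTERED BY (user_id) INTO 32 BUCKETS;
```

Bucketing fixes the number of files per partition, which helps with sampling and with joins on the bucketed column.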
2.20. Limitations
Limitations of Apache Hive:
- It does not offer real-time queries or row-level updates.
- It does not provide the low latency required for interactive data browsing.
- Hive does not support online transaction processing.
- In Apache Hive, latency for queries is generally very high.
Limitations of Spark SQL:
- It does not support the union type.
- It raises no error for oversized varchar values.
- It does not support transactional tables.
- It has no support for the Char type.
- It does not support time-stamps in Avro tables.
3. Conclusion
Hence, we cannot say that Spark SQL is a replacement for Hive, nor the other way around. As we have seen, Spark SQL is more Spark-API and developer friendly, and SQL makes programming in Spark easier, while Hive's ability to switch execution engines makes it efficient for querying huge data sets. So which one to use depends entirely on our goals. Beyond that, we have discussed the usage and limitations of each above, along with a complete comparison of Apache Hive vs Spark SQL. Hopefully, this blog answers all the questions that may come to mind regarding Apache Hive vs Spark SQL.