Apache Hive vs Spark SQL: Feature-wise Comparison


1. Objective

While Apache Hive and Spark SQL perform the same task, retrieving data, each does it in a different way. Hive is designed as an interface for querying data stored in HDFS, whereas Spark SQL is Spark's module for structured data processing. So we will discuss Apache Hive vs Spark SQL on the basis of their features.

This blog covers the differences between Spark SQL and Hive. We will also cover the features of each individually and look at where each is used. Both Hive and Spark SQL have several limitations as well. We will discuss all of these in detail to understand the difference between Hive and Spark SQL.


2. Comparison between Apache Hive vs Spark SQL

First, we will give a brief introduction to each. Afterwards, we will compare the two on the basis of various features.

2.1. Introduction

Apache Hive:

Apache Hive is an open source data warehouse system built on top of Hadoop. It helps with analyzing and querying large datasets stored in Hadoop files. Previously, we had to write complex MapReduce jobs for this; with Hive, we only need to submit SQL queries. Hive is mainly targeted at users who are comfortable with SQL.
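To make the contrast between hand-written MapReduce jobs and SQL concrete, here is a minimal sketch in Python. It is only an illustration: the sample data is made up, and Python's built-in sqlite3 module stands in for Hive so that the example runs without a Hadoop cluster. The SQL statement is the kind of query we would submit to Hive.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (page, visitor) records.
visits = [("home", "ann"), ("docs", "bob"), ("home", "cid"), ("home", "ann")]


# 1) MapReduce style: emit (key, 1) pairs, then sum the values per key.
def map_phase(records):
    for page, _visitor in records:
        yield page, 1


def reduce_phase(pairs):
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


mapreduce_counts = reduce_phase(map_phase(visits))

# 2) The same aggregation as a single declarative SQL statement.
#    sqlite3 is an in-memory stand-in here, not Hive itself.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (page TEXT, visitor TEXT)")
conn.executemany("INSERT INTO visits VALUES (?, ?)", visits)
sql_counts = dict(conn.execute("SELECT page, COUNT(*) FROM visits GROUP BY page"))
conn.close()

# Both approaches produce the same per-page counts.
print(mapreduce_counts == sql_counts)
```

With the SQL version, the engine decides how to plan and distribute the work, which is the convenience Hive offers over writing the map and reduce steps by hand.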

Spark SQL:

In Spark, we use Spark SQL for structured data processing. Its interfaces give Spark more information about the structure of both the data and the computation being performed, and Spark uses this extra information to perform additional optimizations. We can interact with Spark SQL in several ways, such as SQL queries, the DataFrame API, and the Dataset API.
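As a rough sketch of how the SQL and DataFrame interfaces fit together, the function below builds a DataFrame, registers it as a temporary view, and queries it with SQL. It assumes that `spark` is an already-created `pyspark.sql.SparkSession`; the function name, table name, and column names are hypothetical and chosen only for illustration.

```python
def department_counts(spark, rows):
    """Count employees per department with Spark SQL.

    Assumptions: `spark` is an existing pyspark.sql.SparkSession and
    `rows` is a list of (name, dept) tuples. All names are illustrative.
    """
    # DataFrame API: build a DataFrame with named columns.
    df = spark.createDataFrame(rows, ["name", "dept"])

    # Register the DataFrame so that SQL queries can refer to it by name.
    df.createOrReplaceTempView("employees")

    # SQL interface: the result of spark.sql() is again a DataFrame.
    result = spark.sql(
        "SELECT dept, COUNT(*) AS n FROM employees GROUP BY dept ORDER BY dept"
    )

    # Bring the (small) result back to the driver as plain tuples.
    return [(row["dept"], row["n"]) for row in result.collect()]
```

With PySpark installed, a session can typically be obtained with `SparkSession.builder.getOrCreate()` and passed in as `spark`.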

2.2. Initial release

Apache Hive:

Apache Hive was first released in 2012.

Spark SQL:

Spark SQL was first released in 2014.

2.3. Current release

Apache Hive:

At the time of writing, the current release is version 2.3.1, released on 24 October 2017.

Spark SQL:

At the time of writing, the current release is version 2.1.2, released on 09 October 2017.

2.4. License

Apache Hive:

It is open source, under the Apache License, Version 2.0.

Spark SQL:

It is also open source, under the Apache License, Version 2.0.

2.5. Implementation language

Apache Hive:

Apache Hive is implemented in Java.

Spark SQL:

Spark SQL is implemented mainly in Scala, and it offers APIs in Scala, Java, Python, and R.

2.6. Primary database model

Apache Hive:  

Primarily, its database model is Relational DBMS.

Spark SQL:

Primarily, its database model is also Relational DBMS.

2.7. Additional database models

Apache Hive:

It supports one additional database model, the key-value store.

Spark SQL:  

Like Hive, it also supports the key-value store as an additional database model.

2.8. Developer

Apache Hive:

Hive was originally developed by Facebook and later donated to the Apache Software Foundation, which has maintained it since.

Spark SQL:

Spark originated at UC Berkeley's AMPLab and was later donated to the Apache Software Foundation, which develops Spark SQL as part of Apache Spark.

2.9. Server operating systems

Apache Hive:

It supports all operating systems with a Java VM.

Spark SQL:

It supports several operating systems, for example Linux, OS X, and Windows.

2.10. Data Types

Apache Hive:

It has predefined data types, for example float or date.

Spark SQL:

Like Hive, Spark SQL also has predefined data types, for example float or date.

2.11. Support of SQL

Apache Hive:

It possesses SQL-like DML and DDL statements.

Spark SQL:

Like Apache Hive, it also possesses SQL-like DML and DDL statements.

2.12. APIs and other access methods

Apache Hive:

Apache Hive supports JDBC, ODBC, and Thrift.

Spark SQL:

Spark SQL supports only JDBC and ODBC.

2.13. Programming languages

Apache Hive:

We can use several programming languages with Hive, for example C++, Java, PHP, and Python.

Spark SQL:

We can use several programming languages with Spark SQL, for example Java, Python, R, and Scala. This is one difference between Spark SQL and Hive.

2.14. Partitioning methods

Apache Hive:

It uses sharding to store data on different nodes.

Spark SQL:  

It relies on Spark Core to distribute data across different nodes.

2.15. Replication methods

Apache Hive:

There is a selectable replication factor for redundantly storing data on multiple nodes.

Spark SQL:

Spark SQL has no replication factor for redundantly storing data on multiple nodes.

2.16. Concurrency

Apache Hive:

Hive supports concurrent manipulation of data.

Spark SQL:

Spark SQL also supports concurrent manipulation of data.

Let's see a few more differences between Apache Hive and Spark SQL.

2.17. Durability

Apache Hive:

It supports making data persistent.

Spark SQL:

Like Hive, Spark SQL also supports making data persistent.

2.18. User concepts

Apache Hive:

It provides access rights for users, groups, and roles.

Spark SQL:

There are no access rights for users.

2.19. Usage

Apache Hive:

  • It offers schema flexibility and evolution.
  • We can partition and bucket tables in Apache Hive.
  • As JDBC/ODBC drivers are available for Hive, we can connect to it through them.
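To illustrate what partitioning and bucketing mean for where a row ends up, here is a minimal sketch. Partitioned tables store each partition in its own directory named `column=value`, and bucketing assigns a row to one of a fixed number of buckets by hashing a column. The table name, column names, and the use of CRC32 as the hash are assumptions made for this example; Hive uses its own hash function internally.

```python
import zlib


def partition_dir(table, column, value):
    """Build a Hive-style partition directory name, e.g. sales/country=US."""
    return f"{table}/{column}={value}"


def bucket_index(key, num_buckets):
    """Map a key to a bucket number in [0, num_buckets).

    CRC32 is only a stable stand-in for illustration; it is not the
    hash function Hive actually uses.
    """
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def placement(table, row, partition_col, bucket_col, num_buckets):
    """Return (partition directory, bucket number) for one row."""
    directory = partition_dir(table, partition_col, row[partition_col])
    bucket = bucket_index(row[bucket_col], num_buckets)
    return directory, bucket


# Hypothetical row from a hypothetical "sales" table.
row = {"country": "US", "customer_id": 1042, "amount": 19.99}
print(placement("sales", row, "country", "customer_id", num_buckets=4))
```

Because the partition is part of the path, a query that filters on the partition column only needs to read the matching directories, while bucketing keeps rows with the same key in the same bucket.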

Spark SQL:

  • It executes SQL queries.
  • Through Spark SQL, it is possible to read data from an existing Hive installation.
  • When we run SQL from within another programming language, we get the result as a Dataset/DataFrame.

2.20. Limitations

Apache Hive:

  • It does not offer real-time queries or row-level updates.
  • It aims for acceptable, but not optimal, latency for interactive data browsing.
  • Hive does not support online transaction processing.
  • In Apache Hive, query latency is generally very high.

Spark SQL:

  • It does not support the union type.
  • It does not raise an error when a value exceeds the declared varchar length.
  • It does not support transactional tables.
  • It does not support the char type.
  • It does not support timestamps in Avro tables.

3. Conclusion

Hence, Spark SQL is not a replacement for Hive, nor is Hive a replacement for Spark SQL. As we have seen, Spark SQL is more Spark API and developer friendly, and SQL makes programming in Spark easier. Hive's ability to switch execution engines, meanwhile, makes it efficient for querying huge data sets. Which one to use depends entirely on our goals. We have also discussed the usage and limitations of each above. Hopefully, this blog answers the questions you had about Apache Hive vs Spark SQL.