
Apache Hive Tutorial – A Comprehensive Guide for Beginners


Apache Hive is used for querying and analyzing large datasets stored in Hadoop. However, there is much more to Hive than that, and we will discuss all of it in this Apache Hive tutorial, starting with what Apache Hive actually is.

In this Apache Hive tutorial, we will first learn Hive's history and the reasons to use Hive. We will then cover the Hive architecture and its components to understand it well.

Afterwards, we will also cover its limitations, how Hive works, Hive vs Spark SQL, and Pig vs Hive vs Hadoop MapReduce.

So, let’s start Hive Tutorial.

What is Apache Hive?

Apache Hive is an open source data warehouse system built on top of Hadoop. We use it primarily for querying and analyzing large datasets stored in Hadoop files. Moreover, using Hive we can process both structured and semi-structured data in Hadoop.

In other words, it is a data warehouse infrastructure that facilitates querying and managing large datasets residing in distributed storage. It offers a way to query the data using a SQL-like query language called HiveQL (Hive Query Language).

Internally, a compiler translates HiveQL statements into MapReduce jobs, which are then submitted to the Hadoop framework for execution.
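For instance, a simple HiveQL query looks just like SQL. The sketch below assumes a hypothetical `page_views` table; behind the scenes, Hive would compile this aggregation into a MapReduce job:

```sql
-- Hypothetical table; Hive translates this GROUP BY into a MapReduce job.
SELECT user_id, COUNT(*) AS views
FROM page_views
WHERE view_date = '2017-11-18'
GROUP BY user_id;
```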

a. Hive is not

A few misconceptions commonly arise about Hive, so let's clarify them: Hive is not a relational database, it is not designed for online transaction processing (OLTP), and it is not meant for real-time queries or row-level updates.

Why Hive?


In this section of the Hive tutorial, we discuss why we should use Apache Hive.

Hive is mainly used for data querying, analysis, and summarization. It helps improve developer productivity, though that comes at the cost of higher latency and lower efficiency than hand-written MapReduce.

HiveQL is a variant of SQL, and a very good one indeed; even compared to SQL systems implemented in traditional databases, Hive stands tall. Hive also ships with many User Defined Functions (UDFs), and it is easy to write and contribute new ones.
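As a sketch of how a custom UDF is plugged in, a user-built function packaged in a jar can be registered and called from HiveQL. The jar path, class name, and table below are placeholders, not real artifacts:

```sql
-- Register a custom UDF packaged in a jar (paths and names are illustrative).
ADD JAR /path/to/my_udfs.jar;
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.MyLower';

-- Use it like any built-in function (hypothetical table).
SELECT my_lower(name) FROM employees;
```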

Also, we can connect Hive queries to various Hadoop packages, such as RHive, RHipe, and even Apache Mahout. Hive greatly helps the developer community when working on complex analytical processing and challenging data formats.

To be more specific, ‘Data warehouse’ means a system we use for reporting and data analysis. Basically, it refers to inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information and suggesting conclusions.

Moreover, in the different business, science, and social science domains data analysis has multiple aspects and approaches, encompassing diverse techniques under a variety of names.

In addition, Hive allows many users to access the data simultaneously while keeping response time low. Response time means the time a system or functional unit takes to react to a given input; Hive responds much faster than most other types of queries on the same kind of huge datasets.

Moreover, Hive is highly scalable: more commodity machines can easily be added to the cluster as the data grows, without any drop in performance.

Hive Tutorial – History 

Hive was developed by the Data Infrastructure Team at Facebook, specifically to address Facebook's own requirements.

Internally at Facebook, it became very popular: it is used for a wide variety of applications, running thousands of jobs on the cluster for hundreds of users.

In addition, the Hive-Hadoop cluster at Facebook stores more than 2 PB of raw data and loads around 15 TB of new data daily.

Hive is also used and developed by a number of other companies, such as Amazon, IBM, Yahoo, Netflix, and the Financial Industry Regulatory Authority (FINRA).

Hive Architecture

The diagram below shows the Hive architecture and its components:

Hive Tutorial – Hive Architecture

There are several different units in this component diagram. Let's describe each unit:

a. User Interface

Hive is data warehouse infrastructure software that mediates interaction between the user and HDFS. Hive supports several user interfaces: the Hive Web UI, the Hive command line, and Hive HDInsight (on Windows Server).

b. Meta Store

Hive uses a database server to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.

c. HiveQL Process Engine

HiveQL is similar to SQL and is used for querying against the schema information in the Metastore. It is one of the replacements for the traditional MapReduce approach: instead of writing a MapReduce program in Java, we can write a HiveQL query and have Hive process it as a MapReduce job.
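To illustrate, the classic word-count job, which takes dozens of lines of Java as a MapReduce program, can be expressed in a few lines of HiveQL. This sketch assumes a hypothetical `docs` table with a single string column `line`:

```sql
-- Split each line into words and count occurrences;
-- Hive compiles this into the equivalent MapReduce job.
SELECT word, COUNT(*) AS cnt
FROM docs
LATERAL VIEW explode(split(line, ' ')) t AS word
GROUP BY word;
```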

d. Execution Engine

The Hive execution engine is the conjunction of the HiveQL process engine and MapReduce. It processes the query and generates the same results a MapReduce job would, using the MapReduce flavor of execution.

e. HDFS or HBase

The Hadoop Distributed File System (HDFS) or HBase is the data storage technique used to store the data in the file system.

How Does Hive Work?

The following diagram depicts the workflow between Hive and Hadoop.

Apache Hive Tutorial – Working of Hive

The following steps describe how Hive interacts with the Hadoop framework.

Step-1 Execute Query

First, the Hive interface (command line or Web UI) sends the query to the driver, over a database connectivity interface such as JDBC or ODBC, for execution.

Step-2 Get Plan

The driver then takes the help of the query compiler, which parses the query to check the syntax and to build the query plan, i.e. to determine what the query requires.

Step-3  Get Metadata

Next, the compiler sends a metadata request to the Metastore (any database).

Step-4 Send Metadata

After that, the Metastore sends the metadata as a response to the compiler.

Step-5 Send Plan

Then the compiler checks the requirement and resends the plan to the driver. At this point, the parsing and compiling of the query are complete.

Step-6 Execute Plan

Further, the driver sends the execution plan to the execution engine.

Step-7 Execute Job

Internally, executing the job is a MapReduce job. The execution engine sends the job to the JobTracker, which runs on the name node; the JobTracker assigns the job to TaskTrackers, which run on the data nodes. Here, the query executes as a MapReduce job.

During the execution, the execution engine can execute metadata operations with Metastore.

Step-8 Fetch Result

Once execution is over, the execution engine receives the results from the data nodes.

Step-9 Send Results

After fetching the results, the execution engine sends them to the driver.

Step-10 Send Results

At last, the driver sends the results to Hive interfaces.
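You can inspect the plan that the compiler produces in steps 2–5 using Hive's EXPLAIN statement, which prints the stages (typically MapReduce stages) the query compiles into. The table name here is hypothetical:

```sql
-- Show the compiled execution plan without running the query.
EXPLAIN
SELECT user_id, COUNT(*) FROM page_views GROUP BY user_id;
```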

Features of Hive

In this section of the Hive tutorial, we study Apache Hive's features: a familiar SQL-like query language (HiveQL), scalability to very large datasets on commodity hardware, extensibility through user-defined functions, and support for partitioning and bucketing of tables as well as a variety of file formats.

Limitation of Hive

This Apache Hive tutorial also covers the limitations of Hive: it does not support real-time queries or row-level updates, it has high query latency, and it is designed for batch-oriented analysis rather than online transaction processing (OLTP).

Apache Hive Tutorial – Usage

Here, we will look at typical Hive usages, such as data warehousing, ad-hoc analysis and reporting, and batch processing of log data.

Hive vs Spark SQL

In this section of Apache Hive tutorial, we will compare Hive vs Spark SQL in detail.
a. Initial release

Hive: first released in the year 2012.

Spark SQL: first released in the year 2014.
b. Current release

Hive: version 2.3.2, released on 18 November 2017.

Spark SQL: version 2.1.2, released on 9 October 2017.
c. Developer

Hive: originally developed by Facebook, then donated to the Apache Software Foundation, which has maintained it since.

Spark SQL: Spark was originally developed at UC Berkeley's AMPLab and later donated to the Apache Software Foundation.
d. Server operating systems

Hive: supports all operating systems with a Java VM.

Spark SQL: supports several operating systems, such as Linux, OS X, and Windows.
e. Data Types

Hive: supports predefined data types, for example FLOAT or DATE.

Spark SQL: like Hive, it also supports predefined data types, for example float or date.
f. Support of SQL

Hive: possesses SQL-like DML and DDL statements.

Spark SQL: like Hive, it also possesses SQL-like DML and DDL statements.
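Both the predefined data types and the SQL-like DDL/DML mentioned above can be seen in a small HiveQL sketch (the table and columns are hypothetical):

```sql
-- DDL with predefined data types such as FLOAT and DATE.
CREATE TABLE sales (
  item    STRING,
  price   FLOAT,
  sold_on DATE
);

-- SQL-like DML.
INSERT INTO TABLE sales VALUES ('book', 12.5, '2017-11-18');
SELECT item, price FROM sales WHERE sold_on = '2017-11-18';
```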

Pig vs Hive vs Hadoop MapReduce


Get a complete differentiation of Pig vs Hive vs Hadoop Mapreduce in this section of Apache Hive tutorial.

a. Language

Pig: uses a scripting language (Pig Latin).

Hive: uses a SQL-like query language (HiveQL).

Hadoop MapReduce: uses a compiled language (Java).

b. Abstraction

Pig: high level of abstraction.

Hive: high level of abstraction.

Hadoop MapReduce: low level of abstraction.

c. Lines of code

Pig: comparatively fewer lines of code than MapReduce.

Hive: comparatively fewer lines of code than both MapReduce and Pig.

Hadoop MapReduce: more lines of code.

d. Development efforts

Pig: comparatively less development effort than MapReduce.

Hive: comparatively less development effort than both MapReduce and Pig.

Hadoop MapReduce: more development effort is involved.

e. Code efficiency

Pig: code efficiency is relatively lower.

Hive: code efficiency is relatively lower.

Hadoop MapReduce: high code efficiency.

So, this was all in Apache Hive Tutorial. Hope you like our explanation.

Conclusion – Hive Tutorial

Hence, in this Apache Hive tutorial, we have seen the concept of Apache Hive. It includes Hive architecture, limitations of Hive, advantages, why Hive is needed, Hive History, Hive vs Spark SQL and Pig vs Hive vs Hadoop MapReduce.

Still, if you have any questions about this Apache Hive tutorial, feel free to ask through the comment section.
