
Apache Hive Tutorial – A Comprehensive Guide for Beginners


Apache Hive is used for querying and analyzing large datasets stored in Hadoop. However, there is much more to Hive than that, and we will discuss all of it in this Apache Hive tutorial, starting with what Apache Hive actually is.

In this Apache Hive tutorial, we will first learn Hive's history and the reasons to use Hive. We will then cover the Hive architecture and its components to understand it well.

Afterwards, we will also cover its limitations, how Hive works, Hive vs Spark SQL, and Pig vs Hive vs Hadoop MapReduce.

So, let’s start Hive Tutorial.

What is Apache Hive?

Apache Hive is an open source data warehouse system built on top of Hadoop. We use it primarily for querying and analyzing large datasets stored in Hadoop files. Moreover, using Hive we can process both structured and semi-structured data in Hadoop.

In other words, it is a data warehouse infrastructure that facilitates querying and managing large datasets residing in distributed storage. It offers a way to query the data using a SQL-like query language called HiveQL (Hive Query Language).

Internally, a compiler translates HiveQL statements into MapReduce jobs, which are then submitted to the Hadoop framework for execution.
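For instance, a simple HiveQL query looks just like SQL. The sketch below assumes a hypothetical `page_views` table; behind the scenes, Hive would compile this aggregation into a MapReduce job:

```sql
-- Hypothetical table; Hive translates this GROUP BY into a MapReduce job.
SELECT user_id, COUNT(*) AS views
FROM page_views
WHERE view_date = '2017-11-18'
GROUP BY user_id;
```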

a. Hive is not

A few misconceptions commonly arise about Hive, so let's clarify them: Hive is not a relational database, it is not designed for online transaction processing (OLTP), and it is not meant for real-time queries or row-level updates.

Why Hive?


In this section of the Hive tutorial, we discuss why we should use Apache Hive.

Hive is mainly used for data querying, analysis, and summarization. It helps improve developer productivity, though that comes at the cost of higher latency and lower efficiency than hand-written MapReduce.

HiveQL is a variant of SQL, and a very good one indeed; even compared to SQL systems implemented in traditional databases, Hive stands tall. Hive also ships with many User Defined Functions (UDFs), and it is easy to write and contribute new ones.
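As a sketch of how a custom UDF is plugged in, a user-built function packaged in a jar can be registered and called from HiveQL. The jar path, class name, and table below are placeholders, not real artifacts:

```sql
-- Register a custom UDF packaged in a jar (paths and names are illustrative).
ADD JAR /path/to/my_udfs.jar;
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.MyLower';

-- Use it like any built-in function (hypothetical table).
SELECT my_lower(name) FROM employees;
```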

Also, we can connect Hive queries to various Hadoop packages, such as RHive, RHipe, and even Apache Mahout. Hive greatly helps the developer community when working on complex analytical processing and challenging data formats.

To be more specific, ‘Data warehouse’ means a system we use for reporting and data analysis. Basically, it refers to inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information and suggesting conclusions.

Moreover, in the different business, science, and social science domains data analysis has multiple aspects and approaches, encompassing diverse techniques under a variety of names.

In addition, Hive allows many users to access the data simultaneously while keeping response time low. Response time means the time a system or functional unit takes to react to a given input; Hive responds much faster than most other types of queries on the same kind of huge datasets.

Moreover, Hive is highly scalable: more commodity machines can easily be added to the cluster as the data grows, without any drop in performance.

Hive Tutorial – History 

Hive was developed by the Data Infrastructure Team at Facebook, specifically to address Facebook's own requirements.

Internally at Facebook, it became very popular: it is used for a wide variety of applications, running thousands of jobs on the cluster for hundreds of users.

In addition, the Hive-Hadoop cluster at Facebook stores more than 2 PB of raw data and loads around 15 TB of new data daily.

Hive is also used and developed by a number of other companies, such as Amazon, IBM, Yahoo, Netflix, and the Financial Industry Regulatory Authority (FINRA).

Hive Architecture

The diagram below shows the Hive architecture and its components:

Hive Tutorial – Hive Architecture

There are several different units in this component diagram. Let's describe each unit:

a. User Interface

Hive is data warehouse infrastructure software that mediates interaction between the user and HDFS. Hive supports several user interfaces: the Hive Web UI, the Hive command line, and Hive HDInsight (on Windows Server).

b. Meta Store

Hive uses a database server to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.

c. HiveQL Process Engine

HiveQL is similar to SQL and is used for querying against the schema information in the Metastore. It is one of the replacements for the traditional MapReduce approach: instead of writing a MapReduce program in Java, we can write a HiveQL query and have Hive process it as a MapReduce job.
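To illustrate, the classic word-count job, which takes dozens of lines of Java as a MapReduce program, can be expressed in a few lines of HiveQL. This sketch assumes a hypothetical `docs` table with a single string column `line`:

```sql
-- Split each line into words and count occurrences;
-- Hive compiles this into the equivalent MapReduce job.
SELECT word, COUNT(*) AS cnt
FROM docs
LATERAL VIEW explode(split(line, ' ')) t AS word
GROUP BY word;
```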

d. Execution Engine

The Hive execution engine is the conjunction of the HiveQL process engine and MapReduce. It processes the query and generates the same results a MapReduce job would, using the MapReduce flavor of execution.

e. HDFS or HBase

The Hadoop Distributed File System (HDFS) or HBase is the data storage technique used to store the data in the file system.

How Does Hive Work?

The following diagram depicts the workflow between Hive and Hadoop.

Apache Hive Tutorial – Working of Hive

The following steps describe how Hive interacts with the Hadoop framework.

Step-1 Execute Query

First, the Hive interface (command line or Web UI) sends the query to the driver, over a database connectivity interface such as JDBC or ODBC, for execution.

Step-2 Get Plan

The driver then takes the help of the query compiler, which parses the query to check the syntax and to build the query plan, i.e. to determine what the query requires.

Step-3  Get Metadata

Next, the compiler sends a metadata request to the Metastore (any database).

Step-4 Send Metadata

After that, the Metastore sends the metadata as a response to the compiler.

Step-5 Send Plan

Then the compiler checks the requirement and resends the plan to the driver. At this point, the parsing and compiling of the query are complete.

Step-6 Execute Plan

Further, the driver sends the execution plan to the execution engine.

Step-7 Execute Job

Internally, executing the job is a MapReduce job. The execution engine sends the job to the JobTracker, which runs on the name node; the JobTracker assigns the job to TaskTrackers, which run on the data nodes. Here, the query executes as a MapReduce job.

During the execution, the execution engine can execute metadata operations with Metastore.

Step-8 Fetch Result

Once execution is over, the execution engine receives the results from the data nodes.

Step-9 Send Results

After fetching the results, the execution engine sends them to the driver.

Step-10 Send Results

At last, the driver sends the results to Hive interfaces.
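You can inspect the plan that the compiler produces in steps 2–5 using Hive's EXPLAIN statement, which prints the stages (typically MapReduce stages) the query compiles into. The table name here is hypothetical:

```sql
-- Show the compiled execution plan without running the query.
EXPLAIN
SELECT user_id, COUNT(*) FROM page_views GROUP BY user_id;
```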

Features of Hive

In this section of the Hive tutorial, we study Apache Hive's features: a familiar SQL-like query language (HiveQL), scalability to very large datasets on commodity hardware, extensibility through user-defined functions, and support for partitioning and bucketing of tables as well as a variety of file formats.

Limitation of Hive

This Apache Hive tutorial also covers the limitations of Hive: it does not support real-time queries or row-level updates, it has high query latency, and it is designed for batch-oriented analysis rather than online transaction processing (OLTP).

Apache Hive Tutorial – Usage

Here, we will look at typical Hive usages, such as data warehousing, ad-hoc analysis and reporting, and batch processing of log data.

Hive vs Spark SQL

In this section of Apache Hive tutorial, we will compare Hive vs Spark SQL in detail.
a. Initial release

Hive: first released in the year 2012.

Spark SQL: first released in the year 2014.
b. Current release

Hive: version 2.3.2, released on 18 November 2017.

Spark SQL: version 2.1.2, released on 9 October 2017.
c. Developer

Hive: originally developed by Facebook, then donated to the Apache Software Foundation, which has maintained it since.

Spark SQL: Spark was originally developed at UC Berkeley's AMPLab and later donated to the Apache Software Foundation.
d. Server operating systems

Hive: supports all operating systems with a Java VM.

Spark SQL: supports several operating systems, such as Linux, OS X, and Windows.
e. Data Types

Hive: supports predefined data types, for example FLOAT or DATE.

Spark SQL: like Hive, it also supports predefined data types, for example float or date.
f. Support of SQL

Hive: possesses SQL-like DML and DDL statements.

Spark SQL: like Hive, it also possesses SQL-like DML and DDL statements.
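Both the predefined data types and the SQL-like DDL/DML mentioned above can be seen in a small HiveQL sketch (the table and columns are hypothetical):

```sql
-- DDL with predefined data types such as FLOAT and DATE.
CREATE TABLE sales (
  item    STRING,
  price   FLOAT,
  sold_on DATE
);

-- SQL-like DML.
INSERT INTO TABLE sales VALUES ('book', 12.5, '2017-11-18');
SELECT item, price FROM sales WHERE sold_on = '2017-11-18';
```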

Pig vs Hive vs Hadoop MapReduce


Get a complete differentiation of Pig vs Hive vs Hadoop Mapreduce in this section of Apache Hive tutorial.

a. Language

Pig: uses a scripting language (Pig Latin).

Hive: uses a SQL-like query language (HiveQL).

Hadoop MapReduce: uses a compiled language (Java).

b. Abstraction

Pig: high level of abstraction.

Hive: high level of abstraction.

Hadoop MapReduce: low level of abstraction.

c. Lines of code

Pig: comparatively fewer lines of code than MapReduce.

Hive: comparatively fewer lines of code than both MapReduce and Pig.

Hadoop MapReduce: more lines of code.

d. Development efforts

Pig: comparatively less development effort than MapReduce.

Hive: comparatively less development effort than both MapReduce and Pig.

Hadoop MapReduce: more development effort is involved.

e. Code efficiency

Pig: code efficiency is relatively lower.

Hive: code efficiency is relatively lower.

Hadoop MapReduce: high code efficiency.

So, this was all in Apache Hive Tutorial. Hope you like our explanation.

Conclusion – Hive Tutorial

Hence, in this Apache Hive tutorial, we have seen the concept of Apache Hive. It includes Hive architecture, limitations of Hive, advantages, why Hive is needed, Hive History, Hive vs Spark SQL and Pig vs Hive vs Hadoop MapReduce.

Still, if you have any questions about this Apache Hive tutorial, feel free to ask through the comment section.
