Apache Hive Tutorial – A Comprehensive Guide for Beginners
Apache Hive is used for querying and analyzing large datasets stored in Hadoop. However, there are many more Hive concepts, all of which we will discuss in this Apache Hive tutorial.
In this Apache Hive tutorial, we will learn about Hive's history, the reasons to use Hive, and the Hive architecture and its components.
Afterwards, we will cover Hive's limitations, how Hive works, Hive vs Spark SQL, and Pig vs Hive vs Hadoop MapReduce.
So, let’s start Hive Tutorial.
What is Apache Hive?
Apache Hive is an open source data warehouse system built on top of Hadoop. We use it for querying and analyzing large datasets stored in Hadoop files. Moreover, by using Hive we can process both structured and semi-structured data in Hadoop.
In other words, it is a data warehouse infrastructure that facilitates querying and managing large datasets residing in distributed storage. It offers a way to query the data using a SQL-like query language called HiveQL (Hive Query Language).
Internally, a compiler translates HiveQL statements into MapReduce jobs, which are then submitted to the Hadoop framework for execution.
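As a quick illustration, a HiveQL statement looks just like SQL. The table and column names below are hypothetical, a minimal sketch rather than anything from a real deployment:

```sql
-- Hypothetical table of web page visits
CREATE TABLE page_views (
  user_id   BIGINT,
  page_url  STRING,
  view_time TIMESTAMP
);

-- Count views per page; Hive compiles this into one or more MapReduce jobs
SELECT page_url, COUNT(*) AS views
FROM page_views
GROUP BY page_url
ORDER BY views DESC;
```

Anyone comfortable with SQL can read and write this, even though MapReduce jobs run underneath.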
a. Hive is not
A few misconceptions about Hive are common, so let's clarify what Hive is not:
- It is not a relational database
- It is not designed for OnLine Transaction Processing (OLTP)
- It is not a language for real-time queries or row-level updates
Why Hive?
In this section of Hive tutorial, we discuss – Why should we use Apache Hive technology?
As we know, Hive is mainly used for data querying, analysis, and summarization. It helps to improve developer productivity, although that comes at the cost of increased latency and decreased efficiency.
HiveQL is a variant of SQL, and a very good one indeed. Compared with SQL systems implemented in traditional databases, Hive holds its own. It ships with many built-in functions and makes it easy to contribute User Defined Functions (UDFs).
Also, we can connect Hive queries to various Hadoop packages, such as RHive, RHIPE, and even Apache Mahout. This greatly helps the developer community when working with complex analytical processing and challenging data formats.
To be more specific, ‘Data warehouse’ means a system we use for reporting and data analysis. Basically, it refers to inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information and suggesting conclusions.
Moreover, in the different business, science, and social science domains data analysis has multiple aspects and approaches, encompassing diverse techniques under a variety of names.
In addition, it allows many users to access the data simultaneously while keeping response time reasonable, response time being the time a system or functional unit takes to react to a given input. Hive also tends to respond faster than most other approaches to querying huge datasets of the same kind.
Moreover, Hive is highly flexible: more commodity machines can easily be added to the cluster as the data grows, without a drop in performance.
Hive Tutorial – History
Hive was originally developed by the Data Infrastructure Team at Facebook, specifically to address Facebook's own requirements.
It became very popular internally: it is used for a wide variety of applications, running thousands of jobs on the cluster with hundreds of users.
In addition, the Hive-Hadoop cluster at Facebook stores more than 2 PB of raw data and loads about 15 TB of new data daily.
It is also important to know that Hive is used and developed by a number of other companies, such as Amazon, IBM, Yahoo, Netflix, and the Financial Industry Regulatory Authority (FINRA).
Hive Architecture
The diagram below shows the Hive architecture with its components.
There are several different units in this component diagram. Let's describe each unit:
a. User Interface
As data warehouse infrastructure software, Hive creates the interaction between the user and HDFS. Hive supports several user interfaces: the Hive Web UI, the Hive command line, and Hive HDInsight (on Windows Server).
b. Meta Store
Hive uses a separate database server, the metastore, to store the schema or metadata of tables and databases: the columns in a table, their data types, and their HDFS mapping.
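The metadata kept in the metastore can be inspected directly from HiveQL. As a small sketch (the table name is hypothetical):

```sql
-- List the databases and tables the metastore knows about
SHOW DATABASES;
SHOW TABLES;

-- Show the full metadata the metastore keeps for a (hypothetical) table:
-- columns, data types, HDFS location, storage format, and more
DESCRIBE FORMATTED page_views;
```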
c. HiveQL Process Engine
HiveQL is similar to SQL, especially for querying schema information in the metastore. It is also a replacement for the traditional MapReduce approach: instead of writing a MapReduce program in Java, we can write a HiveQL query and let Hive process it as a MapReduce job.
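For instance, the classic word-count job, which takes dozens of lines of Java MapReduce code, reduces to a few lines of HiveQL. This is only a sketch; the `docs` table and its `line` column are hypothetical:

```sql
-- Hypothetical table holding one line of raw text per row.
-- split() breaks each line into an array of words, and
-- LATERAL VIEW explode() emits one row per word, which we then count.
SELECT word, COUNT(*) AS cnt
FROM docs
LATERAL VIEW explode(split(line, '\\s+')) t AS word
GROUP BY word;
```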
d. Execution Engine
The Hive execution engine is the bridge between the HiveQL process engine and MapReduce. It processes the query and generates the same results as MapReduce would, using the MapReduce style of execution.
e. HDFS or HBase
The Hadoop Distributed File System (HDFS) or HBase is the data storage technique used to store the data in the file system.
How Does Hive Work?
The following diagram depicts the workflow between Hive and Hadoop.
The following steps define how Hive interacts with the Hadoop framework.
Step-1 Execute Query
First, the Hive interface (command line or Web UI) sends the query to the driver (via any database driver such as JDBC or ODBC) for execution.
Step-2 Get Plan
Afterwards, the driver takes the help of the query compiler, which parses the query to check the syntax and build the query plan, i.e., the requirements of the query.
Step-3 Get Metadata
Further, the compiler sends a metadata request to the metastore (any database).
Step-4 Send Metadata
After that, the metastore sends the metadata as a response to the compiler.
Step-5 Send Plan
Then the compiler checks the requirements and resends the plan to the driver. Up to this point, the parsing and compiling of the query are complete.
Step-6 Execute Plan
Further, the driver sends the execution plan to the execution engine.
Step-7 Execute Job
Internally, the process of executing the job is a MapReduce job. The execution engine sends the job to the JobTracker, which resides in the name node, and the JobTracker assigns the job to TaskTrackers, which reside in the data nodes. Here, the query executes as a MapReduce job.
- Metadata Ops
During the execution, the execution engine can execute metadata operations with Metastore.
Step-8 Fetch Result
Once execution is over, the execution engine receives the results from the data nodes.
Step-9 Send Results
After fetching the results, the execution engine sends those resultant values to the driver.
Step-10 Send Results
At last, the driver sends the results to Hive interfaces.
Features of Hive
In this section of the Hive tutorial, we study Apache Hive's features:
- Its best feature is that it offers data summarization, querying, and analysis in a much easier manner.
- Hive supports external tables, which make it possible to process data without moving it into Hive's managed warehouse directory.
- Moreover, it fits the low-level interface requirement of Hadoop perfectly.
- To improve performance, it supports partitioning of data at the table level.
- When it comes to optimizing logical plans, Hive has a rule-based optimizer available.
- Hive is scalable, familiar, and extensible in nature.
- Knowledge of basic SQL is enough to work with HiveQL; no programming-language knowledge is needed.
- By using Hive, it is possible to process structured data in Hadoop.
- Hive makes querying very simple, much like SQL.
- By using Hive, it is possible to run ad-hoc queries for data analysis.
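Two of these features, external tables and partitioning, can be sketched in HiveQL. The table names, columns, and HDFS path below are hypothetical:

```sql
-- External table: Hive reads the files in place and does not
-- delete the underlying data when the table is dropped
CREATE EXTERNAL TABLE web_logs (
  ip     STRING,
  url    STRING,
  status INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/raw/web_logs';

-- Partitioned table: each dt value becomes its own HDFS subdirectory,
-- so queries that filter on dt scan only the matching partitions
CREATE TABLE web_logs_by_day (
  ip     STRING,
  url    STRING,
  status INT
)
PARTITIONED BY (dt STRING);
```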
Limitation of Hive
This section of the Apache Hive tutorial discusses the following limitations of Hive:
- We cannot perform real-time queries with Hive, and it does not offer row-level updates.
- Moreover, Hive does not offer the low latency required for interactive data browsing.
- Hive is not the right choice for online transaction processing.
- When it comes to latency, Hive queries generally have very high latency.
Apache Hive Tutorial – Usage
Here, we will look at the following Hive usages:
- We use Hive for schema flexibility as well as schema evolution.
- Moreover, it is possible to partition and bucket tables in Apache Hive.
- Also, JDBC/ODBC drivers are available in Hive, so we can use them.
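As a sketch of the second point, a table that is both partitioned and bucketed might be declared like this (the names and bucket count are hypothetical):

```sql
-- Partition by day; within each partition, hash user_id into 32 buckets.
-- Bucketing helps with table sampling and bucketed map-side joins.
CREATE TABLE events (
  user_id BIGINT,
  action  STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS;
```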
Hive vs Spark SQL
In this section of Apache Hive tutorial, we will compare Hive vs Spark SQL in detail.
a. Initial release
- Apache Hive
Hive was first released in 2012.
- Spark SQL
Whereas Spark SQL was first released in 2014.
b. Current release
- Apache Hive
Currently released on 18 November 2017: version 2.3.2
- Spark SQL
Currently released on 09 October 2017: version 2.1.2
c. Developer
- Apache Hive
Facebook developed it originally and later donated it to the Apache Software Foundation, which has maintained it since.
- Spark SQL
Apache Software Foundation developed it originally.
d. Server operating systems
- Apache Hive
It supports all operating systems with a Java VM.
- Spark SQL
Spark SQL supports several operating systems, for example Linux, OS X, and Windows.
e. Data Types
- Apache Hive
It has predefined data types, for example float or date.
- Spark SQL
Like Hive, Spark SQL also has predefined data types, for example float or date.
f. Support of SQL
- Apache Hive
Basically, it possesses SQL-like DML and DDL statements.
- Spark SQL
Like Hive, it also possesses SQL-like DML and DDL statements.
Pig vs Hive vs Hadoop MapReduce
Get a complete differentiation of Pig vs Hive vs Hadoop Mapreduce in this section of Apache Hive tutorial.
a. Language
- Hive
It has a SQL-like query language.
- MapReduce
It uses a compiled language (Java).
- Pig
It has a scripting language (Pig Latin).
b. Abstraction
- Hive
It has a high level of abstraction.
- MapReduce
It has a low level of abstraction.
- Pig
It has the High level of Abstraction.
c. Line of codes
- Hive
Comparatively fewer lines of code than both MapReduce and Pig.
- MapReduce
It has more lines of code.
- Pig
Comparatively fewer lines of code than MapReduce.
d. Development Efforts
- Hive
Comparatively less development effort than both MapReduce and Pig.
- MapReduce
More development effort is involved.
- Pig
Comparatively less development effort than MapReduce.
e. Code Efficiency
- Hive
Code efficiency is relatively less.
- MapReduce
It has high Code efficiency.
- Pig
Code efficiency is relatively less.
So, this was all in the Apache Hive tutorial. We hope you liked our explanation.
Conclusion – Hive Tutorial
Hence, in this Apache Hive tutorial, we have seen the concept of Apache Hive. It includes Hive architecture, limitations of Hive, advantages, why Hive is needed, Hive History, Hive vs Spark SQL and Pig vs Hive vs Hadoop MapReduce.
Still, if you have any query about this Apache Hive tutorial, feel free to ask through the comment section.