Apache Hive Architecture & Components

1. Objective

In our previous blog, we discussed what Apache Hive is in detail. In this Hive Architecture tutorial, we discuss the architecture of Apache Hive and its different components, and then walk through the steps Hive follows to process data.

2. What is Hive?

Apache Hive is an ETL and data warehousing tool built on top of Hadoop. It simplifies operations such as:

  • Analysis of huge datasets
  • Ad-hoc queries
  • Data encapsulation

Learn: Hive Operators – A Complete Tutorial for Hive Built-in Operators

3. Hadoop Hive Architecture

Hadoop Hive Architecture and Hive Components

The architecture diagram describes Hive and its components. It also describes the flow in which a query is submitted to Hive and finally processed using the MapReduce framework.

The diagram shows the major components of Apache Hive:

  • Hive Clients – Apache Hive supports applications written in languages such as C++, Java, and Python through its Thrift, JDBC, and ODBC drivers, so you can write a Hive client application in the language of your choice.
  • Hive Services – Hive provides services such as the web interface and the CLI for running queries.
  • Processing Framework and Resource Management – Hive internally uses the Hadoop MapReduce framework to execute queries.
  • Distributed Storage – Because Hive is built on top of Hadoop, it uses the underlying HDFS for distributed storage.

You can also learn the different ways to configure the Hive Metastore.
Now let us discuss the Hive clients and Hive services in detail.

3.1. Hive Clients

Hive supports different types of client applications for running queries. These clients fall into three categories:

  • Thrift Clients – Because the Apache Hive server is based on Thrift, it can serve requests from any language that supports Thrift.
  • JDBC Clients – Apache Hive lets Java applications connect to it through a JDBC driver, defined in the class org.apache.hadoop.hive.jdbc.HiveDriver.
  • ODBC Clients – The ODBC driver lets applications that support the ODBC protocol connect to Hive. Like the JDBC driver, the ODBC driver uses Thrift to communicate with the Hive server.
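
To make the JDBC client concrete, here is a minimal sketch of how a driver class pairs with a connection URL. The helper name `jdbc_url` and the `DRIVERS` table are hypothetical, introduced only for illustration. The HiveServer2 entry (class `org.apache.hive.jdbc.HiveDriver`, scheme `jdbc:hive2`) is an addition not covered in the text above, and the host, port, and database values are placeholder assumptions.

```python
# Hypothetical lookup table: server generation -> (JDBC driver class, URL scheme).
# The first entry matches the driver class named above; the HiveServer2 entry
# is an assumption added for comparison.
DRIVERS = {
    "hiveserver1": ("org.apache.hadoop.hive.jdbc.HiveDriver", "jdbc:hive"),
    "hiveserver2": ("org.apache.hive.jdbc.HiveDriver", "jdbc:hive2"),
}


def jdbc_url(server, host, port=10000, database="default"):
    """Return the (driver class, JDBC URL) pair a Java client would use."""
    driver_class, scheme = DRIVERS[server]
    return driver_class, f"{scheme}://{host}:{port}/{database}"


# Placeholder host; a real deployment would supply its own values.
driver_class, url = jdbc_url("hiveserver1", "localhost")
print(driver_class)
print(url)
```

A Java application would load the driver class and pass the URL to its JDBC connection call; the sketch only shows how the two pieces fit together.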

Learn: Apache Hive Installation on Ubuntu – A Hive Tutorial

3.2. Hive Services

Apache Hive provides various services, as shown in the architecture diagram. Let us look at each in detail:
a) CLI (Command Line Interface) – This is the default shell that Hive provides, in which you can execute Hive queries and commands directly.
b) Web Interface – Hive also provides a web-based GUI for executing Hive queries and commands. See also the different Hive data types and operators.
c) Hive Server – It is built on Apache Thrift and is therefore also called the Thrift server. It allows different clients to submit requests to Hive and retrieve the final result.
d) Hive Driver – The driver receives the queries that a Hive client submits through the Thrift, JDBC, ODBC, CLI, or Web UI interfaces.

  • Compiler – The Hive driver passes the query to the compiler, where parsing, type checking, and semantic analysis take place with the help of the schema stored in the metastore.
  • Optimizer – It generates the optimized logical plan in the form of a DAG (Directed Acyclic Graph) of MapReduce and HDFS tasks.
  • Executor – Once compilation and optimization are complete, the execution engine runs these tasks in the order of their dependencies using Hadoop.

e) Metastore – The metastore is the central repository of Apache Hive metadata in the Hive architecture. It stores metadata for Hive tables (such as their schema and location) and partitions in a relational database, and gives clients access to this information through the metastore service API. The Hive metastore consists of two fundamental units:

  • A service that provides metastore access to other Apache Hive services.
  • Disk storage for the Hive metadata, which is separate from HDFS storage.
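
The idea of keeping table metadata in a relational database, behind a small access service, can be sketched as follows. This is a toy model only: it uses an in-memory SQLite database, and every table, column, and function name here is hypothetical and far simpler than the real metastore schema and service API.

```python
import sqlite3


def create_metastore():
    """Create a toy metadata store (hypothetical schema, not Hive's own)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE tables (name TEXT PRIMARY KEY, location TEXT)")
    db.execute(
        "CREATE TABLE columns (table_name TEXT, position INTEGER, name TEXT, type TEXT)"
    )
    db.execute("CREATE TABLE partitions (table_name TEXT, spec TEXT, location TEXT)")
    return db


def register_table(db, name, location, columns):
    """Record a table's location and its ordered (column name, type) schema."""
    db.execute("INSERT INTO tables VALUES (?, ?)", (name, location))
    db.executemany(
        "INSERT INTO columns VALUES (?, ?, ?, ?)",
        [(name, i, col, typ) for i, (col, typ) in enumerate(columns)],
    )


def add_partition(db, table_name, spec, location):
    """Record one partition of a table and where its data lives."""
    db.execute("INSERT INTO partitions VALUES (?, ?, ?)", (table_name, spec, location))


def get_table(db, name):
    """Return schema, location, and partitions, as a compiler-like caller needs."""
    row = db.execute("SELECT location FROM tables WHERE name = ?", (name,)).fetchone()
    if row is None:
        raise KeyError(name)
    columns = db.execute(
        "SELECT name, type FROM columns WHERE table_name = ? ORDER BY position",
        (name,),
    ).fetchall()
    partitions = db.execute(
        "SELECT spec, location FROM partitions WHERE table_name = ? ORDER BY spec",
        (name,),
    ).fetchall()
    return {"location": row[0], "columns": columns, "partitions": partitions}


# Example data; the table name and HDFS paths are made up for illustration.
db = create_metastore()
register_table(db, "sales", "hdfs:///warehouse/sales", [("id", "int"), ("amount", "double")])
add_partition(db, "sales", "dt=2020-01-01", "hdfs:///warehouse/sales/dt=2020-01-01")
info = get_table(db, "sales")
print(info["location"], info["columns"])
```

Note that only the metadata lives in the relational store; the paths it records point at data that stays in HDFS, which mirrors the separation described above.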

4. How to process data with Apache Hive?

DataFlow in Hive

Now we will discuss how a typical query flows through the system:

  • The User Interface (UI) calls the execute interface of the Driver.
  • The Driver creates a session handle for the query and sends the query to the compiler to generate an execution plan.
  • The compiler needs metadata, so it sends a getMetaData request to the metastore and receives the metadata back in the sendMetaData response.
  • The compiler uses this metadata to type check the expressions in the query. It then generates the plan, which is a DAG of stages, each stage being a map/reduce job, a metadata operation, or an operation on HDFS. For map/reduce stages, the plan contains map operator trees and a reduce operator tree.
  • The execution engine submits these stages to the appropriate components. In each task, the deserializer associated with the table or intermediate output reads the rows from HDFS files and passes them through the associated operator tree. The generated output is written to a temporary HDFS file through the serializer. These temporary files feed the subsequent map/reduce stages of the plan. For DML operations, the final temporary file is moved to the table’s location.
  • For queries, the execution engine reads the contents of the temporary file directly from HDFS as part of the fetch call from the Driver.
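
The part of this flow where the execution engine runs a DAG of stages in dependency order can be sketched with a topological sort. This is an illustration under stated assumptions, not Hive's implementation: the stage names, the `run_plan` helper, and the callback are all hypothetical, and the standard-library `graphlib` module (Python 3.9+) stands in for the engine's own scheduling.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_plan(stages, depends_on, run_stage):
    """Run each stage only after the stages it depends on; return the order used.

    stages maps a stage name to its kind; depends_on maps a stage name to the
    names of the stages that must finish first.
    """
    graph = {name: depends_on.get(name, ()) for name in stages}
    order = []
    for name in TopologicalSorter(graph).static_order():
        run_stage(name, stages[name])
        order.append(name)
    return order


# Hypothetical plan: two map/reduce stages followed by an HDFS move for DML.
stages = {
    "join_orders": "map/reduce",
    "aggregate_totals": "map/reduce",
    "move_to_table": "hdfs",
}
depends_on = {
    "aggregate_totals": ["join_orders"],
    "move_to_table": ["aggregate_totals"],
}

log = []
order = run_plan(stages, depends_on, lambda name, kind: log.append(f"{kind}: {name}"))
print(order)
```

In this sketch each stage simply records that it ran. In the flow described above, each map/reduce stage would instead read rows through a deserializer and write its output to a temporary HDFS file for the next stage.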

Learn: Hive Partitions, Types of Hive Partitioning with Examples

5. Conclusion

Hence, Hive is a data warehousing package built on top of Hadoop, used for analyzing and processing structured and semi-structured data. It provides a flexible query language, HQL, for querying and processing data, and so offers features that help where an RDBMS has limitations. Now that you have learned about Apache Hive and its architecture, you can move on to Apache Hive installation on Ubuntu to try out its functionality.
If you have any questions about this Hive Architecture tutorial, please leave a comment in the section below.
1 Response

  1. Ankit says:

    In the architecture diagram there is a component of Driver “Optimizer”, but same is not mentioned in “DataFlow in hive “.
