Apache Hive Architecture & Components
Our previous blog discussed what Apache Hive is in detail. This tutorial covers the architecture of Apache Hive and the components that make it up, and ends with the steps a query goes through when Hive processes data.
2. What is Hive?
Apache Hive is an ETL and data warehousing tool built on top of Hadoop. It simplifies operations such as:
- Analysis of huge datasets
- Ad-hoc queries
- Data encapsulation
3. Hadoop Hive Architecture
The diagram below describes the architecture of Hive and its components. It also shows the flow in which a query is submitted to Hive and finally processed using the MapReduce framework:
The diagram shows the major components of Apache Hive:
- Hive Clients – Apache Hive supports applications written in languages such as C++, Java, and Python through its JDBC, Thrift, and ODBC drivers, so a Hive client application can be written in the language of your choice.
- Hive Services – Hive provides services such as a web interface and a CLI for running queries.
- Processing framework and Resource Management – Internally, Hive uses the Hadoop MapReduce framework to execute queries.
- Distributed Storage – Because Hive is built on top of Hadoop, it uses the underlying HDFS for distributed storage.
Now let us discuss Hive clients and Hive services in detail.
3.1. Hive Clients
Hive supports different types of client applications for running queries. These clients fall into three categories:
- Thrift Clients – The Hive server is based on Apache Thrift, so it can serve requests from any language that supports Thrift.
- JDBC Clients – Java applications can connect to Hive using the JDBC driver, which is defined in the class org.apache.hadoop.hive.jdbc.HiveDriver (HiveServer2 uses org.apache.hive.jdbc.HiveDriver instead).
- ODBC Clients – The ODBC driver allows applications that support the ODBC protocol to connect to Hive. Like the JDBC driver, the ODBC driver uses Thrift to communicate with the Hive server.
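As a rough illustration of how a JDBC client addresses the Hive server, the sketch below pairs each server generation with its driver class and URL scheme and builds a connection URL. The helper name jdbc_settings, the host hive.example.com, and the default port and database are assumptions chosen for illustration. The tutorial itself does not show connection code.

```python
# Driver class and URL scheme for the two Hive server generations.
# The original HiveServer uses the "jdbc:hive://" scheme; HiveServer2
# uses "jdbc:hive2://" with a different driver class.
DRIVERS = {
    "hiveserver1": ("org.apache.hadoop.hive.jdbc.HiveDriver", "jdbc:hive://"),
    "hiveserver2": ("org.apache.hive.jdbc.HiveDriver", "jdbc:hive2://"),
}


def jdbc_settings(server, host, port=10000, database="default"):
    """Return the driver class and JDBC URL a Java client would use.

    Hypothetical helper for illustration only; port 10000 and the
    "default" database are common defaults, not requirements.
    """
    driver_class, scheme = DRIVERS[server]
    return {
        "driver": driver_class,
        "url": f"{scheme}{host}:{port}/{database}",
    }


# hive.example.com is a placeholder host name.
settings = jdbc_settings("hiveserver2", "hive.example.com")
print(settings["driver"])
print(settings["url"])
```

A Java application would load the driver class named here and pass the URL to its JDBC connection call. Thrift and ODBC clients reach the same server through their own drivers.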
3.2. Hive Services
Apache Hive provides various services, as shown in the diagram above. Let us look at each in detail:
a) CLI (Command Line Interface) – This is the default shell that Hive provides, in which you can execute Hive queries and commands directly.
b) Web Interface – Hive also provides a web-based GUI for executing Hive queries and commands.
c) Hive Server – It is built on Apache Thrift and is therefore also called the Thrift server. It allows different clients to submit requests to Hive and retrieve the final result.
d) Hive Driver – The driver receives the queries that a Hive client submits through the Thrift, JDBC, ODBC, CLI, or web UI interfaces.
- Compiler – The driver passes the query to the compiler, where parsing, type checking, and semantic analysis take place with the help of the schema stored in the metastore.
- Optimizer – It generates the optimized logical plan in the form of a DAG (Directed Acyclic Graph) of MapReduce and HDFS tasks.
- Executor – Once compilation and optimization are complete, the execution engine executes these tasks in the order of their dependencies using Hadoop.
e) Metastore – The metastore is the central repository of Apache Hive metadata. It stores metadata for Hive tables (such as their schema and location) and partitions in a relational database, and gives clients access to this information through the metastore service API. The Hive metastore consists of two fundamental units:
- A service that provides metastore access to other Apache Hive services.
- Disk storage for the Hive metadata, which is separate from HDFS storage.
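The split between a metadata store and a service in front of it can be sketched with an in-memory SQLite database standing in for the relational database. The table layout and the function names (create_metastore, register_table, add_partition, get_table) are hypothetical and far simpler than Hive's real metastore schema and API. The sketch only shows the idea that schema, location, and partition metadata live in relational tables, apart from the data files in HDFS.

```python
import sqlite3


def create_metastore():
    """Create the backing store: a relational database, not HDFS."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tables (name TEXT PRIMARY KEY, location TEXT)")
    conn.execute(
        "CREATE TABLE columns ("
        "table_name TEXT, position INTEGER, col_name TEXT, col_type TEXT)"
    )
    conn.execute(
        "CREATE TABLE partitions (table_name TEXT, spec TEXT, location TEXT)"
    )
    return conn


def register_table(conn, name, location, columns):
    """Record a table's schema and the HDFS path that holds its data."""
    conn.execute("INSERT INTO tables VALUES (?, ?)", (name, location))
    conn.executemany(
        "INSERT INTO columns VALUES (?, ?, ?, ?)",
        [(name, i, col, typ) for i, (col, typ) in enumerate(columns)],
    )


def add_partition(conn, table_name, spec, location):
    """Record one partition of a table and where its files live."""
    conn.execute(
        "INSERT INTO partitions VALUES (?, ?, ?)", (table_name, spec, location)
    )


def get_table(conn, name):
    """Service-style lookup: what a compiler would ask the metastore for."""
    row = conn.execute(
        "SELECT location FROM tables WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise KeyError(name)
    columns = conn.execute(
        "SELECT col_name, col_type FROM columns "
        "WHERE table_name = ? ORDER BY position",
        (name,),
    ).fetchall()
    partitions = conn.execute(
        "SELECT spec, location FROM partitions "
        "WHERE table_name = ? ORDER BY spec",
        (name,),
    ).fetchall()
    return {
        "name": name,
        "location": row[0],
        "columns": columns,
        "partitions": partitions,
    }


# Example data; the table name and paths are made up.
store = create_metastore()
register_table(
    store,
    "page_views",
    "/warehouse/page_views",
    [("user_id", "int"), ("url", "string")],
)
add_partition(store, "page_views", "dt=2024-01-01", "/warehouse/page_views/dt=2024-01-01")
info = get_table(store, "page_views")
print(info["location"], info["columns"], info["partitions"])
```

Only metadata is stored here. The rows of page_views would still sit in files under the recorded location.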
4. How to process data with Apache Hive?
Now we will discuss how a typical query flows through the system:
- The user interface (UI) calls the execute interface of the Driver.
- The Driver creates a session handle for the query and sends the query to the compiler to generate an execution plan.
- The compiler needs metadata, so it sends a getMetaData request to the metastore and receives the metadata back in the sendMetaData response.
- The compiler uses this metadata to type check the expressions in the query. It then generates the plan, which is a DAG of stages, each stage being a map/reduce job, a metadata operation, or an operation on HDFS. For map/reduce stages, the plan contains map operator trees and a reduce operator tree.
- The execution engine submits these stages to the appropriate components. In each task, the deserializer associated with the table or intermediate output reads the rows from HDFS files and passes them through the associated operator tree. Once the output is generated, it is written to a temporary HDFS file through the serializer. These temporary files feed the subsequent map/reduce stages of the plan. For DML operations, the final temporary file is moved to the table's location.
- For queries, the execution engine reads the contents of the temporary file directly from HDFS as part of the fetch call from the Driver.
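The way the execution engine runs a DAG of stages in dependency order, with each stage handing a temporary file to the next, can be sketched as follows. This is a minimal simulation and not Hive's implementation. The stage names, the run_plan helper, and the tmp_files dictionary standing in for HDFS are all assumptions made for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_plan(stages, depends_on):
    """Run every stage after the stages it depends on; return the order used."""
    executed = []
    for name in TopologicalSorter(depends_on).static_order():
        stages[name]()
        executed.append(name)
    return executed


# A dictionary standing in for temporary files on HDFS.
tmp_files = {}


def aggregate():
    # First map/reduce stage: writes its output to a temporary file.
    tmp_files["tmp1"] = [("a", 2), ("b", 1)]


def sort_rows():
    # Second map/reduce stage: reads the previous temporary file.
    tmp_files["tmp2"] = sorted(tmp_files["tmp1"], key=lambda row: row[1])


def move_result():
    # HDFS operation: move the final temporary file to the table's location.
    tmp_files["table_location"] = tmp_files.pop("tmp2")


stages = {
    "aggregate": aggregate,
    "sort_rows": sort_rows,
    "move_result": move_result,
}
# Each stage maps to the set of stages that must finish before it starts.
depends_on = {
    "aggregate": set(),
    "sort_rows": {"aggregate"},
    "move_result": {"sort_rows"},
}

order = run_plan(stages, depends_on)
print(order)
print(tmp_files["table_location"])
```

The move_result stage mirrors the DML case, where the final temporary file is moved to the table's location. For a plain query, the last step would instead read the temporary file and return its rows to the Driver.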
Hive is a data warehousing package built on top of Hadoop and used for analyzing and processing structured and semi-structured data. It provides a flexible query language, HQL, for querying and processing data, and offers features beyond those of an RDBMS, which has certain limitations. Now that you have learned about Apache Hive and its architecture, the next step is to install Apache Hive on Ubuntu so that you can use its functionality.
If you have any questions about this Hive Architecture tutorial, please leave a comment in the section below.