Comparison between Hive Internal Tables vs External Tables
Curious to know different types of Hive tables and how they are different from each other?
As discussed the basics of Hive tables in Hive Data Models, let us now explore the major difference between hive internal and external tables.
In this article, we are going to discuss the two different types of Hive Table that are Internal table (Managed table) and External table. The article then enlists the differences between Hive Internal tables and External Tables. We will also see different cases where we can use these Hive tables.
So, let’s start with Hive internal tables and external tables.
Don't become Obsolete & get a Pink Slip
Follow DataFlair on Google News & Stay ahead of the game
Introduction to Hive Internal and External tables
Fundamentally, Hive knows two different types of tables: Internal table and the External table. The Internal table is also known as the managed table.
We can identify the internal or External tables using the
DESCRIBE FORMATTED table_name statement in the Hive, which will display either
EXTERNAL_TABLE depending on the table type.
1. Hive Internal Table
Hive owns the data for the internal tables.
It is the default table in Hive. When the user creates a table in Hive without specifying it as external, then by default, an internal table gets created in a specific location in HDFS.
By default, an internal table will be created in a folder path similar to /user/hive/warehouse directory of HDFS. We can override the default location by the location property during table creation.
If we drop the managed table or partition, the table data and the metadata associated with that table will be deleted from the HDFS.
2. Hive External Table
Hive does not manage the data of the External table.
We create an external table for external use as when we want to use the data outside the Hive.
External tables are stored outside the warehouse directory. They can access data stored in sources such as remote HDFS locations or Azure Storage Volumes.
Whenever we drop the external table, then only the metadata associated with the table will get deleted, the table data remains untouched by Hive.
We can create the external table by specifying the EXTERNAL keyword in the Hive create table statement.
Difference between Hive Internal and External Table
Let us now see the difference between both Hive tables. The major differences in the internal and external tables in Hive are:
1. LOAD semantics
The Load semantics varies in both the tables. Let us see the difference in load semantics between the internal table and the external table.
a. Internal Table
When we load data into an internal table, then Hive moves data into the warehouse directory.
Here in this example, we are creating a table ‘internaldemo’. When we load data into the ‘internaldemo’ table, then Hive moves the data to the warehouse directory.
Learn Apache Hive Installation on Ubuntu to run Hive queries.
b. External Table
With the EXTERNAL keyword, Hive knows that it is not managing the table data, so it does not move data to its warehouse directory. Hive does not even check whether the external location at the time it is defined exists or not.
In this example, we are creating an external table, ‘external_demo’ on the specified location that is ‘/home/dataflair/’. On loading the data into the external table, the Hive does not move table data to its warehouse directory.
2. DROP semantics
Like load semantics, drop semantics also varies in both the tables. Let us see the difference in drop semantics between the internal table and the external table.
a. Internal table
Dropping the internal table will delete the table data, as well as the metadata associated with the table.
In this example, we are dropping the managed table ‘internaldemo’. You can see that it will delete both the table metadata as well as the table data.
Table data also gets deleted from the HDFS.
b. External table
Dropping the external table will delete only table metadata. The table content remains untouched.
In this example, we can see that on deleting the external table ‘external demo’, the content of the table is still present in the HDFS location.
3. TRUNCATE support
The TRUNCATE command only works for the internal table.
In this example, we are trying to use the truncate command with ‘externaldemo’ table. We will get an error message that says ‘Cannot truncate non-managed table externaldemo’.
4. ACID Support
ACID/transactional works only for the internal table. They do not work for External Table.
5. Query Result Caching
Query Result Caching that saves the results of an executed Hive query for reuse on subsequent queries only works for the internal table.
Thus these are some differences between the Hive internal table and the external table.
Let us now see when to use which Hive table.
When to use Internal and External table?
1. Hive Internal Table
We can use the internal table in cases:
- When generating temporary tables.
- When required that Hive should manage the lifecycle of the table.
- And when we don’t want table data after deletion.
2. Hive External Table
We can use the external table in cases:
- When we are not creating the table based on the existing table.
- When required to use data outside of Hive. For example, the data files are read and processed by an existing program that does not lock the files.
- When we don’t want to delete the table data completely even after DROP.
- When the data should not be own by Hive.
So, this was all in Hive internal and external tables. I hope you like our explanation.
After reading the article, we can conclude that the Hive tables are of two types- internal table, also known as the managed table, and the External table.
The internal table data gets managed by the Hive. Hive is not responsible for managing data of the External table.
On deleting the Hive internal table, the table data and the metadata both get deleted and on deleting the Hive External table, only the table metadata will delete.
Any queries while working with hive internal and external tables? Ask below.