Comparison of Hive Internal Tables and External Tables
Curious to know different types of Hive tables and how they are different from each other?
Having covered the basics of Hive tables in Hive Data Models, let us now explore the major differences between Hive internal and external tables.
In this article, we discuss the two different types of Hive table: the internal table (managed table) and the external table. The article then lists the differences between Hive internal tables and external tables. We will also see the different cases in which to use each type of Hive table.
So, let’s start with Hive internal tables and external tables.
Introduction to Hive Internal and External tables
Fundamentally, Hive knows two different types of tables: Internal table and the External table. The Internal table is also known as the managed table.
We can identify whether a table is internal or external using the DESCRIBE FORMATTED table_name statement in Hive, which displays either MANAGED_TABLE or EXTERNAL_TABLE depending on the table type.
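For instance, a quick check (the table name 'demo' is hypothetical):

```sql
-- Show detailed metadata for the table, including its type
DESCRIBE FORMATTED demo;
-- The output includes a line such as:
--   Table Type:  MANAGED_TABLE
-- or, for an external table:
--   Table Type:  EXTERNAL_TABLE
```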
1. Hive Internal Table
Hive owns the data for the internal tables.
It is the default table type in Hive. When a user creates a table in Hive without specifying it as external, an internal table is created by default in a specific location in HDFS.
By default, an internal table is created under the /user/hive/warehouse directory of HDFS. We can override the default location with the LOCATION clause during table creation.
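As a sketch of overriding the default location (table name, columns, and path are illustrative):

```sql
-- Internal table stored at a custom location instead of
-- the default /user/hive/warehouse
CREATE TABLE internal_custom (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/dataflair/custom/internal_custom';
```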
If we drop a managed table or partition, the table data is deleted from HDFS and the metadata associated with that table is removed from the metastore.
2. Hive External Table
Hive does not manage the data of the External table.
We create an external table when we want to use the data outside of Hive as well.
External tables are stored outside the warehouse directory. They can access data stored in sources such as remote HDFS locations or Azure Storage Volumes.
Whenever we drop an external table, only the metadata associated with the table is deleted; the table data remains untouched by Hive.
We can create the external table by specifying the EXTERNAL keyword in the Hive create table statement.
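A minimal sketch of such a statement (columns and path are illustrative):

```sql
-- The EXTERNAL keyword tells Hive not to manage the table data
CREATE EXTERNAL TABLE ext_tbl (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/ext_tbl/';
```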
Difference between Hive Internal and External Table
Let us now see the difference between both Hive tables. The major differences in the internal and external tables in Hive are:
1. LOAD semantics
Load semantics differ between the two table types. Let us see how loading data into an internal table differs from loading data into an external table.
a. Internal Table
When we load data into an internal table, then Hive moves data into the warehouse directory.
For example:
In this example, we create a table 'internaldemo'. When we load data into the 'internaldemo' table, Hive moves the data into the warehouse directory.
Now, we load data into the internal table created above.
On describing the table, we can see that the table data has been moved to the Hive warehouse directory.
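The original terminal screenshots are not reproduced here; a hedged sketch of the statements described above (the column layout and local file path are assumptions):

```sql
-- Create a managed (internal) table
CREATE TABLE internaldemo (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Loading MOVES the data into the warehouse directory,
-- e.g. /user/hive/warehouse/internaldemo/
LOAD DATA LOCAL INPATH '/home/dataflair/emp.txt' INTO TABLE internaldemo;

-- Confirm the table type and its warehouse location
DESCRIBE FORMATTED internaldemo;
```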
b. External Table
With the EXTERNAL keyword, Hive knows that it is not managing the table data, so it does not move the data to its warehouse directory. Hive does not even check whether the external location exists at the time the table is defined.
For example:
In this example, we create an external table, 'external_demo', at the specified location '/home/dataflair/'. On loading data into the external table, Hive does not move the table data to its warehouse directory.
Now, we load data into the external table created above.
On browsing the table, we can see that the table data has not been moved to the Hive warehouse directory; it remains at the location specified when creating the table.
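A hedged sketch of the statements described above (columns and the data file name are assumptions; the location is the one mentioned in the text):

```sql
-- Create an external table at the specified location
CREATE EXTERNAL TABLE external_demo (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/home/dataflair/';

-- Loading places the file under the table's LOCATION,
-- not under the Hive warehouse directory
LOAD DATA LOCAL INPATH '/home/dataflair/emp.txt' INTO TABLE external_demo;
```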
2. DROP semantics
Like load semantics, drop semantics also differ between the two table types. Let us see how dropping an internal table differs from dropping an external table.
a. Internal table
Dropping the internal table will delete the table data, as well as the metadata associated with the table.
For example:
In this example, we drop the managed table 'internaldemo'. This deletes both the table metadata and the table data, and the data is removed from HDFS as well.
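A sketch of the drop and a follow-up filesystem check (paths assume the default warehouse location):

```sql
-- Removes the metastore entry AND the data files in HDFS
DROP TABLE internaldemo;
-- A subsequent check such as
--   hadoop fs -ls /user/hive/warehouse/internaldemo
-- reports that the path no longer exists
```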
b. External table
Dropping the external table will delete only table metadata. The table content remains untouched.
For example:
In this example, we can see that after deleting the external table 'external_demo', the content of the table is still present at its HDFS location.
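A sketch, reusing the external table from the earlier example:

```sql
-- Removes only the metastore entry; the data files stay in place
DROP TABLE external_demo;
-- The data can still be listed at the table's location:
--   hadoop fs -ls /home/dataflair/
```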
3. TRUNCATE support
The TRUNCATE command works only for internal tables.
For example:
In this example, we try to use the TRUNCATE command on the 'external_demo' table. We get an error message saying 'Cannot truncate non-managed table external_demo'.
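A sketch of both cases (table names follow the earlier examples):

```sql
-- Works: internal (managed) table
TRUNCATE TABLE internaldemo;

-- Fails: external table, with an error similar to
--   Cannot truncate non-managed table external_demo
TRUNCATE TABLE external_demo;
```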
4. ACID Support
ACID/transactional operations work only for internal tables; they are not supported for external tables.
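As a hedged sketch of a transactional table (classic Hive ACID also requires hive.support.concurrency=true and hive.txn.manager set to DbTxnManager; the bucketing requirement is relaxed in Hive 3; the table name is hypothetical):

```sql
-- ACID tables must be managed; classic Hive ACID also requires ORC storage
CREATE TABLE acid_demo (id INT, name STRING)
CLUSTERED BY (id) INTO 2 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Row-level DML is then permitted
UPDATE acid_demo SET name = 'renamed' WHERE id = 1;
DELETE FROM acid_demo WHERE id = 2;
```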
5. Query Result Caching
Query result caching, which saves the result of an executed Hive query for reuse by subsequent identical queries, works only for internal tables.
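As an assumption-labelled sketch (the hive.query.results.cache.enabled property applies to Hive 3; 'managed_tbl' is a hypothetical transactional managed table):

```sql
SET hive.query.results.cache.enabled=true;
-- The first run computes and caches the result
SELECT count(*) FROM managed_tbl;
-- An identical repeat query can be answered from the cache;
-- queries over external tables are not cached
SELECT count(*) FROM managed_tbl;
```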
Thus there are some differences between the Hive internal table and the external table.
Let us now see when to use which Hive table.
When to use the Internal and External table?
1. Hive Internal Table
We can use an internal table when:
- Generating temporary tables.
- Hive should manage the lifecycle of the table and its data.
- We do not need the table data after deletion.
2. Hive External Table
We can use an external table when:
- We are not creating the table based on an existing table.
- The data needs to be usable outside of Hive; for example, when the data files are read and processed by an existing program that does not lock the files.
- We do not want the table data to be deleted completely even after a DROP.
- The data should not be owned by Hive.
So, this was all about Hive internal and external tables. I hope you liked our explanation.
Conclusion
After reading this article, we can conclude that Hive tables are of two types: the internal table, also known as the managed table, and the external table.
Internal table data is managed by Hive, while Hive is not responsible for managing the data of an external table.
On deleting a Hive internal table, both the table data and the metadata are deleted; on deleting a Hive external table, only the table metadata is deleted.