Comparison between Hive Internal Tables vs External Tables

Keeping you updated with latest technology trends, Join DataFlair on Telegram

1. Objective

In our previous Apache Hive blog, we have discussed Hive Data ModelsTable, Partition, Bucket in detail. In this blog, we are going to discuss the two types of Hive Table such as Internal Table (Managed Table) and External table. At last, we will also cover the difference between Hive Internal tables vs External Tables in this Hive tutorial.

So, let’s start Hive Internal Tables vs External Tables.

Comparison between Hive Internal Tables vs External Tables

Comparison between Hive Internal Tables vs External Tables

Learn Apache Hive Installation on Ubuntu from Industry Experts.

2. Apache Hive Internal and External Tables

Hive is an open source data warehouse system used for querying and analyzing large datasets. Data in Apache Hive can be categorized into Table, Partition, and Bucket. The table in Hive is logically made up of the data being stored. Hive has two types of tables which are as follows:

  • Managed Table (Internal Table)
  • External Table

Hive Managed Tables-
It is also know an internal table. When we create a table in Hive, it by default manages the data. This means that Hive moves the data into its warehouse directory.
Hive External Tables-
We can also create an external table. It tells Hive to refer to the data that is at an existing location outside the warehouse directory.
Let’s now discuss the Hive Internal tables vs External tables comparison.

Hadoop Quiz
If these professionals can make a switch to Big Data, so can you:
Rahul Doddamani Story - DataFlair
Rahul Doddamani
Java → Big Data Consultant, JDA
Follow on
Mritunjay Singh Success Story - DataFlair
Mritunjay Singh
PeopleSoft → Big Data Architect, Hexaware
Follow on
Rahul Doddamani Success Story - DataFlair
Rahul Doddamani
Big Data Consultant, JDA
Follow on
I got placed, scored 100% hike, and transformed my career with DataFlair
Enroll now
Richa Tandon Success Story - DataFlair
Richa Tandon
Support → Big Data Engineer, IBM
Follow on
DataFlair Web Services
You could be next!
Enroll now

3. Featured Difference between Hive Internal Tables vs External Tables

Here we are going to cover the comparison between Hive Internal tables vs External tables on the basis of different features. Let’s discuss them one by one-

i. LOAD and DROP Semantics

We can see the main difference between the two table type in the LOAD and DROP semantics.

  • Managed Tables – When we load data into a Managed table, then Hive moves data into Hive warehouse directory.

For example:

CREATE TABLE managed_table (dummy STRING);
LOAD DATA INPATH '/user/tom/data.txt' INTO table managed_table;

This moves the file hdfs://user/tom/data.txt into Hive’s warehouse directory for the managed_table table, which is hdfs://user/hive/warehouse/managed_table.
Further, if we drop the table using:
DROP TABLE managed_table
Then this will delete the table metadata including its data. The data no longer exists anywhere. This is what it means for HIVE to manage the data.

  • External Tables – External table behaves differently. In this, we can control the creation and deletion of the data. The location of the external data is specified at the table creation time:

CREATE EXTERNAL TABLE external_table(dummy STRING)
LOCATION '/user/tom/external_table';
LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;

Now, with the EXTERNAL keyword, Apache Hive knows that it is not managing the data. So it doesn’t move data to its warehouse directory. It does not even check whether the external location exists at the time it is defined. This very useful feature because it means we create the data lazily after creating the table.
The important thing to notice is that when we drop an external table, Hive will leave the data untouched and only delete the metadata.

ii. Security

  • Managed Tables – Hive solely controls the Managed table security. Within Hive, security needs to be managed; probably at the schema level (depends on organization).
  • External Tables – These tables’ files are accessible to anyone who has access to HDFS file structure. So, it needs to manage security at the HDFS file/folder level.

iii. When to use Managed and external table

Use Managed table when –

  • We want Hive to completely manage the lifecycle of the data and table.
  • Data is temporary

Use External table when –

  • Data is used outside of Hive. For example, the data files are read and processed by an existing program that does not lock the files.
  • We are not creating a table based on the existing table.
  • We need data to remain in the underlying location even after a DROP TABLE. This may apply if we are pointing multiple schemas at a single data set.
  • The hive shouldn’t own data and control settings, directories etc., we may have another program or process that will do these things.

So, this was all in Hive Internal Tables vs External Tables. Hope you like our explanation.

4. Conclusion

In conclusion, Managed tables are like normal database table in which we can store data and query on. On dropping Managed tables, the data stored in them is also deleted and data is lost forever. While dropping External tables will delete metadata but not the data.
See Also-

Reference for Hive

4 Responses

  1. Babu Venkat says:

    Can i retrieve managed table data after deleting?

  2. Mohit Jain says:

    Dear Sir, I have 2 queries which are as follows :-

    If I alter the default Location(/user/hive/warehouse/) of Internal Table by using “Location” keyword with path, then that newly altered table will be treated/behave as Internal Table or External Table?

    • Rahul says:

      If you change the location of internal table (managed table), then also it will be treated as internal table.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.