Site icon DataFlair

Comparison between Hive Internal Tables vs External Tables

Curious to know different types of Hive tables and how they are different from each other?

As discussed the basics of Hive tables in Hive Data Models, let us now explore the major difference between hive internal and external tables.

In this article, we are going to discuss the two different types of Hive Table that are Internal table (Managed table) and External table. The article then enlists the differences between Hive Internal tables and External Tables. We will also see different cases where we can use these Hive tables.

So, let’s start with Hive internal tables and external tables.

Introduction to Hive Internal and External tables

 

Fundamentally, Hive knows two different types of tables: Internal table and the External table. The Internal table is also known as the managed table.

We can identify the internal or External tables using the DESCRIBE FORMATTED table_name statement in the Hive, which will display either MANAGED_TABLE or EXTERNAL_TABLE depending on the table type.

1. Hive Internal Table

Hive owns the data for the internal tables.

It is the default table in Hive. When the user creates a table in Hive without specifying it as external, then by default, an internal table gets created in a specific location in HDFS.

Technology is evolving rapidly!
Stay updated with DataFlair on WhatsApp!!

By default, an internal table will be created in a folder path similar to /user/hive/warehouse directory of HDFS. We can override the default location by the location property during table creation.

If we drop the managed table or partition, the table data and the metadata associated with that table will be deleted from the HDFS.

2. Hive External Table

Hive does not manage the data of the External table.

We create an external table for external use as when we want to use the data outside the Hive.

External tables are stored outside the warehouse directory. They can access data stored in sources such as remote HDFS locations or Azure Storage Volumes.

Whenever we drop the external table, then only the metadata associated with the table will get deleted, the table data remains untouched by Hive.

We can create the external table by specifying the EXTERNAL keyword in the Hive create table statement.

Difference between Hive Internal and External Table

Let us now see the difference between both Hive tables. The major differences in the internal and external tables in Hive are:

1. LOAD semantics

The Load semantics varies in both the tables. Let us see the difference in load semantics between the internal table and the external table.

a. Internal Table

When we load data into an internal table, then Hive moves data into the warehouse directory.

For example:

Here in this example, we are creating a table ‘internaldemo’. When we load data into the ‘internaldemo’ table, then Hive moves the data to the warehouse directory.

Now, loading data into the internal table created above.

On describing the table, we see that the table data is moved to the Hive warehouse directory.

b. External Table

With the EXTERNAL keyword, Hive knows that it is not managing the table data, so it does not move data to its warehouse directory. Hive does not even check whether the external location at the time it is defined exists or not.

For example:

In this example, we are creating an external table, ‘external_demo’ on the specified location that is ‘/home/dataflair/’. On loading the data into the external table, the Hive does not move table data to its warehouse directory.

Now, loading data to the external table created above.

On browsing the table, we can see that the Hive table data is not moved to the Hive warehouse directory. It is stored in the location specified while creating a table.


2. DROP semantics

Like load semantics, drop semantics also varies in both tables. Let us see the difference in drop semantics between the internal table and the external table.

a. Internal table

Dropping the internal table will delete the table data, as well as the metadata associated with the table.

For example:

In this example, we are dropping the managed table ‘internaldemo’. You can see that it will delete both the table metadata as well as the table data.

Table data also gets deleted from the HDFS.

b. External table

Dropping the external table will delete only table metadata. The table content remains untouched.

For example:

In this example, we can see that on deleting the external table ‘external demo’, the content of the table is still present in the HDFS location.

3. TRUNCATE support

The TRUNCATE command only works for the internal table.

For example:

In this example, we are trying to use the truncate command with ‘externaldemo’ table. We will get an error message that says ‘Cannot truncate non-managed table externaldemo’.

4. ACID Support

ACID/transactional works only for the internal table. They do not work for External Table.

5. Query Result Caching

Query Result Caching that saves the results of an executed Hive query for reuse on subsequent queries only works for the internal table.

Thus there are some differences between the Hive internal table and the external table.

Let us now see when to use which Hive table.

When to use the Internal and External table?

1. Hive Internal Table

We can use the internal table in cases:

2. Hive External Table

We can use the external table in cases:

So, this was all in Hive internal and external tables. I hope you like our explanation.

Conclusion

After reading the article, we can conclude that the Hive tables are of two types- internal table, also known as the managed table, and the External table.

The internal table data gets managed by the Hive. Hive is not responsible for managing data of the External table.

On deleting the Hive internal table, the table data and the metadata both get deleted and on deleting the Hive External table, only the table metadata will delete.

Exit mobile version