Bucket Map Join in Hive – Tips & Working

1. Objective

In the last article, we discuss Map Side Join in Hive. Basically, while the tables are large and all the tables used in the join are bucketed on the join columns we use a Bucket Map Join in Hive. In this article, we will cover the whole concept of Apache Hive Bucket Map Join. It also includes use cases, disadvantages, and Bucket Map Join example which will enhance our knowledge.

What is Bucket Map Join in Hive

What is Bucket Map Join in Hive

2. Introduction to Bucket Map Join

In Apache Hive, while the tables are large and all the tables used in the join are bucketed on the join columns we use Hive Bucket Map Join feature. Moreover, one table should have buckets in multiples of the number of buckets in another table in this type of join.
Let’s revise Bucketing in Hive

Bucket Map Join

How Bucket Map Join Works

Let’s understand with an example. For suppose if one table has 2 buckets then the other table must have either 2 buckets or a multiple of 2 buckets (2, 4, 6, and so on). Further, since the preceding condition is satisfied then the joining can be done on the mapper side only. Else a normal inner join is performed. Therefore, it implies that only the required buckets are fetched on the mapper side and not the complete table.
Hence, onto each mapper, only the matching buckets of all small tables are replicated. As a result of this, the efficiency of the query improves drastically. However, make sure data does not sort in a bucket map join.

Also, note that by default Hive does not support a bucket map join. So, we need to set the following property  to true for the query to work as this join:

set hive.optimize.bucketmapjoin = true
Let’s See Difference between Partition Vs Bucketing in Hive 

Get the most demanding skills of IT Industry - Learn Hadoop

3. How does it work in Hive?

Basically, Join is done in Mapper only.  However, let’s understand it in this way, the mapper processing bucket 1 for table A will only fetch bucket 1 of table B.

4. Use Case of Bucket Map Join

To be more specific we use this feature with several scenarios. Like:
i. While all the tables are large.
ii. Also, while all tables bucketed using the join columns.
iii. Moreover, while the number of buckets in one table is a multiple of the number of buckets in the other table.
iii. Also, when all the tables do not sort.

5. Disadvantages of Bucket Map Join in Hive

The major disadvantage of using Bucket Map Join is, here tables need to be bucketed in the same way how the SQL joins. That implies we can not use it for other types of SQLs.
Let’s learn HiveQL SELECT Statement in detail

Hadoop Quiz

6. Tips on Bucket Map Join 

i. At first, it is very important that the tables are created bucketed on the same join columns. Also, it is important to bucket data while inserting.
However, one of the ways is to set “hive.enforce.bucketing=true” before inserting data.
For example:
create table b1(col0 string,col1 string,col2 string,col3 string,col4 string,col5 string,col6 string)
clustered by (col0) into 32 buckets;
create table b2(col0 string,col1 string,col2 string,col3 string,col4 string,col5 string,col6 string)
clustered by (col0) into 8 buckets;

set hive.enforce.bucketing = true;
From passwords insert OVERWRITE  table b1 select * limit 10000;
From passwords insert OVERWRITE  table b2 select * limit 10000;
ii. Also, it is must to set hive.optimize.bucketmapjoin to true.
set hive.optimize.bucketmapjoin=true;
select /*+ MAPJOIN(b2) */ b1.* from b1,b2 where b1.col0=b2.col0;
Read about Comparison between Hive Internal Tables vs External Tables

7. Conclusion

As a result, we have seen the complete content regarding Apache Hive Bucket Map Join feature, Bucket Map Join example, use cases, Working, and Disadvantages of Bucket Map Join. In next article, we will see Skew Join in Hive. Although, if any query arises, please ask in a comment section.
Related Topic- Difference Between Hive and Impala
For reference

3 Responses

  1. amit says:

    Very well explained

    • Data Flair says:

      Thank You, Amit for commenting on “Bucket Map Join in Hive”. We hope this Apache Hive Tutorial, helps you to solve your queries.
      Check out more Hive Tutorials to learn Hive.
      Regards,
      Data-Flair

  2. Kartik Gumber says:

    Hi. Can you explain the scenario where my record for B table is in the 2nd bucket and the corresponding join id are present in the 1st bucket of table A and C?
    As mentioned above an only matching bucket of all small tables are replicated on the mapper. How it will be done?

Leave a Reply

Your email address will not be published. Required fields are marked *

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.