What is Bucketing and Clustering in Hive?

This topic has 2 replies, 1 voice, and was last updated 5 years, 7 months ago by DataFlair Team.

Viewing 2 reply threads

Author

Posts
- September 20, 2018 at 2:15 pm #5103
  
  DataFlair Team
  Spectator
  
  What is Bucketing and Clustering in Hive?
  how it is different from partitioning ?
- September 20, 2018 at 2:16 pm #5106
  
  DataFlair Team
  Spectator
  
  What is Bucketing and Clustering in HIVE?
  
  Bucketing and Clustering is the process in Hive, to decompose table data sets into more manageable parts.
  The bucketing concept is based on HashFunction(Bucketing column) mod No.of Buckets. The bucket number is found by this HashFunction. No. of buckets is mentioned while creating bucket table.
  
  In Hive partitioning, the table is divided into the number of partitions, and these partitions can be further subdivided into more manageable parts known as Buckets/Clusters. Records with the same bucketed column will be stored in the same bucket.
  “clustered by” clause is used to divide the table into buckets. Each bucket will be saved as a file under table directory.
  Bucketing can be done along with partitioning or without partitioning on Hive tables.
  Bucketed tables will create almost equally distributed data file parts. We can also sort the records in each bucket by one or more columns. Since the data files are equally sized parts, map-side joins will be faster on bucketed tables.
  
  The property hive.enforce.bucketing = true enables dynamic bucketing while loading data into the Hive table, and sets the number of reducers equal to the number of buckets specified.
  
  Below is the example to create bucketed table,
  
  Eg: create table bucketed_table (ID int, name varchar(64), state varchar(64), city varchar(64))
  partitioned by (country varchar(64))
  clustered by (state) sorted by (city) into 4 buckets
  row format delimited fields terminated by ‘,’;
  
  Here, for a particular country, each state records will be clustered under a bucket.
  
  Bucketed tables offer efficient sampling than by non-bucketed tables. With sampling, we can try out queries on a fraction
  of data for testing and debugging purpose when the original data sets are very huge.
  
  How is it different from partitioning?
  
  Unlike in partitioning, where tables are divided into partitions via creating a directory for each partition, whereas, in Bucketing, buckets are divided as files.
  In partitioning, tables are not equally partitioned, whereas in Bucketing, buckets are almost equally divided, and the no. of buckets can be specified while creation of bucketed table.
- September 20, 2018 at 2:16 pm #5108
  
  DataFlair Team
  Spectator
  
  Bucketing is the way of dividing table data sets into more manageable parts.It is based on (hash function on the bucketed column) mod (total number of buckets).hash function depends on the type of bucketed column.
  Records with same bucketed column will be stored in same bucket.
  Each bucket is just a file in table directory and bucketing number is 1-based.
  Bucketing can be done along with Partitioning on Hive tables and even without partitioning.
  Bucketed tables will create almost equally distributed data file parts.It offers effiecient sampling than non bucketed tables.
  As the data files are equal sized parts, map-side joins will be faster on bucketed tables than non-bucketed tables. In Map-side join, a mapper processing a bucket of the left table knows that the matching rows in the right table will be in its corresponding bucket, so it only retrieves that bucket (which is a small fraction of all the data stored in the right table).
  Similar to partitioning, bucketed tables provide faster query responses than non-bucketed tables
  
  For example:
  
  CREATE TABLE bucketed_user(
  firstname VARCHAR(64),
  lastname VARCHAR(64),
  address STRING,
  city VARCHAR(64),
  state VARCHAR(64),
  post STRING,
  country VARCHAR(64)
  )
  PARTITIONED BY (country VARCHAR(64))
  CLUSTERED BY (state) SORTED BY (city) INTO 32 BUCKETS
  STORED AS SEQUENCEFILE;
  
  By Setting property “hive.enforce.bucketing = true” we enable the dynamic bucketing while loading the data into hive table.
  It will automatically sets the number of reduce tasks to be equal to the number of buckets mentioned in the table definition.
Author

Posts

Viewing 2 reply threads

You must be logged in to reply to this topic.

What is Bucketing and Clustering in Hive?

About DataFlair

Trending Data Science Courses

Free Big Data Courses

Trending Programming Courses

Trending Web Dev Courses

Trending Courses

Trending Python Courses

Trending Java Courses

Trending DSA Courses