What is Bucketing and Clustering in Hive?

Free Online Certification Courses – Learn Today. Lead Tomorrow. Forums Apache Hadoop What is Bucketing and Clustering in Hive?

Viewing 2 reply threads
  • Author
    Posts
    • #5103
      DataFlair TeamDataFlair Team
      Spectator

      What is Bucketing and Clustering in Hive?
      how it is different from partitioning ?

    • #5106
      DataFlair TeamDataFlair Team
      Spectator

      What is Bucketing and Clustering in HIVE?

      Bucketing and Clustering is the process in Hive, to decompose table data sets into more manageable parts.
      The bucketing concept is based on HashFunction(Bucketing column) mod No.of Buckets. The bucket number is found by this HashFunction. No. of buckets is mentioned while creating bucket table.

      In Hive partitioning, the table is divided into the number of partitions, and these partitions can be further subdivided into more manageable parts known as Buckets/Clusters. Records with the same bucketed column will be stored in the same bucket.
      “clustered by” clause is used to divide the table into buckets. Each bucket will be saved as a file under table directory.
      Bucketing can be done along with partitioning or without partitioning on Hive tables.
      Bucketed tables will create almost equally distributed data file parts. We can also sort the records in each bucket by one or more columns. Since the data files are equally sized parts, map-side joins will be faster on bucketed tables.

      The property hive.enforce.bucketing = true enables dynamic bucketing while loading data into the Hive table, and sets the number of reducers equal to the number of buckets specified.

      Below is the example to create bucketed table,

      Eg: create table bucketed_table (ID int, name varchar(64), state varchar(64), city varchar(64))
      partitioned by (country varchar(64))
      clustered by (state) sorted by (city) into 4 buckets
      row format delimited fields terminated by ‘,’;

      Here, for a particular country, each state records will be clustered under a bucket.

      Bucketed tables offer efficient sampling than by non-bucketed tables. With sampling, we can try out queries on a fraction
      of data for testing and debugging purpose when the original data sets are very huge.

      How is it different from partitioning?

      Unlike in partitioning, where tables are divided into partitions via creating a directory for each partition, whereas, in Bucketing, buckets are divided as files.
      In partitioning, tables are not equally partitioned, whereas in Bucketing, buckets are almost equally divided, and the no. of buckets can be specified while creation of bucketed table.

    • #5108
      DataFlair TeamDataFlair Team
      Spectator

      Bucketing is the way of dividing table data sets into more manageable parts.It is based on (hash function on the bucketed column) mod (total number of buckets).hash function depends on the type of bucketed column.
      Records with same bucketed column will be stored in same bucket.
      Each bucket is just a file in table directory and bucketing number is 1-based.
      Bucketing can be done along with Partitioning on Hive tables and even without partitioning.
      Bucketed tables will create almost equally distributed data file parts.It offers effiecient sampling than non bucketed tables.
      As the data files are equal sized parts, map-side joins will be faster on bucketed tables than non-bucketed tables. In Map-side join, a mapper processing a bucket of the left table knows that the matching rows in the right table will be in its corresponding bucket, so it only retrieves that bucket (which is a small fraction of all the data stored in the right table).
      Similar to partitioning, bucketed tables provide faster query responses than non-bucketed tables

      For example:

      CREATE TABLE bucketed_user(
      firstname VARCHAR(64),
      lastname VARCHAR(64),
      address STRING,
      city VARCHAR(64),
      state VARCHAR(64),
      post STRING,
      country VARCHAR(64)
      )
      PARTITIONED BY (country VARCHAR(64))
      CLUSTERED BY (state) SORTED BY (city) INTO 32 BUCKETS
      STORED AS SEQUENCEFILE;

      By Setting property “hive.enforce.bucketing = true” we enable the dynamic bucketing while loading the data into hive table.
      It will automatically sets the number of reduce tasks to be equal to the number of buckets mentioned in the table definition.

Viewing 2 reply threads
  • You must be logged in to reply to this topic.