Hive Partitioning Vs. Bucketing

Viewing 1 reply thread
  • Author
    Posts
    • #6242
      DataFlair TeamDataFlair Team
      Spectator

      What is difference between Partitioning and Bucketing in Hive and which technique is used in which situation?

    • #6245
      DataFlair TeamDataFlair Team
      Spectator

      HIVE? is a Data Warehouse infrastructure tool to process the structured data in Hadoop. It is useful in summarising the data and makes querying & analysing easy. It provides SQL dialect called HQL (Hive Query Language) for querying the data that is stored in the hadoop cluster.

      Hive Partitioning and Bucketing both are used to performance optimization.

      Example of Partitioning and Bucketing:
      CREATE TABLE newemployee (
      employee_id int,
      name string,
      city string
      )
      PARTITIONED BY (year STRING, month STRING, day STRING)
      CLUSTERED BY (employee_id) INTO 64 BUCKETS

      Partitioning
      Partitioning is nothing but dividing the data into compartments which helps in organizing data in logical fashion. Partitioning data is often used for distributing load horizontally.

      With reference to Table creation example above when we insert some data into a partition for 2017-08-16. Hive will then store data in a directory hierarchy, such as:

      /user/hive/warehouse/newemployee/y=2017/m=08/d=16

      Bucketing
      Hive Buckets is a technique of decomposing data or decreasing the data into more manageable parts or equal parts.
      So for example as shown above in the newemployee table if we partition by date and bucketing by employee_id. The value of this column will be hashed by a user-defined number into buckets. Records with the same employee_id will always be stored in the same bucket.

      Hive Partitioning Vs. Bucketing

      PARTITIONING

      1. Hive Partitioning is dividing the large amount of data into number pieces of folders based on table columns value.
      2. Partitioning can be done on multiple columns.
      3. For Partitioning in hive we have to use PARTITIONED BY (COL1,COL2…etc) command while hive table creation.
      4. Partitioning can be performed on any number of columns in a table by using hive partition concept.
      5. Partitioning can be performed on Hive Tables like Managed tables or External tables.
      6. Partitioning works better when the cardinality of the partitioning field is not too high. ‘cardinality’ means to the number of possible values a field can have. The cardinality of the field relates to the number of directories that could be created on the file system. As an example, if you partition by employee_id and you have millions of employees, you may end up having millions of directories in your file system.
      7. By using Partitioning we can distribute execution load horizontally.
      8. Partitioning gives better performance and faster execution of queries in case of partition with low volume of data.
      9. By partitioning, we can create multiple small partitions based on column values.

      BUCKETING

      1. Bucketing AKA Clustering, will result in a fixed number of files, since you specify the number of buckets at the time of table creation. Hive takes the field, calculate a hash and assign a record to that bucket.
      2. Bucketing can only be applied on one field
      3. To use bucketing in hive then you should use CLUSTERED BY (Col) command while creating a table in Hive.
      4. Hive bucketing optimization only on one column only not more than one.
      5. We can perform Hive bucketing concept on Hive Managed tables or External tables.
      6. Bucketing works well when the field has high cardinality and data is evenly distributed among buckets. ‘cardinality’ means to the number of possible values a field can have.
      7. Hive bucketing divides the data into number of equal parts.
      8. Bucketing gives better performance and faster execution of queries such as joins with other tables that have exactly the same bucketing
      9. By using bucketing, we restrict the number of buckets to store the data. This number is defined during table creation scripts.

Viewing 1 reply thread
  • You must be logged in to reply to this topic.