Hive Partitioning Vs. Bucketing

This topic has 1 reply, 1 voice, and was last updated 5 years, 6 months ago by DataFlair Team.

Viewing 1 reply thread

Author

Posts
- September 20, 2018 at 5:27 pm #6242
  
  DataFlair Team
  Spectator
  
  What is difference between Partitioning and Bucketing in Hive and which technique is used in which situation?
- September 20, 2018 at 5:28 pm #6245
  
  DataFlair Team
  Spectator
  
  HIVE? is a Data Warehouse infrastructure tool to process the structured data in Hadoop. It is useful in summarising the data and makes querying & analysing easy. It provides SQL dialect called HQL (Hive Query Language) for querying the data that is stored in the hadoop cluster.
  
  Hive Partitioning and Bucketing both are used to performance optimization.
  
  Example of Partitioning and Bucketing:
  CREATE TABLE newemployee (
  employee_id int,
  name string,
  city string
  )
  PARTITIONED BY (year STRING, month STRING, day STRING)
  CLUSTERED BY (employee_id) INTO 64 BUCKETS
  
  Partitioning
  Partitioning is nothing but dividing the data into compartments which helps in organizing data in logical fashion. Partitioning data is often used for distributing load horizontally.
  
  With reference to Table creation example above when we insert some data into a partition for 2017-08-16. Hive will then store data in a directory hierarchy, such as:
  
  /user/hive/warehouse/newemployee/y=2017/m=08/d=16
  
  Bucketing
  Hive Buckets is a technique of decomposing data or decreasing the data into more manageable parts or equal parts.
  So for example as shown above in the newemployee table if we partition by date and bucketing by employee_id. The value of this column will be hashed by a user-defined number into buckets. Records with the same employee_id will always be stored in the same bucket.
  
  Hive Partitioning Vs. Bucketing
  
  PARTITIONING
  
  1. Hive Partitioning is dividing the large amount of data into number pieces of folders based on table columns value.
  2. Partitioning can be done on multiple columns.
  3. For Partitioning in hive we have to use PARTITIONED BY (COL1,COL2…etc) command while hive table creation.
  4. Partitioning can be performed on any number of columns in a table by using hive partition concept.
  5. Partitioning can be performed on Hive Tables like Managed tables or External tables.
  6. Partitioning works better when the cardinality of the partitioning field is not too high. ‘cardinality’ means to the number of possible values a field can have. The cardinality of the field relates to the number of directories that could be created on the file system. As an example, if you partition by employee_id and you have millions of employees, you may end up having millions of directories in your file system.
  7. By using Partitioning we can distribute execution load horizontally.
  8. Partitioning gives better performance and faster execution of queries in case of partition with low volume of data.
  9. By partitioning, we can create multiple small partitions based on column values.
  
  BUCKETING
  
  1. Bucketing AKA Clustering, will result in a fixed number of files, since you specify the number of buckets at the time of table creation. Hive takes the field, calculate a hash and assign a record to that bucket.
  2. Bucketing can only be applied on one field
  3. To use bucketing in hive then you should use CLUSTERED BY (Col) command while creating a table in Hive.
  4. Hive bucketing optimization only on one column only not more than one.
  5. We can perform Hive bucketing concept on Hive Managed tables or External tables.
  6. Bucketing works well when the field has high cardinality and data is evenly distributed among buckets. ‘cardinality’ means to the number of possible values a field can have.
  7. Hive bucketing divides the data into number of equal parts.
  8. Bucketing gives better performance and faster execution of queries such as joins with other tables that have exactly the same bucketing
  9. By using bucketing, we restrict the number of buckets to store the data. This number is defined during table creation scripts.
Author

Posts

Viewing 1 reply thread

You must be logged in to reply to this topic.

Hive Partitioning Vs. Bucketing

About DataFlair

Trending Data Science Courses

Free Big Data Courses

Trending Programming Courses

Trending Web Dev Courses

Trending Courses

Trending Python Courses

Trending Java Courses

Trending DSA Courses