

{"id":3927,"date":"2017-09-04T13:12:58","date_gmt":"2017-09-04T13:12:58","guid":{"rendered":"http:\/\/data-flair.training\/blogs\/?p=3927"},"modified":"2017-09-04T13:12:58","modified_gmt":"2017-09-04T13:12:58","slug":"hive-data-model","status":"publish","type":"post","link":"https:\/\/data-flair.training\/blogs\/hive-data-model\/","title":{"rendered":"Hive Data Model &#8211; Learn to Develop Data Models in Hive"},"content":{"rendered":"<p>We have already explained <strong>what Hive is<\/strong> in our previous article. In this article, we answer two questions: what is the Hive data model, and what are the different data units it is built from?<\/p>\n<p>Our hope is that after reading this article, you will have a clear understanding of the Hive data model and of why each of its parts matters.<\/p>\n<h2>Introduction to Hive Data Model<\/h2>\n<p><strong>Hive<\/strong> is an open source data warehouse system built on top of Hadoop, used for querying and analyzing large datasets stored in Hadoop files. It processes structured and semi-structured data in Hadoop. 
Data in Apache Hive can be categorized into:<\/p>\n<ul>\n<li>Table<\/li>\n<li>Partition<\/li>\n<li>Bucket<\/li>\n<\/ul>\n<p>Let us now look at each of these units in turn.<\/p>\n<div id=\"attachment_3929\" style=\"width: 812px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/09\/apache-hive-data-model.jpg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-3929\" class=\"wp-image-3929 size-full\" src=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/09\/apache-hive-data-model.jpg\" alt=\"Apache Hive Data models\" width=\"802\" height=\"420\" srcset=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/09\/apache-hive-data-model.jpg 802w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/09\/apache-hive-data-model-150x79.jpg 150w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/09\/apache-hive-data-model-300x157.jpg 300w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/09\/apache-hive-data-model-768x402.jpg 768w\" sizes=\"auto, (max-width: 802px) 100vw, 802px\" \/><\/a><p id=\"caption-attachment-3929\" class=\"wp-caption-text\">Apache Hive Data Model<\/p><\/div>\n<h3>i. Table<\/h3>\n<p><strong>Apache Hive tables<\/strong> are similar to the tables in a relational database. A Hive table is logically made up of the data being stored and the associated metadata, which describes the layout of the data in the table. We can perform <em>filter, project, join<\/em> and <em>union<\/em> operations on tables.<\/p>\n<p>The data typically resides in <strong>HDFS<\/strong>, although it may reside in any Hadoop filesystem, including the local filesystem or <strong>S3<\/strong>. Hive stores the metadata, however, in a relational database and not in HDFS. 
Hive has two types of tables:<\/p>\n<ul>\n<li>Managed Table<\/li>\n<li>External Table<\/li>\n<\/ul>\n<p>When we create a table in Hive, Hive manages the data by default, which means that Hive moves the data into its warehouse directory. Alternatively, we can create an external table, which tells Hive to refer to data at an existing location outside the warehouse directory.<\/p>\n<p>We can see the difference between the two table types in the <strong>LOAD<\/strong> and <strong>DROP<\/strong> semantics. Let us consider a managed table first.<\/p>\n<h4>a. Managed Table<\/h4>\n<p>When we load data into a <strong>managed table<\/strong>, Hive moves the data into its warehouse directory.<br \/>\n<strong>For example:<\/strong><br \/>\n[php]CREATE TABLE managed_table (dummy STRING);<br \/>\nLOAD DATA INPATH '\/user\/tom\/data.txt' INTO TABLE managed_table;[\/php]<br \/>\nThis will move the file <em><strong>hdfs:\/\/user\/tom\/data.txt<\/strong><\/em> into Hive\u2019s warehouse directory for the managed_table table, which is <em><strong>hdfs:\/\/user\/hive\/warehouse\/managed_table<\/strong><\/em>.<\/p>\n<p>Further, if we drop the table using:<br \/>\n[php]DROP TABLE managed_table;[\/php]<\/p>\n<p>then the table is deleted, including its data and metadata. It bears repeating that since the initial LOAD performed a move operation and the DROP performed a delete operation, the data no longer exists anywhere. This is what it means for Hive to manage the data.<\/p>\n<h4><strong>b. External Table<\/strong><\/h4>\n<p>The <strong>external table<\/strong> in Hive behaves differently. We control the creation and deletion of the data ourselves. 
The location of the external data is specified at table creation time:<\/p>\n<p>[php]CREATE EXTERNAL TABLE external_table (dummy STRING)<br \/>\nLOCATION '\/user\/tom\/external_table';<br \/>\nLOAD DATA INPATH '\/user\/tom\/data.txt' INTO TABLE external_table;[\/php]<\/p>\n<p>Because of the EXTERNAL keyword, Apache Hive knows that it is not managing the data, so it does not move the data to its warehouse directory. It does not even check whether the external location exists at the time the table is defined. This is a very useful feature because it means we can create the data lazily after creating the table.<\/p>\n<p>An important thing to notice is that when we <em>drop<\/em> an external table, Hive will leave the data untouched and only delete the metadata.<\/p>\n<h3>ii. Partition<\/h3>\n<div id=\"attachment_3931\" style=\"width: 1546px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/09\/hive-data-model.jpg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-3931\" class=\"wp-image-3931 size-full\" src=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/09\/hive-data-model.jpg\" alt=\"Apache Hive Data Model - Table, Partitions, Bucket\" width=\"1536\" height=\"864\" srcset=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/09\/hive-data-model.jpg 1536w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/09\/hive-data-model-150x84.jpg 150w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/09\/hive-data-model-300x169.jpg 300w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/09\/hive-data-model-768x432.jpg 768w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/09\/hive-data-model-1024x576.jpg 1024w\" sizes=\"auto, (max-width: 
1536px) 100vw, 1536px\" \/><\/a><p id=\"caption-attachment-3931\" class=\"wp-caption-text\">Hive Data Model<\/p><\/div>\n<p>Apache Hive organizes tables into <strong>partitions<\/strong>, which group the same type of data together based on a column called the partition key. Each table in Hive can have one or more partition keys to identify a particular partition. Partitions also make queries on slices of the data faster.<\/p>\n<p><strong>Command:<\/strong><\/p>\n<p>[php]CREATE TABLE table_name (column1 data_type, column2 data_type) PARTITIONED BY (partition1 data_type, partition2 data_type,\u2026.);[\/php]<\/p>\n<p>An example makes partitioning easier to understand. Suppose you have a table student_details containing student information for an engineering college, such as student_id, name, department, and year. If you partition the table on the department column, the information for all the students belonging to a particular department is stored together in one partition. Physically, a partition in Hive is just a sub-directory in the table directory.<\/p>\n<p>Suppose you have data for three departments in the student_details table: EEE, ECE and ME. You will then have three partitions in total, one for each department, as shown in the diagram above. For each department, all the data for that department resides in a separate sub-directory under the table directory.<\/p>\n<p>For example, all the student data for the EEE department will be stored in user\/hive\/warehouse\/student_details\/dept=EEE (assuming the partition column is named dept). Queries about EEE students then only have to look through the data present in the EEE partition.<\/p>\n<p>Hence, from the above example, you can see that partitioning is very useful. 
It reduces query latency by scanning only the <strong>relevant<\/strong> partitions instead of the whole data set. In real-world implementations, you might be dealing with hundreds of TBs of data, so imagine scanning all of it for a query where <strong>95%<\/strong> of the data scanned is irrelevant to the query.<\/p>\n<h3>iii. Buckets<\/h3>\n<p>In Hive, tables or partitions are subdivided into <strong>buckets<\/strong> based on the hash of a column in the table. This gives the data extra structure that may be used for more efficient queries.<\/p>\n<p><strong>Command:<\/strong><\/p>\n<p>[php]CREATE TABLE table_name (column1 data_type, column2 data_type, \u2026) PARTITIONED BY (partition1 data_type, partition2 data_type,\u2026.) CLUSTERED BY (column_name1, column_name2, \u2026) [SORTED BY (column_name [ASC|DESC], \u2026)] INTO num_buckets BUCKETS;[\/php]<\/p>\n<p>Each bucket in Hive is just a file in the table directory (for an unpartitioned table) or in the partition directory. If you choose to divide the partitions into n buckets, you will have n files in each of your partition directories. In the diagram above, each partition is bucketed into 2 buckets, so each partition, say EEE, has two files, each storing part of the EEE students\u2019 data.<\/p>\n<h2>Conclusion<\/h2>\n<p>In conclusion, data in Hive can be categorized into three units at a granular level: Table, Partition, and Bucket. A Hive table is made up of the data stored in it and the metadata that describes it. Hive organizes tables into partitions and then subdivides partitions into buckets.<\/p>\n<p>We hope that this article helps you understand what the Hive data model is and how it is used in Hadoop. 
If you think we missed something, then please let us know in the comments below.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>We have already explained what is Hive in our previous article. In this article, we will do our best to answer questions like what is the Hive data Model, what are the different types&#46;&#46;&#46;<\/p>\n","protected":false},"author":7,"featured_media":42107,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[26],"tags":[3388,5686,5698,5765,5794,5795],"class_list":["post-3927","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-hive","tag-data-models-in-hive","tag-hive-buckets","tag-hive-data-model","tag-hive-partitions","tag-hive-tables","tag-hive-tutorial"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.4 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Hive Data Model - Learn to Develop Data Models in Hive - DataFlair<\/title>\n<meta name=\"description\" content=\"Apache Hive Data Model cover Table, partition, bucket in Hive.Type of Tables-Hive managed table,Hive external table and Commands to create Hive data models\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/data-flair.training\/blogs\/hive-data-model\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Hive Data Model - Learn to Develop Data Models in Hive - DataFlair\" \/>\n<meta property=\"og:description\" content=\"Apache Hive Data Model cover Table, partition, bucket in Hive.Type of Tables-Hive managed table,Hive external table and Commands to create Hive data models\" \/>\n<meta property=\"og:url\" content=\"https:\/\/data-flair.training\/blogs\/hive-data-model\/\" \/>\n<meta 
property=\"og:site_name\" content=\"DataFlair\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DataFlairWS\/\" \/>\n<meta property=\"article:published_time\" content=\"2017-09-04T13:12:58+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/09\/Hive-Data-Model-01-1.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1200\" \/>\n\t<meta property=\"og:image:height\" content=\"628\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"DataFlair Team\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@DataFlairWS\" \/>\n<meta name=\"twitter:site\" content=\"@DataFlairWS\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"DataFlair Team\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"6 minutes\" \/>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Hive Data Model - Learn to Develop Data Models in Hive - DataFlair","description":"Apache Hive Data Model cover Table, partition, bucket in Hive.Type of Tables-Hive managed table,Hive external table and Commands to create Hive data models","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/data-flair.training\/blogs\/hive-data-model\/","og_locale":"en_US","og_type":"article","og_title":"Hive Data Model - Learn to Develop Data Models in Hive - DataFlair","og_description":"Apache Hive Data Model cover Table, partition, bucket in Hive.Type of Tables-Hive managed table,Hive external table and Commands to create Hive data models","og_url":"https:\/\/data-flair.training\/blogs\/hive-data-model\/","og_site_name":"DataFlair","article_publisher":"https:\/\/www.facebook.com\/DataFlairWS\/","article_published_time":"2017-09-04T13:12:58+00:00","og_image":[{"width":1200,"height":628,"url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/09\/Hive-Data-Model-01-1.jpg","type":"image\/jpeg"}],"author":"DataFlair Team","twitter_card":"summary_large_image","twitter_creator":"@DataFlairWS","twitter_site":"@DataFlairWS","twitter_misc":{"Written by":"DataFlair Team","Est. 
reading time":"6 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/data-flair.training\/blogs\/hive-data-model\/#article","isPartOf":{"@id":"https:\/\/data-flair.training\/blogs\/hive-data-model\/"},"author":{"name":"DataFlair Team","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/person\/beb0cab24b7aa54423a3b50e669a9dcd"},"headline":"Hive Data Model &#8211; Learn to Develop Data Models in Hive","datePublished":"2017-09-04T13:12:58+00:00","mainEntityOfPage":{"@id":"https:\/\/data-flair.training\/blogs\/hive-data-model\/"},"wordCount":1155,"commentCount":2,"publisher":{"@id":"https:\/\/data-flair.training\/blogs\/#organization"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/hive-data-model\/#primaryimage"},"thumbnailUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/09\/Hive-Data-Model-01-1.jpg","keywords":["Data Models in Hive","Hive Buckets","Hive Data Model","Hive Partitions","Hive Tables","hive tutorial"],"articleSection":["Hive Tutorials"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/data-flair.training\/blogs\/hive-data-model\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/data-flair.training\/blogs\/hive-data-model\/","url":"https:\/\/data-flair.training\/blogs\/hive-data-model\/","name":"Hive Data Model - Learn to Develop Data Models in Hive - DataFlair","isPartOf":{"@id":"https:\/\/data-flair.training\/blogs\/#website"},"primaryImageOfPage":{"@id":"https:\/\/data-flair.training\/blogs\/hive-data-model\/#primaryimage"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/hive-data-model\/#primaryimage"},"thumbnailUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/09\/Hive-Data-Model-01-1.jpg","datePublished":"2017-09-04T13:12:58+00:00","description":"Apache Hive Data Model cover Table, partition, bucket in Hive.Type of Tables-Hive managed table,Hive external table and 
Commands to create Hive data models","breadcrumb":{"@id":"https:\/\/data-flair.training\/blogs\/hive-data-model\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/data-flair.training\/blogs\/hive-data-model\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/data-flair.training\/blogs\/hive-data-model\/#primaryimage","url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/09\/Hive-Data-Model-01-1.jpg","contentUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/09\/Hive-Data-Model-01-1.jpg","width":1200,"height":628,"caption":"Hive Data Model"},{"@type":"BreadcrumbList","@id":"https:\/\/data-flair.training\/blogs\/hive-data-model\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Blog Home","item":"https:\/\/data-flair.training\/blogs\/"},{"@type":"ListItem","position":2,"name":"Hive Tutorials","item":"https:\/\/data-flair.training\/blogs\/category\/hive\/"},{"@type":"ListItem","position":3,"name":"Hive Data Model &#8211; Learn to Develop Data Models in Hive"}]},{"@type":"WebSite","@id":"https:\/\/data-flair.training\/blogs\/#website","url":"https:\/\/data-flair.training\/blogs\/","name":"DataFlair","description":"Learn Today. 
Lead Tomorrow.","publisher":{"@id":"https:\/\/data-flair.training\/blogs\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/data-flair.training\/blogs\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/data-flair.training\/blogs\/#organization","name":"DataFlair","url":"https:\/\/data-flair.training\/blogs\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/logo\/image\/","url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2016\/07\/Data-Flair.png","contentUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2016\/07\/Data-Flair.png","width":106,"height":48,"caption":"DataFlair"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DataFlairWS\/","https:\/\/x.com\/DataFlairWS","https:\/\/www.linkedin.com\/company\/dataflair-web-services-pvt-ltd\/","https:\/\/www.youtube.com\/user\/DataFlairWS"]},{"@type":"Person","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/person\/beb0cab24b7aa54423a3b50e669a9dcd","name":"DataFlair Team","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/c322416204232f4dd97ef3901b0a499a5d34d7ba7fe333f4bfe53a907873d293?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/c322416204232f4dd97ef3901b0a499a5d34d7ba7fe333f4bfe53a907873d293?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/c322416204232f4dd97ef3901b0a499a5d34d7ba7fe333f4bfe53a907873d293?s=96&d=mm&r=g","caption":"DataFlair Team"},"description":"DataFlair Team specializes in creating clear, actionable content on programming, Java, Python, C++, DSA, AI, ML, data Science, Android, Flutter, MERN, Web Development, and technology. 
Backed by industry expertise, we make learning easy and career-oriented for beginners and pros alike.","url":"https:\/\/data-flair.training\/blogs\/author\/dfteam3\/"}]}},"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/3927","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/users\/7"}],"replies":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/comments?post=3927"}],"version-history":[{"count":0,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/3927\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/media\/42107"}],"wp:attachment":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/media?parent=3927"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/categories?post=3927"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/tags?post=3927"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}