Top 30 Tricky Hive Interview Questions and Answers
Hive Interview Questions and Answers
When preparing for a Hadoop job interview, you should be aware that questions may arise on any of its tools, such as Flume, Sqoop, HBase, MapReduce, Hive and many more. So, in this blog, “Hive Interview Questions”, we provide a list of the most commonly asked Hive interview questions and answers this year to help you face your Hadoop job interview.
These Hive scenario-based interview questions and answers are formulated to make candidates familiar with the nature of questions that are likely to be asked on the subject of Hive. Both freshers and experienced candidates can refer to this blog, as it contains questions for every candidate who is preparing for a Hadoop job interview.
Top 30 Most Asked Interview Questions for Hive
So, here are the top 30 frequently asked Hive Interview Questions:
Que 1. What is Apache Hive?
Ans. Hive is a data warehousing tool. It offers SQL-like queries for performing analysis, along with an abstraction over the data. Although Hive is not a database, it gives you a logical abstraction over databases and tables.
Que 2. What kind of applications is supported by Apache Hive?
Que 3. Is Hive suitable to be used for OLTP systems? Why?
Ans. No, Hive is not suitable for OLTP systems, since it is built for batch-oriented analysis and does not offer efficient insert and update at the row level.
Que 4. Where does the data of a Hive table get stored?
Ans. By default, the data of a Hive table is stored in the HDFS directory /user/hive/warehouse. One can change this by specifying the desired directory in the hive.metastore.warehouse.dir configuration parameter in hive-site.xml.
Que 5. What is a metastore in Hive?
Ans. The metastore stores the metadata information in Hive. It does so using an RDBMS and an open source ORM (Object-Relational Mapping) layer called DataNucleus, which converts the object representation into the relational schema and vice versa.
Que 6. Why does Hive not store metadata information in HDFS?
Ans. Hive stores metadata information in the metastore using an RDBMS instead of HDFS. We use an RDBMS to achieve low latency, because HDFS read/write operations are time-consuming.
Que 7. What is the difference between local and remote metastore?
Ans. Local Metastore:
In this configuration, the metastore service runs in the same JVM as the Hive service and connects to a database running in a separate JVM, either on the same machine or on a remote machine.
Remote Metastore:
In this configuration, the metastore service runs in its own separate JVM and not in the Hive service JVM.
Que 8. What is the default database provided by Apache Hive for metastore?
Ans. By default, Hive offers an embedded Derby database instance backed by the local disk for the metastore. This is what we call the embedded metastore configuration.
Que 9. What is the difference between the external table and managed table?
Ans. Managed table
If one drops a managed table, the metadata information is deleted along with the table data in the Hive warehouse directory.
External table
If one drops an external table, Hive deletes only the metadata information regarding the table. It leaves the table data present in HDFS untouched.
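The drop behavior of the two table types can be sketched with a toy model. This is not Hive code: the `metastore` and `hdfs` dictionaries, the table names, and the paths are hypothetical stand-ins chosen only for illustration.

```python
# Toy model (assumption): a dict of table metadata and a dict of "HDFS" files.
def drop_table(metastore, hdfs, name):
    """Drop a table: metadata is always removed, data only for managed tables."""
    table = metastore.pop(name)
    if table["type"] == "MANAGED":
        # Managed table: the data under the table location is deleted too.
        hdfs.pop(table["location"], None)
    # External table: files at table["location"] are left untouched.


metastore = {
    "sales_managed": {"type": "MANAGED", "location": "/user/hive/warehouse/sales_managed"},
    "sales_external": {"type": "EXTERNAL", "location": "/data/raw/sales"},
}
hdfs = {
    "/user/hive/warehouse/sales_managed": ["part-00000"],
    "/data/raw/sales": ["sales.csv"],
}

drop_table(metastore, hdfs, "sales_managed")
drop_table(metastore, hdfs, "sales_external")
print(metastore)  # {}
print(hdfs)       # {'/data/raw/sales': ['sales.csv']}
```

After both drops, neither table definition remains, but the files of the external table are still in place.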
Que 10. Is it possible to change the default location of a managed table?
Ans. Yes, we can change the default location of a managed table by using the clause LOCATION '<hdfs_path>' when creating it.
Hive Interview Questions for Freshers- Q. 1,2,3,4,5,7,8,9,10
Hive Interview Questions for Experienced- Q. 6
Que 11. When should we use SORT BY instead of ORDER BY?
Ans. We should use SORT BY instead of ORDER BY when we have to sort huge datasets. The reason is that the SORT BY clause sorts the data using multiple reducers, so each reducer's output is sorted but the overall result is not totally ordered. ORDER BY sorts all of the data together using a single reducer. Hence, ORDER BY takes a lot of time to execute on a large number of inputs.
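The difference can be illustrated with a small simulation. This is only a sketch: the function names are made up, and round-robin assignment stands in for however Hive actually distributes rows to reducers.

```python
def order_by(rows):
    """ORDER BY: a single reducer sees every row, so the output is totally ordered."""
    return sorted(rows)


def sort_by(rows, num_reducers):
    """SORT BY: rows are spread over several reducers; each sorts only its own share."""
    reducers = [[] for _ in range(num_reducers)]
    for i, row in enumerate(rows):
        # Round-robin is an assumption used here to spread the rows.
        reducers[i % num_reducers].append(row)
    return [sorted(share) for share in reducers]


rows = [7, 3, 9, 1, 8, 2]
print(order_by(rows))    # [1, 2, 3, 7, 8, 9]
print(sort_by(rows, 2))  # [[7, 8, 9], [1, 2, 3]]
```

Each reducer's output is sorted on its own, but concatenating the two outputs does not give a globally sorted result.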
Que 12. What is a partition in Hive?
Ans. Hive organizes tables into partitions in order to group similar types of data together on the basis of a column, or partition key. Each table can have one or more partition keys to identify a particular partition. In other words, a Hive partition is a sub-directory in the table directory.
Que 13. Why do we perform partitioning in Hive?
Ans. Partitioning provides granularity in a Hive table. It reduces query latency because a query scans only the relevant partitioned data instead of the whole dataset.
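A rough sketch of why this helps, assuming a hypothetical table partitioned by a `country` column and stored as one sub-directory per value (the names and rows are invented for illustration):

```python
# Hypothetical table layout: one sub-directory per partition key value.
table_dir = {
    "country=IN": [("alice", 120), ("bob", 80)],
    "country=US": [("carol", 200)],
    "country=UK": [("dave", 50)],
}


def query_partition(table_dir, country):
    """Read only the partition directory that matches the filter value."""
    key = f"country={country}"
    scanned = [key] if key in table_dir else []
    rows = list(table_dir.get(key, []))
    return rows, scanned


rows, scanned = query_partition(table_dir, "US")
print(rows)     # [('carol', 200)]
print(scanned)  # ['country=US']
```

Only one of the three directories is read; without partitioning, every row would have to be scanned and filtered.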
Que 14. What is dynamic partitioning and when is it used?
Ans. In dynamic partitioning, the values for partition columns are known only at runtime, that is, during loading of the data into a Hive table. We use it:
- While we load data from an existing non-partitioned table, in order to improve the sampling and thus decrease the query latency.
- Also, while we do not know all the values of the partitions beforehand, since finding these partition values manually in a huge dataset is a tedious task.
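As a minimal sketch of the idea, the partition for each row can be derived from the row itself while loading. The function name, column names, and rows here are hypothetical.

```python
from collections import defaultdict


def load_dynamic(rows, partition_col):
    """Route each row to a partition named after the value found in the row itself."""
    partitions = defaultdict(list)
    for row in rows:
        row = dict(row)                  # copy so the caller's data is not modified
        value = row.pop(partition_col)   # partition value is discovered at load time
        partitions[f"{partition_col}={value}"].append(row)
    return dict(partitions)


rows = [
    {"name": "alice", "year": 2019},
    {"name": "bob", "year": 2020},
    {"name": "carol", "year": 2019},
]
print(load_dynamic(rows, "year"))
# {'year=2019': [{'name': 'alice'}, {'name': 'carol'}], 'year=2020': [{'name': 'bob'}]}
```

Nobody had to list the years in advance; the set of partitions falls out of the data being loaded.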
Que 15. Why do we need buckets?
Ans. There are two main reasons for bucketing a partition:
- A map side join requires the data belonging to a unique join key to be present in the same partition.
- It allows us to decrease the query time and makes the sampling process more efficient.
Que 16. How does Hive distribute the rows into buckets?
Ans. Hive determines the bucket number for a row by using the formula: hash_function(bucketing_column) modulo (num_of_buckets). The hash_function depends on the column data type. For example, for the integer data type it is:
hash_function(int_type_column) = value of int_type_column
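The formula can be tried out on a few values. This sketch assumes a non-negative integer bucketing column, so the hash is the value itself; the function name and the sample ids are made up.

```python
def bucket_for(int_value, num_buckets):
    """Bucket number = hash_function(value) mod num_buckets.

    For an integer column the hash is the value itself
    (non-negative values assumed in this sketch).
    """
    return int_value % num_buckets


ids = [10, 11, 12, 13, 14, 21]
buckets = {}
for row_id in ids:
    buckets.setdefault(bucket_for(row_id, 4), []).append(row_id)

print(sorted(buckets.items()))
# [(0, [12]), (1, [13, 21]), (2, [10, 14]), (3, [11])]
```

Rows with the same bucketing value always land in the same bucket, which is what makes bucket-based joins and sampling possible.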
Que 17. What is indexing and why do we need it?
Ans. A Hive index is a Hive query optimization technique. We use it to speed up access to a column or set of columns in a Hive database, since with an index the database system does not need to read all rows in the table to find the selected data.
Que 18. What is the use of HCatalog?
Ans. We use HCatalog to share data structures with external systems. It gives users of other tools on Hadoop access to the Hive metastore, so that they can read and write data to Hive's data warehouse.
Que 19. Where is table data stored in Apache Hive by default?
Ans. hdfs://namenode_server/user/hive/warehouse
Que 20. Are multi-line comments supported in Hive?
Hive Interview Questions for Freshers- Q. 12,13,14,15,17,18,19,20
Hive Interview Questions for Experienced- Q. 11,16
Que 21. What is ObjectInspector functionality?
Ans. We use ObjectInspector to analyze the structure of individual columns and the internal structure of the row objects. It provides access to complex objects that can be stored in multiple formats in Hive.
Que 22. Explain the different types of joins in Hive.
Ans. There are 4 different types of joins in HiveQL –
- JOIN – It is very similar to an inner join in SQL. It returns only the rows that fulfill the join condition in both tables.
- FULL OUTER JOIN – This join combines the records of both the left and right tables. Rows that fulfill the join condition are matched, and unmatched rows from either side are also returned.
- LEFT OUTER JOIN – All the rows from the left table are returned even if there are no matches in the right table.
- RIGHT OUTER JOIN – All the rows from the right table are returned even if there are no matches in the left table.
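The four join types can be compared on a tiny example. This is a plain Python sketch of the semantics, not how Hive executes joins; the `join` helper and the sample tables are hypothetical, and `None` plays the role of NULL.

```python
def join(left, right, how="inner"):
    """Join two lists of (key, value) pairs on the key.

    how: "inner" (JOIN), "left", "right" or "full" (the OUTER JOIN variants).
    """
    out = []
    matched_right = set()
    for lk, lv in left:
        hit = False
        for i, (rk, rv) in enumerate(right):
            if lk == rk:
                out.append((lk, lv, rv))
                matched_right.add(i)
                hit = True
        if not hit and how in ("left", "full"):
            out.append((lk, lv, None))       # keep unmatched left row
    if how in ("right", "full"):
        for i, (rk, rv) in enumerate(right):
            if i not in matched_right:
                out.append((rk, None, rv))   # keep unmatched right row
    return out


customers = [(1, "Ann"), (2, "Bob")]
orders = [(2, "book"), (3, "pen")]
print(join(customers, orders, "inner"))  # [(2, 'Bob', 'book')]
print(join(customers, orders, "left"))   # [(1, 'Ann', None), (2, 'Bob', 'book')]
print(join(customers, orders, "right"))  # [(2, 'Bob', 'book'), (3, None, 'pen')]
print(join(customers, orders, "full"))   # [(1, 'Ann', None), (2, 'Bob', 'book'), (3, None, 'pen')]
```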
Que 23. How can you configure remote metastore mode in Hive?
Ans. To configure the remote metastore in Hive, the hive-site.xml file has to be configured with the property below –
hive.metastore.uris
thrift://node1 (or IP Address):9083
IP address and port of the metastore host
Que 24. Is it possible to change the default location of Managed Tables in Hive, if so how?
Ans. Yes, we can change the default location of managed tables by using the LOCATION keyword while creating the managed table. The one condition is that the user has to specify the storage path of the managed table as the value of the LOCATION keyword.
Que 25. How does data transfer happen from HDFS to Hive?
Ans. If the data is already present in HDFS, the user need not run LOAD DATA, which moves the files to /user/hive/warehouse/. Instead, the user just has to define the table using the keyword external, which creates the table definition in the Hive metastore.
Create external table table_name (
Que 26. What are the different components of a Hive architecture?
Ans. There are several components of the Hive architecture, such as –
- User Interface – It calls the execute interface of the driver. The driver then creates a session handle for the query and sends the query to the compiler to generate an execution plan for it.
- Metastore – On receiving a metadata request, it sends the metadata needed for the execution of the query to the compiler.
- Compiler – It generates the execution plan, which is a DAG of stages where each stage is either a metadata operation, a map or reduce job, or an operation on HDFS.
- Execution Engine – It submits each of these stages to the relevant components while managing the dependencies between them.
Que 27. Wherever (in a different directory) I run a Hive query, it creates a new metastore_db. Please explain the reason for it.
Ans. When we run Hive in embedded mode, it creates a local metastore. Before creating the metastore, it checks whether one already exists. This is defined in the configuration file hive-site.xml by the property “javax.jdo.option.ConnectionURL”, whose default value is “jdbc:derby:;databaseName=metastore_db;create=true”. Because metastore_db is a relative path, a new metastore is created in whichever directory Hive is run from. Hence, to change the behavior, change the location to an absolute path, so that the metastore will be used from that location.
Que 28. Is it possible for multiple users to use the same metastore in case of embedded Hive?
Ans. No, we cannot use the embedded metastore in sharing mode. To share it, use a standalone “real” database such as MySQL or PostgreSQL.
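The effect of a relative versus an absolute database name can be sketched with ordinary path resolution. This only illustrates how relative paths resolve against the working directory; it is not Derby's actual lookup code, and the function name and directory names are hypothetical.

```python
import posixpath


def resolve_metastore(database_name, working_dir):
    """Resolve databaseName the way a relative path resolves against the working directory."""
    # posixpath.join keeps an absolute database_name as-is and ignores working_dir.
    return posixpath.normpath(posixpath.join(working_dir, database_name))


# Relative name: a different metastore_db for every directory Hive is started from.
print(resolve_metastore("metastore_db", "/home/user/project_a"))
# /home/user/project_a/metastore_db
print(resolve_metastore("metastore_db", "/home/user/project_b"))
# /home/user/project_b/metastore_db

# Absolute name: the same metastore regardless of the starting directory.
print(resolve_metastore("/var/hive/metastore_db", "/home/user/project_a"))
# /var/hive/metastore_db
```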
Que 29. Usage of Hive.
Ans. Here, we will look at the following Hive usages.
– We use Hive for schema flexibility as well as evolution.
– Moreover, it is possible to partition and bucket tables in Apache Hive.
– Also, we can use JDBC/ODBC drivers, since they are available in Hive.
Que 30. Features and Limitations of Hive.
Ans. Features of Hive
- The best feature is that it offers data summarization, query, and analysis in a much easier manner.
- Hive supports external tables, which make it possible to process data without actually storing it in HDFS.
- Moreover, it fits the low-level interface requirement of Hadoop perfectly.
Limitations of Hive
- We cannot perform real-time queries with Hive. Also, it does not offer row-level updates.
- Moreover, for interactive data browsing, Hive offers acceptable, but not real-time, latency.
- Also, we can say Hive is not the right choice for online transaction processing.
Hive Interview Questions for Freshers- Q. 22,24,25,26,28,29,30
Hive Interview Questions for Experienced- Q. 21,23,27
As a result, we have seen the top 30 Hive Interview Questions and Answers. Once you go through them, you will have in-depth knowledge of the questions that are frequently asked in Hive interviews. We hope all these questions will help you prepare well for your Hive interviews ahead.
Moreover, if you have attended any Hadoop interviews previously, we encourage you to add the Hive interview questions you came across. Also, if you have any query regarding Hive Interview Questions, feel free to ask in the comment section. We will be happy to answer.