Apache Hive is an open-source data warehouse system built on top of Hadoop, used for querying and analyzing large datasets stored in Hadoop files. It provides data summarization, querying, and ad-hoc analysis, and it processes both structured and semi-structured data in Hadoop.
Before Hive, analysts had to write complex MapReduce jobs. With Hive, they can instead submit SQL-like queries. Apache Hive is mainly targeted at users who are comfortable with SQL. It uses a language called HiveQL (HQL), which is similar to SQL; Hive translates these SQL-like queries into MapReduce jobs. The important point is that there is no need to learn Java to use Hive.
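To illustrate the point above, here is a sketch of what such a SQL-like query might look like. The table and column names (`page_visits`, `page`) are hypothetical, used only for illustration; behind the scenes, Hive would compile a query like this into one or more MapReduce jobs.

```sql
-- Hypothetical query: count page views per page.
-- Hive compiles this into MapReduce jobs automatically.
SELECT page, COUNT(*) AS views
FROM page_visits
GROUP BY page
ORDER BY views DESC
LIMIT 10;
```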
Hive organizes data into tables, which provides a means of attaching structure to data stored in HDFS.
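As a sketch of how a table attaches structure to files already sitting in HDFS, one might define an external table over an existing directory. The table name, columns, and HDFS path here are illustrative assumptions, not part of the original text:

```sql
-- Hypothetical example: map a schema onto existing
-- comma-delimited files under an HDFS directory.
CREATE EXTERNAL TABLE page_visits (
  user_id BIGINT,
  page    STRING,
  ts      TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/page_visits';
```

Because the table is `EXTERNAL`, Hive only records the schema in its metastore; dropping the table does not delete the underlying HDFS files.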
Before adopting Apache Hive, Facebook faced serious challenges as the volume of data it generated exploded, making the data very difficult to manage. Traditional RDBMSs could not handle the load, so Facebook looked for better options. It initially tried plain MapReduce, but the difficulty of programming it made that approach impractical for analysts who were comfortable with SQL. Apache Hive allowed Facebook to overcome these challenges. With Apache Hive, it became possible to rely on the following:
• Schema flexibility and evolution
• Tables can be partitioned and bucketed
• Hive tables can be defined directly in the HDFS
• JDBC/ODBC drivers are available
• Extensible – Types, Formats, Functions and scripts
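The partitioning and bucketing mentioned above can be sketched in HiveQL as follows. The table and column names are hypothetical; the partition column and bucket count are illustrative choices, not recommendations from the original text:

```sql
-- Hypothetical example: a table partitioned by date
-- (one HDFS subdirectory per event_date value) and
-- bucketed by user_id (rows hashed into 32 files).
CREATE TABLE user_events (
  user_id BIGINT,
  action  STRING
)
PARTITIONED BY (event_date STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;
```

Partition pruning lets queries that filter on `event_date` scan only the matching subdirectories, while bucketing can speed up joins and sampling on `user_id`.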