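Apache Hive is a data warehouse tool built on top of Hadoop, used to extract meaningful information from data. A data warehouse stores data from many different sources in a single location. Data comes in three main forms: structured (SQL databases), semi-structured (XML or JSON), and unstructured (music or video). To process structured data that is available in tabular form, we use Hive on top of Hadoop. Hive is very powerful and can query petabyte-scale data efficiently.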
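MapReduce is the default programming model on Hadoop and is usually written in Java or another general-purpose language, so Hive was designed mainly for developers who are familiar with SQL. With Hive, people who are less comfortable with Java can also process data on Hadoop. Hive also makes querying structured data easier, because writing the equivalent code in Java is harder than writing a Hive query. HQL, or HiveQL, is the query language used to work with Hive. Its syntax is very similar to SQL, which makes Hive easy to pick up.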
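To make the contrast between procedural MapReduce-style code and a declarative query concrete, here is a minimal Python sketch. It is not Hive itself: SQLite (from the Python standard library) stands in for Hive, and the `employees` table, its columns, and the sample rows are hypothetical. The query text is plain SQL of a kind that HQL also accepts.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (department, salary) rows.
ROWS = [("sales", 100), ("sales", 150), ("eng", 200), ("eng", 300), ("ops", 50)]


def total_by_dept_procedural(rows):
    """Hand-written map / shuffle / reduce style aggregation."""
    # Map: emit (key, value) pairs.
    pairs = [(dept, salary) for dept, salary in rows]
    # Shuffle: group values by key.
    groups = defaultdict(list)
    for dept, salary in pairs:
        groups[dept].append(salary)
    # Reduce: sum the values of each group.
    return {dept: sum(values) for dept, values in groups.items()}


def total_by_dept_declarative(rows):
    """Same result from one SQL-style query (SQLite is a stand-in for Hive)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE employees (dept TEXT, salary INTEGER)")
        conn.executemany("INSERT INTO employees VALUES (?, ?)", rows)
        query = "SELECT dept, SUM(salary) FROM employees GROUP BY dept"
        return dict(conn.execute(query).fetchall())
    finally:
        conn.close()
```

Both functions return the same totals per department. The declarative version only states what result is wanted and leaves the grouping and summing steps to the engine, which is the convenience Hive offers on top of Hadoop.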
Features of Apache Hive
Features | Explanation |
---|---|
Supported Computing Engines | Hive supports the MapReduce, Tez, and Spark execution engines. |
Framework | Hive is a stable batch-processing framework built on top of the Hadoop Distributed File System (HDFS) and can work as a data warehouse. |
Easy To Code | Hive uses the Hive Query Language (HQL) to query structured data, which is easy to write. A query that might take around 100 lines of Java can often be reduced to about 4 lines of HQL. |
Declarative | Like SQL, HQL is a declarative language, which means it is non-procedural. |
Structure Of Table | The table structure is similar to that of an RDBMS. Hive also supports partitioning and bucketing. |
Supported data structures | Tables, partitions, and buckets are the 3 data structures that Hive supports. |
Supports ETL | Apache Hive supports ETL, i.e. Extract, Transform, and Load. Before Hive, Python was often used for ETL. |
Storage | Hive lets users access files stored in HDFS, Apache HBase, Amazon S3, etc. |
Capable | Hive is capable of processing very large datasets, up to petabytes in size. |
Helps in processing unstructured data | We can easily embed custom MapReduce code with Hive to process unstructured data. |
Drivers | JDBC/ODBC drivers are also available in Hive. |
Fault Tolerance | Since Hive data is stored on HDFS, fault tolerance is provided by Hadoop. |
Areas of use | Hive can be used for data mining, predictive modeling, and document indexing. |
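
The partitioning and bucketing mentioned in the table can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Hive's implementation: the `users` table, the `country` partition column, and the bucket count are hypothetical, and integer modulo stands in for Hive's hash function, which in practice depends on the column type and Hive version.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count


def partition_dir(table, country):
    """Partition directory name in the `column=value` style Hive uses."""
    return f"{table}/country={country}"


def bucket_for(user_id, num_buckets=NUM_BUCKETS):
    """Bucket index for an integer key: key modulo bucket count (simplified)."""
    return user_id % num_buckets


def layout(table, rows):
    """Group (user_id, country) rows by partition directory and bucket index."""
    placed = defaultdict(list)
    for user_id, country in rows:
        key = (partition_dir(table, country), bucket_for(user_id))
        placed[key].append(user_id)
    return dict(placed)
```

Calling `layout("users", [(1, "US"), (5, "US"), (2, "IN")])` places user IDs 1 and 5 in bucket 1 of `users/country=US` and user ID 2 in bucket 2 of `users/country=IN`. A query that filters on `country` only needs to read the matching partition directory, and bucketing spreads the rows inside a partition over a fixed number of files.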
Limitations of Apache Hive
Limitation | Explanation |
---|---|
Does not support OLTP | Apache Hive doesn't support Online Transaction Processing (OLTP), but Online Analytical Processing (OLAP) is supported. |
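Limited update and delete | Hive does not support row-level update and delete operations on ordinary tables. Newer versions (Hive 0.14 and later) allow them only on transactional (ACID) tables. |
Limited subquery support | Subqueries are supported only in restricted forms. Older versions allow them only in the FROM clause; Hive 0.13 and later also allow some subqueries in the WHERE clause. |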
Latency | Query latency in Apache Hive is very high. |
Only non-real-time (cold) data is supported | Hive is not used for real-time data querying, since it takes a while to produce a result. |
Transaction processing is not supported | HQL is not designed for transaction processing workloads. |