| Original author(s) | Facebook, Inc. |
|---|---|
| Developer(s) | Contributors |
| Initial release | October 1, 2010[1] |
| Repository | github |
| Written in | Java |
| Operating system | Cross-platform |
| Available in | SQL |
| Type | Data warehouse |
| License | Apache License 2.0 |
| Website | hive |
Apache Hive is a data warehouse software project. It is built on top of Apache Hadoop to provide data query and analysis.[3][4] Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Without it, traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data.
Hive provides the necessary SQL abstraction to integrate SQL-like queries (HiveQL) into the underlying Java without the need to implement queries in the low-level Java API. Hive facilitates the integration of SQL-based querying languages with Hadoop, which is commonly used in data warehousing applications.[5] While initially developed by Facebook, Apache Hive is used and developed by other companies such as Netflix and the Financial Industry Regulatory Authority (FINRA).[6][7] Amazon maintains a software fork of Apache Hive included in Amazon Elastic MapReduce on Amazon Web Services.[8]
Apache Hive supports the analysis of large datasets stored in Hadoop's HDFS and compatible file systems such as the Amazon S3 filesystem and Alluxio. It provides a SQL-like query language called HiveQL[9] with schema on read and transparently converts queries to MapReduce, Apache Tez[10] and Spark jobs. All three execution engines can run in Hadoop's resource negotiator, YARN (Yet Another Resource Negotiator). To accelerate queries, Hive provided indexes, but this feature was removed in version 3.0.[11] Other features of Hive include:
- By default, Hive stores metadata in an embedded Apache Derby database; other client/server databases like MySQL can optionally be used.[12]
- The first four file formats supported in Hive were plain text,[13] sequence file, optimized row columnar (ORC) format[14][15] and RCFile.[16][17] Apache Parquet can be read via plugin in versions later than 0.10 and natively starting at 0.13.[18][19] The format is chosen per table, as sketched below.
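As a minimal sketch (with hypothetical table names not taken from the article), the on-disk format of a Hive table is selected with the `STORED AS` clause when the table is created:

```sql
-- Each table can use a different storage format; ORC and Parquet are columnar.
CREATE TABLE events_text    (id INT, payload STRING) STORED AS TEXTFILE;
CREATE TABLE events_seq     (id INT, payload STRING) STORED AS SEQUENCEFILE;
CREATE TABLE events_orc     (id INT, payload STRING) STORED AS ORC;
CREATE TABLE events_parquet (id INT, payload STRING) STORED AS PARQUET;
```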
Major components of the Hive architecture are the metastore, which stores metadata for each table such as its schema and location; the driver, which receives HiveQL statements and manages their lifecycle; the compiler, which translates a query into an execution plan; the optimizer, which transforms that plan to improve performance; the executor, which runs the resulting tasks; and user interfaces such as the command-line interface (CLI) and the Thrift server.
While based on SQL, HiveQL does not strictly follow the full SQL-92 standard. HiveQL offers extensions not in SQL, including multi-table inserts and create-table-as-select. HiveQL lacked support for transactions and materialized views, and offered only limited subquery support.[25][26] Support for insert, update, and delete with full ACID functionality was made available with release 0.14.[27]
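Both extensions can be sketched briefly; the `docs` table is the one used in the word-count example below, while the other table names are hypothetical and must already exist:

```sql
-- Multi-table insert: a single scan of the source feeds several targets.
FROM docs
INSERT OVERWRITE TABLE short_lines SELECT line WHERE length(line) < 80
INSERT OVERWRITE TABLE long_lines  SELECT line WHERE length(line) >= 80;

-- Create-table-as-select (CTAS): define and populate a table in one statement.
CREATE TABLE line_lengths AS
SELECT line, length(line) AS len FROM docs;
```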
Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce, Tez, or Spark jobs, which are submitted to Hadoop for execution.[28]
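The compiled plan can be inspected with `EXPLAIN`, which prints the graph of stages without executing the query (shown here against the `docs` table from the word-count example below):

```sql
-- Prints the DAG of stages the compiler generated for this query.
EXPLAIN
SELECT count(1) FROM docs;
```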
The word count program counts the number of times each word occurs in the input; it can be written in HiveQL as follows:[5]
```sql
DROP TABLE IF EXISTS docs;
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'input_file' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
  (SELECT explode(split(line, '\s')) AS word FROM docs) temp
GROUP BY word
ORDER BY word;
```
A brief explanation of each of the statements is as follows:
```sql
DROP TABLE IF EXISTS docs;
CREATE TABLE docs (line STRING);
```

Checks if table `docs` exists and drops it if it does, then creates a new table called `docs` with a single column of type `STRING` called `line`.
```sql
LOAD DATA INPATH 'input_file' OVERWRITE INTO TABLE docs;
```

Loads the specified file or directory (in this case, 'input_file') into the table. `OVERWRITE` specifies that the target table is rewritten; otherwise, the data would be appended.
```sql
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
  (SELECT explode(split(line, '\s')) AS word FROM docs) temp
GROUP BY word
ORDER BY word;
```

The query `CREATE TABLE word_counts AS SELECT word, count(1) AS count` creates a table called `word_counts` with two columns: `word` and `count`. It draws its input from the inner query `(SELECT explode(split(line, '\s')) AS word FROM docs) temp`, which splits the input lines into individual words and emits them as rows of a temporary table aliased as `temp`. The `GROUP BY word` clause groups the results by key, so the `count` column holds the number of occurrences of each word in the `word` column. Finally, `ORDER BY word` sorts the words alphabetically.
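The resulting table can then be queried like any other; for example, a hypothetical follow-up query for the ten most frequent words:

```sql
-- `count` is backtick-quoted because it is also the name of an aggregate function.
SELECT word, `count`
FROM word_counts
ORDER BY `count` DESC
LIMIT 10;
```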
The storage and querying operations of Hive closely resemble those of traditional databases. However, while HiveQL is a SQL dialect, Hive differs considerably from relational databases in structure and operation, mainly because Hive is built on top of the Hadoop ecosystem and has to comply with the restrictions of Hadoop and MapReduce.
In traditional databases, a schema is applied to a table when data is loaded into it: the database verifies that the data entered follows the representation specified by the table definition. This design is called schema on write. In comparison, Hive does not verify the data against the table schema on write; instead, it performs run-time checks when the data is read. This model is called schema on read.[25] The two approaches have their own advantages and drawbacks.
Checking data against the table schema at load time adds extra overhead, which is why traditional databases take longer to load data. Quality checks performed at load time ensure that the data is not corrupt, and early detection of corrupt data enables early exception handling. Because the tables are forced to match the schema during or after the data load, query-time performance is better. Hive, on the other hand, can load data dynamically without any schema check, ensuring a fast initial load at the cost of comparatively slower performance at query time. Hive does have an advantage when the schema is not available at load time but is instead generated later dynamically.[25]
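A minimal sketch of schema on read, assuming tab-delimited files already exist at a hypothetical HDFS path:

```sql
-- The schema is applied when the files are read, not when they are written;
-- fields that cannot be parsed are returned as NULL rather than rejected at load.
CREATE EXTERNAL TABLE raw_logs (ts STRING, level STRING, msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/raw_logs';

SELECT level, count(1) FROM raw_logs GROUP BY level;
```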
Transactions are key operations in traditional databases. Like any typical RDBMS, Hive supports all four ACID properties of transactions: atomicity, consistency, isolation, and durability. Transactions were introduced in Hive 0.13 but were limited to the partition level.[29] Hive 0.14 fully added these functions to support complete ACID properties, providing row-level transactions such as INSERT, DELETE, and UPDATE.[30] Enabling INSERT, UPDATE, and DELETE transactions requires setting appropriate values for configuration properties such as `hive.support.concurrency`, `hive.enforce.bucketing`, and `hive.exec.dynamic.partition.mode`.[31]
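A hedged sketch of the setup on Hive 0.14 or later (the exact property set varies by version, production deployments normally configure these in hive-site.xml rather than with per-session `SET` statements, and the `users` table is hypothetical):

```sql
-- Session-level settings commonly required for ACID transactions.
SET hive.support.concurrency = true;
SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- Transactional tables must be bucketed and stored as ORC.
CREATE TABLE users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Row-level operations then behave much as in a conventional RDBMS.
INSERT INTO TABLE users VALUES (1, 'alice'), (2, 'bob');
UPDATE users SET name = 'carol' WHERE id = 2;
DELETE FROM users WHERE id = 1;
```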
Hive v0.7.0 added integration with Hadoop security. Hadoop began using Kerberos authentication support to provide security; Kerberos allows for mutual authentication between client and server, and in this system the client's ticket is passed along with the request. Previous versions of Hadoop had several issues, such as users being able to spoof their username by setting the `hadoop.job.ugi` property, and MapReduce operations being run under the same user (hadoop or mapred). With Hive v0.7.0's integration with Hadoop security, these issues have largely been fixed. TaskTracker jobs are run by the user who launched them, and the username can no longer be spoofed by setting the `hadoop.job.ugi` property. Permissions for newly created files in Hive are dictated by HDFS. The Hadoop distributed file system's authorization model uses three entities (user, group, and others) with three permissions (read, write, and execute). The default permissions for newly created files can be set by changing the umask value of the Hive configuration variable `hive.files.umask.value`.[5]
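For illustration only, an assumed session-level override of that variable (it would normally be set in the Hive configuration files):

```sql
-- Loosen the umask so that files Hive creates are group-writable.
SET hive.files.umask.value = 0002;
```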