Movatterモバイル変換


[0]ホーム

URL:


CN113326258A - Hash connection method, device and system, electronic equipment and computer storage medium - Google Patents

Hash connection method, device and system, electronic equipment and computer storage medium
Download PDF

Info

Publication number
CN113326258A
CN113326258ACN202010623534.5ACN202010623534ACN113326258ACN 113326258 ACN113326258 ACN 113326258ACN 202010623534 ACN202010623534 ACN 202010623534ACN 113326258 ACN113326258 ACN 113326258A
Authority
CN
China
Prior art keywords
hash
data
row
rows
connection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010623534.5A
Other languages
Chinese (zh)
Inventor
谢小龙
吕政�
金天波
吴宇昊
沈国权
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding LtdfiledCriticalAlibaba Group Holding Ltd
Priority to CN202010623534.5ApriorityCriticalpatent/CN113326258A/en
Publication of CN113326258ApublicationCriticalpatent/CN113326258A/en
Pendinglegal-statusCriticalCurrent

Links

Images

Classifications

Landscapes

Abstract

Translated fromChinese

本申请实施例提供了一种哈希连接方法、装置、系统、电子设备及计算机存储介质,涉及数据库领域。其中,所述方法包括:基于待连接的数据表中的多行数据的哈希值,确定所述多行数据在目标数据表的哈希表中分别对应的哈希入口;基于所述哈希入口,确定所述多行数据在所述哈希表中分别对应的哈希桶;若确定所述多行数据分别与对应的所述哈希桶中存放的所述目标数据表的行数据满足等值连接条件,则对所述多行数据与对应的所述哈希桶中存放的所述目标数据表的行数据进行哈希连接,以获得所述待连接的数据表与所述目标数据表的哈希连接结果。通过本申请实施例,能够有效提升哈希连接的性能,进而提升使用哈希连接的数据库的整体性能。

Figure 202010623534

Embodiments of the present application provide a hash connection method, apparatus, system, electronic device, and computer storage medium, which relate to the field of databases. Wherein, the method includes: based on the hash values of the multiple rows of data in the data table to be connected, determining the respective hash entries corresponding to the multiple rows of data in the hash table of the target data table; Entry, determine the respective hash buckets corresponding to the multiple rows of data in the hash table; if it is determined that the multiple rows of data and the row data of the target data table stored in the corresponding hash buckets satisfy Equivalent connection condition, then hash connection is performed between the multi-row data and the row data of the target data table stored in the corresponding hash bucket to obtain the data table to be connected and the target data. The result of the hash join of the tables. Through the embodiments of the present application, the performance of the hash connection can be effectively improved, thereby improving the overall performance of the database using the hash connection.

Figure 202010623534

Description

Hash connection method, device and system, electronic equipment and computer storage medium
Technical Field
The embodiment of the invention relates to the field of databases, in particular to a hash connection method, a hash connection device, a hash connection system, electronic equipment and a computer storage medium.
Background
Join statements in a database, such as SQL (Structured Query Language), may combine two or more tables in the database. The hash join algorithm is a common join algorithm in most databases, and can realize the join by using a hash table. The hash join algorithm generally comprises two phases: a hash table construction phase and a hash table probing phase. In the hash table probing stage, only one row of data is probed in each hash table probing, i.e. after one row of data is probed, the next row of data is probed. Therefore, the detection efficiency of the hash table is low, the performance of hash connection is affected, and the overall performance of the database is further affected. Therefore, how to effectively improve the performance of hash connection becomes a technical problem to be solved urgently at present.
Disclosure of Invention
In view of the above, embodiments of the present invention provide a hash join scheme to at least partially solve the above technical problems.
According to a first aspect of the embodiments of the present invention, a hash join method is provided. The method comprises the following steps: determining hash entries respectively corresponding to a plurality of rows of data in a hash table of a target data table based on hash values of the plurality of rows of data in the data table to be connected; determining hash buckets corresponding to the lines of data in the hash table respectively based on the hash entries; and if the fact that the rows of data respectively meet the row data of the target data table stored in the corresponding hash bucket with the corresponding row data of the target data table is determined, performing hash connection on the rows of data and the row data of the target data table stored in the corresponding hash bucket to obtain a hash connection result of the data table to be connected and the target data table.
According to a second aspect of the embodiments of the present invention, there is provided a hash connection apparatus. The device comprises: the first determining module is used for determining hash entries respectively corresponding to a plurality of rows of data in a hash table of a target data table based on hash values of the plurality of rows of data in the data table to be connected; a second determining module, configured to determine, based on the hash entries, hash buckets corresponding to the multiple lines of data in the hash table, respectively; and the hash connection module is used for performing hash connection on the plurality of lines of data and the corresponding line data of the target data table stored in the hash bucket to obtain a hash connection result of the data table to be connected and the target data table if the fact that the plurality of lines of data and the corresponding line data of the target data table stored in the hash bucket respectively meet the equivalent connection condition is determined.
According to a third aspect of embodiments of the present invention, there is provided a database system. The system comprises: the client is used for receiving operation used for indicating hash connection of a data table to be connected and sending a hash connection request used for requesting the hash connection of the data table to be connected to the database server based on the operation; the database server is configured to determine hash values of multiple rows of data in the data table to be connected based on the received hash connection request, determine hash entries corresponding to the multiple rows of data in a hash table of a target data table based on the hash values of the multiple rows of data in the data table to be connected, determine hash buckets corresponding to the multiple rows of data in the hash table based on the hash entries, and if it is determined that the multiple rows of data and the row data of the target data table stored in the corresponding hash bucket satisfy an equivalent connection condition, perform hash connection on the multiple rows of data and the row data of the target data table stored in the corresponding hash bucket to obtain a hash connection result between the data table to be connected and the target data table.
According to a fourth aspect of embodiments of the present invention, there is provided an electronic apparatus, including: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus; the memory is used for storing at least one executable instruction, and the executable instruction causes the processor to execute the operation corresponding to the hash connection method according to the first aspect.
According to a fifth aspect of embodiments of the present invention, there is provided a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the hash join method according to the first aspect.
According to the hash connection scheme provided by the embodiment of the invention, hash entries corresponding to a plurality of rows of data in the hash table of the target data table are determined based on hash values of a plurality of rows of data in the data table to be connected, and hash buckets corresponding to a plurality of rows of data in the hash table are determined based on the hash entries; if it is determined that the rows of data respectively satisfy the equivalent connection condition with the row data of the target data table stored in the corresponding hash bucket, hash connection is performed on the rows of data and the row data of the target data table stored in the corresponding hash bucket to obtain a hash connection result of the data table to be connected and the target data table, the rows of data in the data table to be connected can be detected by the hash table at one time, that is, the rows of data in the data table to be connected can be detected by the hash table at one time, the efficiency of hash table detection on the row data in the data table to be connected can be effectively improved, and further, the performance of hash connection and the overall performance of a database using hash connection can be effectively improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments described in the embodiments of the present invention, and it is also possible for a person skilled in the art to obtain other drawings based on the drawings.
FIG. 1A is a flowchart illustrating the steps of a method for Hash join according to an embodiment of the present invention;
fig. 1B is a schematic structural diagram of a hash table according to an embodiment of the present application;
fig. 1C is a schematic diagram of a hash join method according to an embodiment of the present application;
FIG. 2 is a diagram of a database system according to a second embodiment of the present application;
FIG. 3 is a schematic structural diagram of a three-split half-huh connection device according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a Hull connection device according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device in the fifth embodiment of the present application.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the embodiments of the present invention, the technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention shall fall within the scope of the protection of the embodiments of the present invention.
The following further describes specific implementation of the embodiments of the present invention with reference to the drawings.
Referring to fig. 1A, a flowchart of steps of a hash join method according to a first embodiment of the present application is shown.
Specifically, the hash connection method provided in this embodiment includes the following steps:
in step S101, hash entries corresponding to multiple rows of data in the hash table of the target data table are determined based on hash values of the multiple rows of data in the data table to be connected.
In the embodiment of the present application, the data table to be connected may be understood as a data table waiting for performing a connection operation in relational algebra. The hash value of the plurality of lines of data may be understood as a value obtained by hashing the plurality of lines of data. The hash entries of the plurality of lines of data respectively corresponding to the hash table of the target data table may be entry pointers of the plurality of lines of data respectively corresponding to the hash table of the target data table, indexes of the plurality of lines of data respectively corresponding to the hash table of the target data table, or array subscripts of the plurality of lines of data respectively corresponding to the hash table of the target data table, and the like. It should be understood that the above description is only exemplary, and the embodiments of the present application are not limited in this respect.
In a specific example, the hash table of the target data table may be generated based on the locally stored target data table before determining the hash entries corresponding to the plurality of rows of data in the hash table of the target data table, respectively. In the process of generating the hash table of the target data table, the data in the target data table can be sorted according to the similarity of the connection primary keys, so that the data in the hash table of the target data table are ensured to be distributed according to the connection primary keys as much as possible, and the execution of hash connection is facilitated. Here, the manner of generating the hash table differs depending on the database, and for example, the hash table of the target data table may be generated in an array manner, a linked list manner, or the like. It should be understood that the above description is only exemplary, and the embodiments of the present application are not limited in this respect.
In one implementation, a process of generating a hash table is described by taking a compute node 1 in a massively Parallel processing computer (MPP) architecture database as an example. Two internal tables exist in compute node 1: the order (order) table and Customer (Customer) table are shown in tables 1 and 2, respectively.
Figure BDA0002561092920000051
TABLE 1
Figure BDA0002561092920000052
TABLE 2
Suppose that a user submits a query request a to the MPP architecture database as follows: select c _ custkey, o _ order, o _ shift from customer, order where c _ custkey is o _ custkey;
and the coordinating node of the MPP framework database receives the query request A and issues corresponding query tasks to each computing node in the MPP framework database. After receiving the query task, the computing node 1 learns that the order form and the customer form need to be hashed and connected according to the custkey, and then the computing node 1 hashes and connects the order form and the customer form according to the custkey to form a hash table shown in the following table 3. The value of the hash key (hashkey) can be obtained by calculating through a corresponding hash algorithm, and the connection predicate is the custkey in the order list and the customer list.
Figure BDA0002561092920000061
TABLE 3
It should be understood that the above description is only exemplary, and the embodiments of the present application are not limited in this respect.
In some optional embodiments, when determining the hash entries corresponding to the rows of data in the hash table of the target data table respectively based on the hash values of the rows of data in the data table to be connected, retrieving the hash entry corresponding to each row of data in the hash table respectively based on the hash value of the connecting primary key of each row of data in the rows of data. It should be understood that the above description is only exemplary, and the embodiments of the present application are not limited in this respect.
In a specific example, the Join (Join) may be understood as a basic operation in a database for matching two data tables or multiple data tables. There are many variations such as inner join (inner join), anti join (anti join), etc. The Join Key (Join Key) may be understood as a row to be compared when joining, and may be one or more rows. And if the corresponding connection main keys of the data of the two rows of the two data tables are the same, the data of the two rows of the two data tables are successfully connected. It should be understood that the above description is only exemplary, and the embodiments of the present application are not limited in this respect.
In some optional embodiments, when retrieving the hash entry corresponding to each line of data in the hash table based on the hash value of the connected primary key of each line of data in the plurality of lines of data, the identification information of the hash entry corresponding to each line of data in the hash table is calculated in parallel based on the hash value of the connected primary key of each line of data in the plurality of lines of data by a plurality of threads for calculating the identification information of the hash entry; and determining the hash entry corresponding to each line of data in the hash table based on the identification information of the hash entry corresponding to each line of data in the hash table. Therefore, the retrieval efficiency of the hash entries corresponding to each row of data in the hash table can be effectively improved. It should be understood that the above description is only exemplary, and the embodiments of the present application are not limited in this respect.
In a specific example, the identification information of the hash entry may be a number of the hash entry. When the identification information of the hash entries corresponding to each row of data in the hash table is calculated in parallel through a plurality of threads for calculating the identification information of the hash entries and based on the hash values of the main keys connected to each row of data in the hash table, the identification information of the hash entries corresponding to each row of data in the hash table is calculated in parallel through each thread for calculating the identification information of the hash entries in the plurality of threads for calculating the identification information of the hash entries and based on the hash values of the main keys connected to each row of data in the hash table. More specifically, performing modulo operation on the hash value of each line of data in the plurality of lines of data, which is connected with the main key, through each thread in a plurality of threads for calculating the identification information of the hash entry, so as to obtain the identification information of the hash entry corresponding to each line of data in the hash table. It should be understood that the above description is only exemplary, and the embodiments of the present application are not limited in this respect.
In step S102, based on the hash entries, hash buckets corresponding to the rows of data in the hash table respectively are determined.
In the embodiments of the present application, the hash bucket may be understood as a "data block" storing a plurality of records. The size of the hash bucket is fixed, i.e. only a fixed number of collisions can be handled. As shown in fig. 1B, the hash table is composed of hash entries and hash buckets. Each hash entry correspondence includes one or more hash buckets. One or more lines of data and one or more lines of hash data of the data are correspondingly stored in each hash bucket. It should be understood that the above description is only exemplary, and the embodiments of the present application are not limited in this respect.
In some optional embodiments, after the parallel computation of the identification information of the hash entry corresponding to each line of data in the hash table based on the hash value of the primary key connected to each line of data in the plurality of lines of data by a plurality of threads for computing the identification information of the hash entry, the method further includes: caching identification information of hash entries respectively corresponding to each row of data in the hash table; the determining, based on the hash entry, hash buckets corresponding to the plurality of lines of data in the hash table respectively includes: and searching a hash bucket corresponding to each row of data in the hash table in the hash entry identified by the cached identification information. Specifically, in the hash entry identified by the cached identification information, traversing a hash bucket corresponding to each row of data in the plurality of rows of data in the hash table respectively. Wherein the hash entry includes one or more hash buckets. It should be understood that the above description is only exemplary, and the embodiments of the present application are not limited in this respect.
In some optional embodiments, when determining the hash buckets respectively corresponding to the plurality of lines of data in the hash table based on the hash entry, by a plurality of hash bucket lookup threads, in the hash entry, concurrently looking up each line of data in the plurality of lines of data in the hash table respectively corresponding to the hash buckets in the hash table. Therefore, the searching efficiency of the hash bucket corresponding to each row of data in the multiple rows of data in the hash table can be effectively improved. It should be understood that the above description is only exemplary, and the embodiments of the present application are not limited in this respect.
In a specific example, when the hash entry is searched for the hash bucket corresponding to each row of data in the hash table in parallel by a plurality of hash bucket lookup threads, each hash bucket lookup thread in the hash entry is searched for the hash bucket corresponding to each row of data in the hash table in parallel by each hash bucket lookup thread in the plurality of hash bucket lookup threads. It should be understood that the above description is only exemplary, and the embodiments of the present application are not limited in this respect.
In step S103, if it is determined that the rows of data and the rows of data of the target data table stored in the corresponding hash bucket respectively satisfy an equivalent connection condition, performing hash connection on the rows of data and the rows of data of the target data table stored in the corresponding hash bucket to obtain a hash connection result between the data table to be connected and the target data table.
In the embodiment of the present application, the equivalent join condition may be understood as a common join condition of a relational operation-join operation, which is a special case of conditional join (or θ join) when a join operator is a "═ sign, that is, θ equals 0. The Hash (Hash) is understood to mean that for a given datum, its corresponding Hash value is calculated. The hash values calculated for the same data must be identical. The Hash Join (Hash Join) can be understood as a Hash-based Join algorithm, and the basic idea is to create a Hash table from one data table, and then search each row of data of the other data table in the Hash table row by row. The hash table helps to improve the retrieval speed. It should be understood that the above description is only exemplary, and the embodiments of the present application are not limited in this respect.
In some optional embodiments, before hashing the rows of data with the rows of the target data table stored in the corresponding hash bucket, the method further comprises: for each row of data in the plurality of rows of data, if the hash value of the row of data connected with the main key is the same as the hash value of the row of data connected with the main key of the target data table stored in the corresponding hash bucket, determining whether the hash keys of the row of data are in one-to-one correspondence with the hash keys of the row of data of the target data table stored in the corresponding hash bucket; and if the hash keys of the line data are determined to be in one-to-one correspondence with the hash keys of the line data of the target data table stored in the corresponding hash bucket, determining that the line data and the line data of the target data table stored in the corresponding hash bucket meet the equivalent connection condition. Therefore, whether the row data and the row data of the target data table stored in the corresponding hash bucket meet the equivalent connection condition can be effectively determined through the hash value of the row data connected with the main key and the hash key of the row data. It should be understood that the above description is only exemplary, and the embodiments of the present application are not limited in this respect.
In a specific example, the hash value of the row data connected to the primary key may be a value obtained by hashing the row data connected to the primary key. The hash key of the row data can be understood as a key value obtained by hashing a field used by a connection condition of the row data. It should be understood that the above description is only exemplary, and the embodiments of the present application are not limited in this respect.
In some optional embodiments, when the multiple rows of data are hashed with the row data of the target data table stored in the corresponding hash bucket, each row of data in the multiple rows of data is hashed with the row data of the target data table stored in the corresponding hash bucket in parallel through multiple hash connection threads. Therefore, the efficiency of performing hash connection on each row of data in the rows of data and the row of data of the target data table stored in the corresponding hash bucket can be effectively improved. It should be understood that the above description is only exemplary, and the embodiments of the present application are not limited in this respect.
In a specific example, when hash connection is performed in parallel on each line of data in the plurality of lines of data and the line of data of the target data table stored in the corresponding hash bucket through a plurality of hash connection threads, hash connection is performed in parallel on each line of data in the plurality of lines of data and the line of data of the target data table stored in the corresponding hash bucket through each hash connection thread in the plurality of hash connection threads. It should be understood that the above description is only exemplary, and the embodiments of the present application are not limited in this respect.
In a specific example, as shown in fig. 1C, the hash join process is as follows: the method comprises the steps of firstly calculating hash values of main keys for connecting a plurality of rows of data in a data table to be connected, and then retrieving hash entries corresponding to each row of data in the hash table based on the hash values of the main keys for connecting the rows of data in the data table to be connected. After the hash entry corresponding to each row of data in the hash table is retrieved, traversing the hash bucket corresponding to each row of data in the hash table in the hash entry. Specifically, whether the hash value of the row data connecting main key is the same as the hash value of the row data connecting main key of the target data table stored in the corresponding hash bucket is determined, if it is determined that the hash value of the row data connecting main key is the same as the hash value of the row data connecting main key of the target data table stored in the corresponding hash bucket, it is determined whether the hash key of the row data is in one-to-one correspondence with the hash key of the row data of the target data table stored in the corresponding hash bucket, and if it is determined that the hash key of the row data is in one-to-one correspondence with the hash key of the row data of the target data table stored in the corresponding hash bucket, it is determined that the row data and the row data of the target data table stored in the corresponding hash bucket satisfy an equivalent connection condition, and thus a hash connection result of the row data and the row data of the target data table stored in the corresponding hash bucket can be obtained. And traversing the next hash bucket in the hash entry until the hash bucket in the hash entry is traversed if the hash value of the row data connected with the main key is determined to be different from the hash value of the row data connected with the main key of the target data table stored in the corresponding hash bucket. And traversing the next hash bucket in the hash entry until the hash bucket in the hash entry is traversed if the hash key of the line data is determined to be not corresponding to the hash key of the line data of the target data table stored in the corresponding hash bucket. It should be understood that the above description is only exemplary, and the embodiments of the present application are not limited in this respect.
According to the hash connection method provided by the embodiment of the invention, hash entries corresponding to a plurality of rows of data in the hash table of the target data table are determined based on hash values of a plurality of rows of data in the data table to be connected, and hash buckets corresponding to a plurality of rows of data in the hash table are determined based on the hash entries; if it is determined that the rows of data respectively satisfy the equivalent connection condition with the row data of the target data table stored in the corresponding hash bucket, hash connection is performed on the rows of data and the row data of the target data table stored in the corresponding hash bucket to obtain a hash connection result of the data table to be connected and the target data table, the rows of data in the data table to be connected can be detected by the hash table at one time, that is, the rows of data in the data table to be connected can be detected by the hash table at one time, the efficiency of hash table detection on the row data in the data table to be connected can be effectively improved, and further, the performance of hash connection and the overall performance of a database using hash connection can be effectively improved.
The hash connection method provided by the present embodiment may be executed by any suitable device with data processing capability, including but not limited to: a camera, a terminal, a mobile terminal, a PC, a server, an in-vehicle device, an entertainment device, an advertising device, a Personal Digital Assistant (PDA), a tablet, a laptop, a handheld game machine, glasses, a watch, a wearable device, a virtual display device, a display enhancement device, or the like.
Referring to fig. 2, a schematic diagram of a database system in the second embodiment of the present application is shown.
The database system provided by the embodiment comprises: the client is used for receiving operation used for indicating hash connection of a data table to be connected and sending a hash connection request used for requesting the hash connection of the data table to be connected to the database server based on the operation; the database server is configured to determine hash values of multiple rows of data in the data table to be connected based on the received hash connection request, determine hash entries corresponding to the multiple rows of data in a hash table of a target data table based on the hash values of the multiple rows of data in the data table to be connected, determine hash buckets corresponding to the multiple rows of data in the hash table based on the hash entries, and if it is determined that the multiple rows of data and the row data of the target data table stored in the corresponding hash bucket satisfy an equivalent connection condition, perform hash connection on the multiple rows of data and the row data of the target data table stored in the corresponding hash bucket to obtain a hash connection result between the data table to be connected and the target data table.
Referring to fig. 2, a schematic structural diagram of a database system for implementing the hash connection method provided in the embodiment of the present application, where the system may include a database server and a client in a terminal device a, it should be understood that the database server and the client in the terminal device a presented in fig. 2 are only exemplary and are not limited to the implementation forms of the database server and the client in the terminal device a.
In practical application, the database server and the terminal device a may be connected by a wired or wireless network, and may specifically realize communication connection through mobile networks such as GSM, GPRS, LTE, and the like, or perform communication connection through modes such as bluetooth, WIFI, infrared, and the like.
The database server may be a server provided in a service device for providing services for a user, and specifically may be an independent application service device.
The terminal device a may be a user-oriented terminal capable of interacting with a user, such as a mobile phone, a notebook, a computer, an iPad, an intelligent audio, and the like, and may be various self-service terminals, such as self-service machines in places such as hospitals, banks, stations, and the like, and in addition, the terminal device a may also be an intelligent machine supporting interaction, such as a chat robot, a floor sweeping robot, a meal ordering service robot, and the like. The product type and the physical form of the terminal equipment are not limited, and the terminal equipment needs to have an interactive function and can be realized by installing interactive application programs such as database and the like.
When performing hash connection, the client in the terminal device a may send a hash connection request to the database server through the network. The database server receives a hash connection request sent by a client side in the terminal device A, and performs hash connection on a data table to be connected based on the hash connection request. Therefore, the hash connection method provided in the embodiment of the present application may be executed by a database server, and a specific implementation process may refer to the description of the first method embodiment.
Through the database system provided by the embodiment of the application, a client receives an operation for indicating hash connection of a data table to be connected, and sends a hash connection request for requesting hash connection of the data table to be connected to the database server based on the operation, the database server determines hash values of multiple rows of data in the data table to be connected based on the received hash connection request, determines hash entries respectively corresponding to the multiple rows of data in a hash table of a target data table based on the hash values of the multiple rows of data in the data table to be connected, determines hash buckets respectively corresponding to the multiple rows of data in the hash table based on the hash entries, and if the multiple rows of data respectively satisfy an equivalent connection condition with the row data of the target data table stored in the corresponding hash buckets, and performing hash connection on the multi-row data and the corresponding row data of the target data table stored in the hash bucket to obtain a hash connection result between the data table to be connected and the target data table, and performing hash table detection on the multi-row data in the data table to be connected at one time, namely performing hash table detection on batch row data in the data table to be connected at one time, so that the efficiency of performing hash table detection on the row data in the data table to be connected can be effectively improved, and further, the performance of hash connection and the overall performance of a database using hash connection can be effectively improved.
The database system provided in this embodiment is used to implement the corresponding hash connection method in the foregoing method embodiments, and details are not described here.
Referring to fig. 3, a schematic structural diagram of a third huh connection device according to an embodiment of the present application is shown.
The hash connection apparatus provided in this embodiment includes: a first determining module 201, configured to determine, based on hash values of multiple rows of data in a data table to be connected, hash entries corresponding to the multiple rows of data in a hash table of a target data table, respectively; a second determining module 202, configured to determine, based on the hash entries, hash buckets corresponding to the lines of data in the hash table, respectively; and the hash connection module 203 is configured to perform hash connection on the multiple lines of data and the corresponding line data of the target data table stored in the hash bucket if it is determined that the multiple lines of data and the corresponding line data of the target data table stored in the hash bucket respectively satisfy an equivalent connection condition, so as to obtain a hash connection result between the data table to be connected and the target data table.
The hash connection apparatus provided in this embodiment is used to implement the corresponding hash connection method in the foregoing multiple method embodiments, and has the beneficial effects of the corresponding method embodiments, which are not described herein again.
Referring to fig. 4, a schematic structural diagram of a huh-in-four connection device according to an embodiment of the present application is shown.
The hash connection apparatus provided in this embodiment includes: a first determining module 301, configured to determine, based on hash values of multiple rows of data in a data table to be connected, hash entries corresponding to the multiple rows of data in a hash table of a target data table, respectively; a second determining module 302, configured to determine, based on the hash entries, hash buckets corresponding to the rows of data in the hash table, respectively; a hash connection module 305, configured to perform hash connection on the multiple lines of data and the line data of the target data table stored in the corresponding hash bucket if it is determined that the multiple lines of data and the line data of the target data table stored in the corresponding hash bucket respectively satisfy an equal value connection condition, so as to obtain a hash connection result between the data table to be connected and the target data table.
Optionally, the first determining module 301 includes: and the retrieval submodule 3011 is configured to retrieve, based on the hash value of each line of data in the plurality of lines of data, a hash entry corresponding to each line of data in the hash table.
Optionally, the retrieving sub-module 3011 includes: a calculating unit 3012, configured to calculate, in parallel, identification information of hash entries corresponding to each line of data in the hash table based on hash values of the primary keys connected to each line of data in the plurality of lines of data through a plurality of threads for calculating identification information of the hash entries; a determining unit 3014, configured to determine, based on identification information of hash entries corresponding to the lines of data in the hash table, hash entries corresponding to the lines of data in the hash table respectively.
Optionally, after the computing unit 3012, the retrieving sub-module 3011 further includes: a caching unit 3013, configured to cache identification information of hash entries corresponding to the rows of data in the hash table respectively; the second determining module 302 is specifically configured to: and searching a hash bucket corresponding to each row of data in the hash table in the hash entry identified by the cached identification information.
Optionally, the second determining module 302 is specifically configured to: and searching hash buckets corresponding to each row of data in the hash table in the hash entry in parallel through a plurality of hash bucket searching threads.
Optionally, before the hash connection module 305, the apparatus further includes: a third determining module 303, configured to determine, for each row of data in the multiple rows of data, whether hash keys of a row of data correspond to hash keys of a row of data of the target data table stored in a corresponding hash bucket one by one if it is determined that hash values of connecting primary keys of a row of data are the same as hash values of connecting primary keys of a row of data of the target data table stored in the corresponding hash bucket; a fourth determining module 304, configured to determine that line data and line data of the target data table stored in the corresponding hash bucket satisfy an equivalent connection condition if it is determined that the hash key of the line data and the hash key of the line data of the target data table stored in the corresponding hash bucket correspond to each other one by one.
Optionally, the hash connection module 305 is specifically configured to: and performing hash connection on each line of data in the plurality of lines of data and the corresponding line of data of the target data table stored in the hash bucket in parallel through a plurality of hash connection threads.
The hash connection apparatus provided in this embodiment is used to implement the corresponding hash connection method in the foregoing multiple method embodiments, and has the beneficial effects of the corresponding method embodiments, which are not described herein again.
Referring to fig. 5, a schematic structural diagram of an electronic device according to a fifth embodiment of the present invention is shown, and the specific embodiment of the present invention does not limit the specific implementation of the electronic device.
As shown in fig. 5, the electronic device may include: a processor (processor)402, aCommunications Interface 404, a memory 406, and a Communications bus 408.
Wherein:
the processor 402,communication interface 404, and memory 406 communicate with each other via a communication bus 408.
Acommunication interface 404 for communicating with other electronic devices or servers.
The processor 402 is configured to execute theprogram 410, and may specifically perform relevant steps in the hash join method embodiment described above.
In particular,program 410 may include program code comprising computer operating instructions.
The processor 402 may be a central processing unit CPU or an application Specific Integrated circuit asic or one or more Integrated circuits configured to implement embodiments of the present invention. The intelligent device comprises one or more processors which can be the same type of processor, such as one or more CPUs; or may be different types of processors such as one or more CPUs and one or more ASICs.
And a memory 406 for storing aprogram 410. Memory 406 may comprise high-speed RAM memory, and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
Theprogram 410 may specifically be configured to cause the processor 402 to perform the following operations: determining hash entries respectively corresponding to a plurality of rows of data in a hash table of a target data table based on hash values of the plurality of rows of data in the data table to be connected; determining hash buckets corresponding to the lines of data in the hash table respectively based on the hash entries; and if the fact that the rows of data respectively meet the row data of the target data table stored in the corresponding hash bucket with the corresponding row data of the target data table is determined, performing hash connection on the rows of data and the row data of the target data table stored in the corresponding hash bucket to obtain a hash connection result of the data table to be connected and the target data table.
In an optional implementation, theprogram 410 is further configured to cause the processor 402 to, when determining, based on hash values of multiple rows of data in the data table to be connected, corresponding hash entries in the hash table of the target data table for the multiple rows of data, retrieve, based on the hash value of the connecting primary key of each row of data in the multiple rows of data, the corresponding hash entries in the hash table for the each row of data.
In an alternative embodiment, theprogram 410 is further configured to cause the processor 402 to, when retrieving the hash entry corresponding to each row of data in the hash table based on the hash value of the connected primary key of each row of data in the plurality of rows, concurrently calculate, by a plurality of threads for calculating identification information of the hash entry, the identification information of the hash entry corresponding to each row of data in the hash table based on the hash value of the connected primary key of each row of data in the plurality of rows; and determining the hash entry corresponding to each line of data in the hash table based on the identification information of the hash entry corresponding to each line of data in the hash table.
In an alternative embodiment, theprogram 410 is further configured to cause the processor 402 to cache the identification information of the hash entry corresponding to each line of data in the hash table after parallel computing the identification information of the hash entry corresponding to each line of data in the hash table based on the hash value of the primary key connecting to each line of data in the plurality of lines of data through a plurality of threads for computing the identification information of the hash entry; and when determining hash buckets corresponding to the multiple lines of data in the hash table respectively based on the hash entries, searching the hash bucket corresponding to each line of data in the hash table respectively in the hash entry identified by the cached identification information.
In an alternative embodiment, theprogram 410 is further configured to cause the processor 402 to, when determining the hash buckets corresponding to the rows of data in the hash table respectively based on the hash entries, search, by a plurality of hash bucket search threads, in the hash entries, the hash buckets corresponding to each row of data in the hash table respectively in the rows of data in parallel.
In an optional implementation, theprogram 410 is further configured to, before hash-connecting the plurality of rows of data with the row data of the target data table stored in the corresponding hash bucket, for each row of data in the plurality of rows of data, determine whether hash keys of a row data correspond to hash keys of a row data of the target data table stored in the corresponding hash bucket one-to-one if it is determined that hash values of connected primary keys of a row data of the target data table stored in the corresponding hash bucket are the same; and if the hash keys of the line data are determined to be in one-to-one correspondence with the hash keys of the line data of the target data table stored in the corresponding hash bucket, determining that the line data and the line data of the target data table stored in the corresponding hash bucket meet the equivalent connection condition.
In an optional implementation, theprogram 410 is further configured to, when performing hash connection on the plurality of lines of data and the corresponding line data of the target data table stored in the hash bucket, perform hash connection on each line of data in the plurality of lines of data and the corresponding line data of the target data table stored in the hash bucket in parallel through a plurality of hash connection threads.
For specific implementation of each step in theprogram 410, reference may be made to corresponding steps and corresponding descriptions in units in the foregoing hash join method embodiment, which are not described herein again. It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described devices and modules may refer to the corresponding process descriptions in the foregoing method embodiments, and are not described herein again.
Through the electronic device in the embodiment, hash entries corresponding to a plurality of rows of data in a hash table of a target data table are determined based on hash values of the plurality of rows of data in the data table to be connected, and hash buckets corresponding to the plurality of rows of data in the hash table are determined based on the hash entries; if it is determined that the rows of data respectively satisfy the equivalent connection condition with the row data of the target data table stored in the corresponding hash bucket, hash connection is performed on the rows of data and the row data of the target data table stored in the corresponding hash bucket to obtain a hash connection result of the data table to be connected and the target data table, the rows of data in the data table to be connected can be detected by the hash table at one time, that is, the rows of data in the data table to be connected can be detected by the hash table at one time, the efficiency of hash table detection on the row data in the data table to be connected can be effectively improved, and further, the performance of hash connection and the overall performance of a database using hash connection can be effectively improved.
It should be noted that, according to the implementation requirement, each component/step described in the embodiment of the present invention may be divided into more components/steps, and two or more components/steps or partial operations of the components/steps may also be combined into a new component/step to achieve the purpose of the embodiment of the present invention.
The above-described method according to an embodiment of the present invention may be implemented in hardware, firmware, or as software or computer code storable in a recording medium such as a CD ROM, a RAM, a floppy disk, a hard disk, or a magneto-optical disk, or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium downloaded through a network and to be stored in a local recording medium, so that the method described herein may be stored in such software processing on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC or FPGA. It will be appreciated that the computer, processor, microprocessor controller or programmable hardware includes memory components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code that, when accessed and executed by the computer, processor or hardware, implements the hash join method described herein. Further, when a general-purpose computer accesses code for implementing the hash join method shown herein, execution of the code transforms the general-purpose computer into a special-purpose computer for performing the hash join method shown herein.
Those of ordinary skill in the art will appreciate that the various illustrative elements and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present embodiments.
The above embodiments are only for illustrating the embodiments of the present invention and not for limiting the embodiments of the present invention, and those skilled in the art can make various changes and modifications without departing from the spirit and scope of the embodiments of the present invention, so that all equivalent technical solutions also belong to the scope of the embodiments of the present invention, and the scope of patent protection of the embodiments of the present invention should be defined by the claims.

Claims (11)

1. A hash join method, the method comprising:
determining hash entries respectively corresponding to a plurality of rows of data in a hash table of a target data table based on hash values of the plurality of rows of data in the data table to be connected;
determining hash buckets corresponding to the lines of data in the hash table respectively based on the hash entries;
and if the fact that the rows of data respectively meet the row data of the target data table stored in the corresponding hash bucket with the corresponding row data of the target data table is determined, performing hash connection on the rows of data and the row data of the target data table stored in the corresponding hash bucket to obtain a hash connection result of the data table to be connected and the target data table.
2. The method of claim 1, wherein the determining, based on hash values of a plurality of rows of data in the data tables to be connected, hash entries corresponding to the rows of data in the hash table of the target data table respectively comprises:
and retrieving hash entries respectively corresponding to each row of data in the hash table based on the hash value of each row of data connected with the primary key.
3. The method of claim 2, wherein the retrieving the hash entry corresponding to each row of data in the hash table based on the hash value of the connected primary key of each row of data comprises:
through a plurality of threads for calculating identification information of hash entries, based on hash values of connecting main keys of each row of data in the plurality of rows, the identification information of the hash entries respectively corresponding to each row of data in the hash table is calculated in parallel;
and determining the hash entry corresponding to each line of data in the hash table based on the identification information of the hash entry corresponding to each line of data in the hash table.
4. The method of claim 3, wherein the parallel computation of the identification information of each row of data in the hash table after the identification information of the corresponding hash entry is performed in parallel by a plurality of threads for computing the identification information of the hash entries based on the hash value of the connected primary key of each row of data in the plurality of rows, the method further comprises:
caching identification information of hash entries respectively corresponding to each row of data in the hash table;
the determining, based on the hash entry, hash buckets corresponding to the plurality of lines of data in the hash table respectively includes:
and searching a hash bucket corresponding to each row of data in the hash table in the hash entry identified by the cached identification information.
5. The method of claim 1, wherein the determining, based on the hash entry, respective hash buckets for the lines of data in the hash table comprises:
and searching hash buckets corresponding to each row of data in the hash table in the hash entry in parallel through a plurality of hash bucket searching threads.
6. The method of claim 1, wherein prior to the hashing the rows of data with the rows of data of the target data table stored in the corresponding hash bucket, the method further comprises:
for each row of data in the plurality of rows of data, if the hash value of the row of data connected with the main key is the same as the hash value of the row of data connected with the main key of the target data table stored in the corresponding hash bucket, determining whether the hash keys of the row of data are in one-to-one correspondence with the hash keys of the row of data of the target data table stored in the corresponding hash bucket;
and if the hash keys of the line data are determined to be in one-to-one correspondence with the hash keys of the line data of the target data table stored in the corresponding hash bucket, determining that the line data and the line data of the target data table stored in the corresponding hash bucket meet the equivalent connection condition.
7. The method of claim 1, wherein the hashing the rows of data with the rows of the target data table stored in the corresponding hash bucket comprises:
and performing hash connection on each line of data in the plurality of lines of data and the corresponding line of data of the target data table stored in the hash bucket in parallel through a plurality of hash connection threads.
8. A hash join apparatus, the apparatus comprising:
the first determining module is used for determining hash entries respectively corresponding to a plurality of rows of data in a hash table of a target data table based on hash values of the plurality of rows of data in the data table to be connected;
a second determining module, configured to determine, based on the hash entries, hash buckets corresponding to the multiple lines of data in the hash table, respectively;
and the hash connection module is used for performing hash connection on the plurality of lines of data and the corresponding line data of the target data table stored in the hash bucket to obtain a hash connection result of the data table to be connected and the target data table if the fact that the plurality of lines of data and the corresponding line data of the target data table stored in the hash bucket respectively meet the equivalent connection condition is determined.
9. A database system, the system comprising:
a client and a database server communicatively coupled to the client,
the client is used for receiving operation for indicating hash connection of the data table to be connected and sending a hash connection request for requesting the hash connection of the data table to be connected to the database server based on the operation;
the database server is configured to determine hash values of multiple rows of data in the data table to be connected based on the received hash connection request, determine hash entries corresponding to the multiple rows of data in a hash table of a target data table based on the hash values of the multiple rows of data in the data table to be connected, determine hash buckets corresponding to the multiple rows of data in the hash table based on the hash entries, and if it is determined that the multiple rows of data and the row data of the target data table stored in the corresponding hash bucket satisfy an equivalent connection condition, perform hash connection on the multiple rows of data and the row data of the target data table stored in the corresponding hash bucket to obtain a hash connection result between the data table to be connected and the target data table.
10. An electronic device, comprising: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction causes the processor to execute the operation corresponding to the hash connection method in any one of claims 1-7.
11. A computer storage medium having stored thereon a computer program which, when executed by a processor, implements the hash join method of any one of claims 1 to 7.
CN202010623534.5A2020-06-292020-06-29Hash connection method, device and system, electronic equipment and computer storage mediumPendingCN113326258A (en)

Priority Applications (1)

Application NumberPriority DateFiling DateTitle
CN202010623534.5ACN113326258A (en)2020-06-292020-06-29Hash connection method, device and system, electronic equipment and computer storage medium

Applications Claiming Priority (1)

Application NumberPriority DateFiling DateTitle
CN202010623534.5ACN113326258A (en)2020-06-292020-06-29Hash connection method, device and system, electronic equipment and computer storage medium

Publications (1)

Publication NumberPublication Date
CN113326258Atrue CN113326258A (en)2021-08-31

Family

ID=77413043

Family Applications (1)

Application NumberTitlePriority DateFiling Date
CN202010623534.5APendingCN113326258A (en)2020-06-292020-06-29Hash connection method, device and system, electronic equipment and computer storage medium

Country Status (1)

CountryLink
CN (1)CN113326258A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
CN115022110A (en)*2022-08-082022-09-06广州市千钧网络科技有限公司Message distribution method, readable medium and electronic device
CN115374142A (en)*2022-04-152022-11-22北京理工大学LSH-based direct query method for non-equivalent connectable data table

Citations (3)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
CN104679757A (en)*2013-11-272015-06-03华为技术有限公司Data processing method and device
US20160275078A1 (en)*2015-03-202016-09-22International Business Machines CorporationParallel build of non-partitioned join hash tables and non-enforced n:1 join hash tables
CN110069496A (en)*2019-03-202019-07-30韶关学院A kind of Novel chain type Hash table construction method and device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
CN104679757A (en)*2013-11-272015-06-03华为技术有限公司Data processing method and device
US20160275078A1 (en)*2015-03-202016-09-22International Business Machines CorporationParallel build of non-partitioned join hash tables and non-enforced n:1 join hash tables
CN110069496A (en)*2019-03-202019-07-30韶关学院A kind of Novel chain type Hash table construction method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SIOULAS, P ET.AL.: ""Hardware-conscious hash-joins on GPUs"", 《 2019 IEEE 35TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2019)》, 13 August 2019 (2019-08-13), pages 698*
傅仁毅: "《数据库设计与性能优化》", 31 May 2011, 华中科技大学出版社, pages: 158 - 161*
方祝和: ""内存数据库中多表连接的性能优化"", 《中国博士学位论文全文数据库信息科技辑》, vol. 2019, no. 09, 15 September 2019 (2019-09-15), pages 137 - 4*

Cited By (2)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
CN115374142A (en)*2022-04-152022-11-22北京理工大学LSH-based direct query method for non-equivalent connectable data table
CN115022110A (en)*2022-08-082022-09-06广州市千钧网络科技有限公司Message distribution method, readable medium and electronic device

Similar Documents

PublicationPublication DateTitle
US8515964B2 (en)Method and system for fast similarity computation in high dimensional space
CN112084366B (en)Method, apparatus, device and storage medium for retrieving image
CN113495965B (en) A multimedia content retrieval method, device, equipment and storage medium
CN108959538B (en)Full text retrieval system and method
US8032550B2 (en)Federated document search by keywords
CN110175174A (en)A kind of data query method, apparatus, equipment and storage medium
CN113326258A (en)Hash connection method, device and system, electronic equipment and computer storage medium
WO2016070750A1 (en)Distributed buffering range querying method, device, and system
US11886394B2 (en)Composable query language gateway routing protocol
CN110597808A (en) Distributed database table connection method, device, system, server and medium
CN113535781A (en)Data query method, device, equipment and storage medium of time sequence library
WO2022006794A1 (en)Routing directives for partitioned databases
Deng et al.Spatial-keyword skyline publish/subscribe query processing over distributed sliding window streaming data
CN105574010B (en)Data query method and device
CN111241137A (en)Data processing method and device, electronic equipment and storage medium
WO2020068241A1 (en)Fashion by trend user interfaces
CN111464451B (en)Data stream equivalent connection optimization method and system and electronic equipment
CN103891244B (en) A method and device for data storage and retrieval
WO2025077303A1 (en)Information processing method and apparatus, and device and storage medium
HK40058031A (en)Hash join method, device and system, electronic equipment and computer storage medium
CN110109919B (en)Method and device for determining logic information
CN113868533B (en) Application search method, device, electronic device and storage medium
CN116361340A (en) Detail inquiry method, device, electronic equipment and storage medium
CN115221167A (en) Static data storage and query method, device and electronic device
CN112699149A (en)Target data acquisition method and device, storage medium and electronic device

Legal Events

DateCodeTitleDescription
PB01Publication
PB01Publication
SE01Entry into force of request for substantive examination
SE01Entry into force of request for substantive examination
REGReference to a national code

Ref country code:HK

Ref legal event code:DE

Ref document number:40058031

Country of ref document:HK


[8]ページ先頭

©2009-2025 Movatter.jp