CN120045587A

Info

Publication number: CN120045587A
Application number: CN202510171181.2A
Authority: CN
Inventors: 李伟亮; 陈彬; 林志达; 张喜铭; 林克全; 石刚; 徐欢; 杨航; 王钦洲
Original assignee: China Southern Power Grid Co Ltd
Current assignee: China Southern Power Grid Co Ltd
Priority date: 2025-02-17
Filing date: 2025-02-17
Publication date: 2025-05-27

Abstract

A data query method, apparatus, device, storage medium and program product for a distributed database. The method comprises: constructing a query plan evaluation model based on the data transmission delay between every two working nodes in the distributed cluster, the computing power resources of each working node, and data storage layout information; when a query request is received, calling the query plan evaluation model to evaluate each of a plurality of query plans generated from the query request, and determining a target query plan from the plurality of query plans based on the evaluation results; determining at least one target working node in the distributed cluster where the data to be queried is located; and pushing each query operator contained in the target query plan down to each target working node where the data to be queried is located, based on a preset push-down strategy, so that each target working node executes the query operators according to the target query plan to obtain the data query result. The method can improve the data query efficiency of the distributed database.

Description

Data query method, device, equipment, storage medium and program product for distributed database

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, a storage medium, and a program product for querying data in a distributed database.

Background

With the rapid development of information technology, data volumes have grown explosively, and conventional databases face many challenges in processing large-scale data; distributed databases emerged in response. However, because the data of a distributed database is spread across multiple nodes and a query requires data exchange between those nodes, data query speed is low.

Therefore, how to improve the data query efficiency of the distributed database is an urgent problem to be solved.

Disclosure of Invention

The embodiment of the application provides a data query method, a device, equipment, a storage medium and a program product of a distributed database, which can improve the data query efficiency of the distributed database.

In a first aspect, an embodiment of the present application provides a data query method of a distributed database, which is applied to a database management node in a distributed cluster deployed with the distributed database, where the method includes:

Constructing a query plan evaluation model based on data transmission delay between every two working nodes in a plurality of working nodes included in the distributed cluster, computing power resources of each working node and data storage layout information;

Under the condition of receiving a query request, a query plan evaluation model is called, a plurality of query plans generated based on the query request are evaluated respectively, and a target query plan is determined from the plurality of query plans based on an evaluation result;

Determining at least one target working node where data to be queried carried by a query request are located in a distributed cluster;

Based on a preset push-down strategy, pushing down each query operator contained in the target query plan to each target working node where the data to be queried is located, so that each target working node executes each query operator based on the target query plan to obtain a data query result.

In one embodiment, determining the target query plan from the plurality of query plans based on the evaluation result comprises: determining, as the target query plan, the query plan for which the cost corresponding to the execution order of the plurality of query operators is minimal, the data transmission volume among the plurality of target working nodes is minimal, and the computing power resources required by the query task allocated to each target working node are smaller than the computing power resources of that target working node.

In one embodiment, the computing power resources of each working node are determined by: determining hardware configuration information of each of the plurality of working nodes, the hardware configuration information comprising the number of central processing unit cores and the memory capacity; weighting each item of the hardware configuration information of each working node based on a preset computing power resource evaluation strategy; and obtaining the computing power resources of each working node from the weighted items.

In one embodiment, the method further comprises: detecting the running state of each target working node in real time; and, when a faulty target working node is detected, reallocating the query task of the faulty target working node to any other target working node among the at least one target working node.

In one embodiment, the method further comprises: in response to the query request, parsing the query statement carried in the query request with a syntax analyzer to obtain a target abstract syntax tree; extracting the plurality of query operators contained in the query statement from the target abstract syntax tree; and generating a plurality of query plans based on the plurality of query operators contained in the query statement.

In one embodiment, parsing the query statement carried in the query request in response to the query request to obtain the target abstract syntax tree comprises: in response to the query request, performing lexical analysis on the query statement carried in the query request; decomposing the query statement based on the lexical analysis result to obtain a decomposed query statement; performing syntax analysis on the decomposed query statement; and constructing the target abstract syntax tree based on the syntax analysis result.

In a second aspect, the present application provides a data query device of a distributed database, applied to a database management node in a distributed cluster deployed with the distributed database, where the device includes:

The construction module is used for constructing a query plan evaluation model based on data transmission delay between every two working nodes in the plurality of working nodes included in the distributed cluster, computing power resources of each working node and data storage layout information;

the evaluation module is used for calling a query plan evaluation model under the condition of receiving a query request, respectively evaluating a plurality of query plans generated based on the query request, and determining a target query plan from the plurality of query plans based on evaluation results;

the determining module is used for determining at least one target working node where the data to be queried carried by the query request is located in the distributed cluster;

and the processing module is used for pushing down each query operator contained in the target query plan to each target working node where the data to be queried is located based on a preset pushing down strategy, so that each target working node executes each query operator based on the target query plan, and a data query result is obtained.

In a third aspect, the present application provides a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:

In a fourth aspect, the present application also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:

In a fifth aspect, the application also provides a computer program product comprising a computer program which, when executed by a processor, performs the steps of:

The data query method, apparatus, device, storage medium and program product of the distributed database are applied to a database management node in a distributed cluster in which the distributed database is deployed. The database management node can construct a query plan evaluation model based on the data transmission delay between every two working nodes among the plurality of working nodes included in the distributed cluster, the computing power resources of each working node, and data storage layout information. When a query request is received, the database management node calls the query plan evaluation model to evaluate each of a plurality of query plans generated from the query request and determines a target query plan, which contains a plurality of query operators, from the plurality of query plans based on the evaluation results. It then determines at least one target working node in the distributed cluster where the data to be queried carried by the query request is located, and pushes each query operator contained in the target query plan down to each of those target working nodes based on a preset push-down strategy, so that each target working node executes the query operators according to the target query plan to obtain the data query result.
Because the query plan evaluation model is built from the data transmission delay between every two working nodes in the distributed cluster, the computing power resources of each working node, and the data storage layout information, evaluating the plurality of query plans generated from a query request with this model allows a target query plan with a smaller data transmission delay to be selected. This reduces the impact of data transmission delay on query performance and improves data query efficiency. Pushing each query operator contained in that target query plan down to the target working nodes where the data to be queried is located, based on the preset push-down strategy, then avoids a large amount of unnecessary data transmission, lowers the network load, and further improves data query efficiency.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the related art, the drawings that are needed in the description of the embodiments of the present application or the related technologies will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other related drawings may be obtained according to these drawings without inventive effort to those of ordinary skill in the art.

FIG. 1 is a schematic diagram of an application scenario of a data query method of a distributed database according to an embodiment of the present application;

FIG. 2 is a schematic flow chart of a method for querying data in a distributed database according to an embodiment of the present application;

FIG. 3 is a flowchart illustrating another method for querying data in a distributed database according to an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a data query device of a distributed database according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.

The application scenario of the data query method of the distributed database provided by the embodiment of the application is described below.

Referring to fig. 1, fig. 1 is a schematic application scenario diagram of a data query method of a distributed database according to an embodiment of the present application. As shown in fig. 1, the distributed cluster 100 includes a database management node 101 and a plurality of working nodes (the plurality of working nodes including a target working node 102, a target working node 103, and a working node 104 are drawn in fig. 1 as an example).

The database management node 101 may construct a query plan evaluation model based on data transmission delay between every two working nodes in the distributed cluster, computing power resources of each working node and data storage layout information, call the query plan evaluation model under the condition of receiving a query request, evaluate a plurality of query plans generated based on the query request respectively, and determine a target query plan from the plurality of query plans based on evaluation results, wherein the target query plan comprises a plurality of query operators, determine at least one target working node, such as the target working node 102 and the target working node 103, where data to be queried carried by the query request is located in the distributed cluster, and push down each query operator contained in the target query plan to each target working node, namely the target working node 102 and the target working node 103, where the data to be queried is located based on a preset push-down strategy, so that each query operator is executed by the target working node 102 and the target working node 103 based on the target query plan, and a data query result is obtained. 
Because the query plan evaluation model is built from the data transmission delay between every two working nodes in the distributed cluster, the computing power resources of each working node, and the data storage layout information, evaluating the plurality of query plans generated from a query request with this model allows a target query plan with a smaller data transmission delay to be selected. This reduces the impact of data transmission delay on query performance and improves data query efficiency. Pushing each query operator contained in that target query plan down to the target working nodes where the data to be queried is located, based on the preset push-down strategy, then avoids a large amount of unnecessary data transmission, lowers the network load, and further improves data query efficiency.

Alternatively, both the database management node 101 and the plurality of working nodes may be servers. The servers mentioned herein may be independent physical servers, or may be a server cluster or a distributed system formed by a plurality of physical servers.

Referring to fig. 2, fig. 2 is a flowchart illustrating a data query method of a distributed database according to an embodiment of the present application. The method may be performed by a database management node in a distributed cluster, such as database management node 101 described above. As shown in fig. 2, the data query method of the distributed database may include, but is not limited to, the following steps:

S201, constructing a query plan evaluation model based on data transmission delay between every two working nodes in a plurality of working nodes included in the distributed cluster, computing power resources of each working node and data storage layout information.

The data transmission delay is one of the important factors affecting the query performance of the distributed database, and can be measured with the timing methods used in network communication.

Optionally, the data transmission delay between every two working nodes may be determined as follows: the database management node sends indication information to each working node, where the indication information instructs the working node to send test data packets to the other working nodes and measure the round-trip time; the database management node then receives the data transmission delays returned by each working node. With accurate data transmission delays available, node combinations with long data transmission paths and high delay can be avoided when the target query plan is determined, which reduces the time spent waiting for data transmission.

For example, periodically or under specific conditions (e.g., a node state change or a network configuration adjustment), the database management system may have one working node (denoted as the sending node) send test data packets of a specific size (e.g., 1 KB or 1 MB) to the other working nodes while recording the sending time. Each of the other working nodes (denoted as receiving nodes) returns an acknowledgement packet immediately after receiving a test packet, and the sending node records the receiving time when the acknowledgement packet arrives. The round-trip time (RTT) is the receiving time minus the sending time, and multiple measurements are averaged to obtain an estimate of the network transmission delay. For example, in a distributed database comprising three nodes, NodeA, NodeB and NodeC, NodeA sends test packets to NodeB and NodeC, and the measured average transmission delay is 5 ms between NodeA and NodeB and 8 ms between NodeA and NodeC.
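The averaging step described above can be sketched as follows. This is a minimal illustration only: the function names and the timestamp values are hypothetical, and the actual packet exchange between nodes is not modeled.

```python
from statistics import mean


def average_rtt_ms(samples):
    """Average round-trip time in milliseconds.

    `samples` is a list of (send_time, receive_time) pairs in seconds, as
    recorded by the sending node for each test packet and its acknowledgement.
    """
    if not samples:
        raise ValueError("at least one measurement is required")
    return mean((recv - sent) * 1000.0 for sent, recv in samples)


def build_delay_table(measurements):
    """Map each (sender, receiver) pair to its averaged RTT in milliseconds."""
    return {pair: average_rtt_ms(samples) for pair, samples in measurements.items()}


# Hypothetical timestamps chosen to reproduce the averages in the example above.
measurements = {
    ("NodeA", "NodeB"): [(0.000, 0.004), (1.000, 1.006)],
    ("NodeA", "NodeC"): [(0.000, 0.007), (1.000, 1.009)],
}
delays = build_delay_table(measurements)
```

A table of this kind is one possible input to the query plan evaluation model.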

Optionally, data transmission between every two working nodes may use a custom high-efficiency communication protocol that combines data compression and data encryption. Data compression exploits the redundancy and regularity of the data and compresses the transmitted data with a specific algorithm (such as Huffman coding) to reduce the data volume and the network bandwidth required for transmission. Data encryption (for example, a symmetric or asymmetric encryption algorithm) protects the data during transmission and prevents it from being stolen or tampered with. In this way, the amount of data transmitted is reduced, and both the efficiency and the security of data transmission are improved.

For example, assuming that data needs to be transmitted from NodeA to NodeB, NodeA may first compress the data to be transmitted using a data compression algorithm. For a data block that contains a large amount of repeated data or follows a certain pattern, the Huffman coding algorithm converts the block into a more compact coded form, which reduces the storage space and the transmission volume of the data. NodeA then encrypts the compressed data with an encryption algorithm to ensure its confidentiality. After receiving the compressed and encrypted data from NodeA, NodeB first decrypts it to recover the compressed data, and then decompresses it to obtain the original data.
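The ordering of the two steps (compress then encrypt on the sender, decrypt then decompress on the receiver) can be sketched as below. This is an illustration under stated assumptions, not the protocol of the application: `zlib` (DEFLATE, which combines LZ77 with Huffman coding) stands in for the compression algorithm, and the XOR keystream is a toy placeholder for a real encryption algorithm and is not secure. All function names are hypothetical.

```python
import hashlib
import zlib


def _keystream(key: bytes, length: int) -> bytes:
    """Toy keystream built from SHA-256(key || counter). Placeholder only, NOT secure."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(out[:length])


def _xor(data: bytes, key: bytes) -> bytes:
    """XOR the data with the keystream; applying it twice restores the input."""
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))


def pack(payload: bytes, key: bytes) -> bytes:
    """Sender side (NodeA): compress first, then encrypt the compressed bytes."""
    return _xor(zlib.compress(payload), key)


def unpack(message: bytes, key: bytes) -> bytes:
    """Receiver side (NodeB): decrypt first, then decompress."""
    return zlib.decompress(_xor(message, key))


# Highly repetitive data compresses well, as the text above notes.
payload = b"table1,row,value;" * 500
message = pack(payload, b"shared-secret")
restored = unpack(message, b"shared-secret")
```

Compression is applied before encryption because encrypted data has little redundancy left to compress.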

The computing power resources of each working node reflect the capability of that working node to process data, so introducing the computing power resources of each working node when constructing the query plan evaluation model can improve the accuracy of the evaluation results that the model produces for query plans.

Optionally, the computing power resources of each working node may be determined based on the hardware performance indicators of that working node. The hardware performance indicators may include, but are not limited to, the number of central processing unit (CPU) cores and the memory capacity. Determining the computing power resources of each working node in this way makes it easier to assign computation-intensive tasks to nodes with strong computing power, thereby improving computing efficiency.
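A weighted combination of the hardware indicators might look like the following sketch. The weights, the choice of gigabytes as the memory unit, and the function name are assumptions for illustration; the application does not specify the preset computing power resource evaluation strategy.

```python
def computing_power_score(cpu_cores, memory_gb, weights=None):
    """Weighted sum of hardware indicators for one working node.

    The default weights are hypothetical; a real evaluation strategy would
    choose them (and possibly normalize the inputs) for the target cluster.
    """
    weights = weights or {"cpu_cores": 0.6, "memory_gb": 0.4}
    return weights["cpu_cores"] * cpu_cores + weights["memory_gb"] * memory_gb


# Hypothetical hardware configurations.
nodes = {
    "NodeA": {"cpu_cores": 8, "memory_gb": 32},
    "NodeB": {"cpu_cores": 16, "memory_gb": 64},
}
scores = {name: computing_power_score(**hw) for name, hw in nodes.items()}
strongest = max(scores, key=scores.get)
```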

Because the data storage layout information relates to a database storage principle, different data storage layouts have different effects on data reading and processing efficiency, and therefore, the accuracy of an evaluation result obtained by the query plan evaluation model for the query plan can be improved by introducing the data storage layout information when the query plan evaluation model is constructed.

The data storage layout information indicates whether data is organized on the storage medium in row storage or column storage. Row storage stores data contiguously by row and suits reading and writing whole rows in transaction processing scenarios. Column storage stores data contiguously by column and reads column data in batches more efficiently in scenarios such as statistical data analysis. For example, in a data warehouse application where column-based analytical queries are frequent, a column storage layout may be more favorable for query performance.
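The difference between the two layouts can be illustrated with in-memory structures. This is only a conceptual sketch with hypothetical data; real storage engines lay out bytes on disk rather than Python objects.

```python
# Row storage: each record keeps all of its fields together.
rows = [
    {"key": 1, "value": 3, "column1": "x"},
    {"key": 2, "value": 7, "column1": "x"},
    {"key": 3, "value": 9, "column1": "y"},
]


def to_columns(rows):
    """Convert a row layout into a column layout (one list per column)."""
    columns = {}
    for row in rows:
        for name, cell in row.items():
            columns.setdefault(name, []).append(cell)
    return columns


columns = to_columns(rows)

# A whole-row read (transaction-style access) touches one record in the row layout.
second_row = rows[1]

# A column aggregate (analysis-style access) scans one contiguous list in the column layout.
total_value = sum(columns["value"])
```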

S202, under the condition that a query request is received, a query plan evaluation model is called, a plurality of query plans generated based on the query request are evaluated respectively, and a target query plan is determined from the plurality of query plans based on evaluation results.

Wherein the target query plan contains a plurality of query operators.

Because the constructed query plan evaluation model jointly considers multiple factors such as network transmission delay, node computing power resources (or computing capability) and data storage layout, it can accurately evaluate the cost of different query plans. For example, when the network bandwidth between NodeA and NodeB is limited and the transmission delay is high, while NodeB has strong computing power, the query plan evaluation model may favor a query plan that reduces cross-node data transmission and makes full use of the computing power of NodeB.
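One way such an evaluation could be expressed is a simple additive cost, sketched below. The cost formula, the `ms_per_mb` parameter, the plan representation and every number are assumptions made for illustration; the application does not disclose a concrete formula.

```python
def plan_cost(plan, delay_ms, power, ms_per_mb=1.0):
    """Hypothetical cost: transfer time plus compute time, in milliseconds.

    plan["transfers"]: list of (source, destination, megabytes)
    plan["compute"]:   list of (node, work_units)
    """
    transfer = sum(
        delay_ms[(src, dst)] + megabytes * ms_per_mb
        for src, dst, megabytes in plan["transfers"]
    )
    compute = sum(units / power[node] for node, units in plan["compute"])
    return transfer + compute


def choose_target_plan(plans, delay_ms, power):
    """Pick the candidate plan with the lowest estimated cost."""
    return min(plans, key=lambda plan: plan_cost(plan, delay_ms, power))


delay_ms = {("NodeA", "NodeB"): 5.0}
power = {"NodeA": 10.0, "NodeB": 20.0}
plans = [
    # Ship the whole table1 partition to NodeB, then filter and join there.
    {"name": "ship_all", "transfers": [("NodeA", "NodeB", 100)],
     "compute": [("NodeB", 400)]},
    # Filter on NodeA first, ship only the surviving rows.
    {"name": "filter_first", "transfers": [("NodeA", "NodeB", 10)],
     "compute": [("NodeA", 100), ("NodeB", 200)]},
]
target = choose_target_plan(plans, delay_ms, power)
```

Under these made-up numbers the plan that filters before transferring is selected, which matches the tendency described above.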

S203, determining at least one target working node where the data to be queried carried by the query request is located in the distributed cluster.

In an alternative implementation mode, the database management node determines at least one target working node where the data to be queried carried by the query request is located in the distributed cluster, and the method can comprise the steps of determining the identification of the data to be queried included in the query statement, and determining at least one target working node where the data to be queried carried by the query request is located in the distributed cluster based on the identification of the data to be queried and the data distribution information of the distributed database, wherein each target working node is at least one working node in the distributed cluster.

In some embodiments, the data distribution information of the distributed database may be determined by the database management node based on metadata of the distributed database. Alternatively, the database management node may read information about the distribution of the table data from the metadata storage area of the distributed database, and determine the data distribution information of the distributed database based on the distribution information of the table data.

The metadata of the database is stored in a specific data structure (e.g., a record containing the start address and end address of a data block, the table it belongs to, and the identifier of the working node that stores it). By recording key information such as the start address, end address, owning table and storage node identifier of each data block, the location of the data, namely the at least one target working node where the data resides, can be determined accurately during query optimization. A reasonable query execution plan can then be formulated according to the data distribution, for example by pushing the operators that involve data on a specific node down to that node for execution, which avoids blind data transmission and reduces network overhead. This also facilitates load balancing: computing tasks are allocated reasonably across the nodes, and the resource utilization of the whole distributed database system is improved.

For example, when the query involves table "table1", the system can quickly acquire the data blocks of "table1" distributed on which working nodes and the specific address range of each data block by querying metadata, so as to provide accurate data distribution basis for subsequent query optimization.
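A metadata lookup of this kind can be sketched as follows. The record fields follow the description above (start address, end address, owning table, storage node identifier), while the class name, function name and sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockRecord:
    """One metadata record describing a data block (field names are illustrative)."""
    table: str
    start_address: int
    end_address: int
    node_id: str


def locate_table(metadata, table):
    """Return {node_id: [(start, end), ...]} for every block of `table`."""
    placement = {}
    for record in metadata:
        if record.table == table:
            placement.setdefault(record.node_id, []).append(
                (record.start_address, record.end_address)
            )
    return placement


# Hypothetical metadata contents.
metadata = [
    BlockRecord("table1", 0, 4095, "NodeA"),
    BlockRecord("table1", 4096, 8191, "NodeB"),
    BlockRecord("table2", 0, 2047, "NodeB"),
]
placement = locate_table(metadata, "table1")
target_nodes = sorted(placement)
```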

S204, pushing down each query operator contained in the target query plan to each target working node where the data to be queried is located based on a preset pushing down strategy, so that each target working node executes each query operator based on the target query plan, and a data query result is obtained.

Optionally, the query operators may include, but are not limited to, a screening operator, a join operator, an aggregation operator and a projection operator. The screening operator selects the data that meets specific conditions, the join operator associates data from different tables, the aggregation operator performs statistical calculation and grouping on the data, and the projection operator determines the columns contained in the final returned result set.

For example, after the screening operator is pushed down to NodeA and NodeB, an independent screening process is started on each node, the local data is filtered according to the condition "table1.value > 5", and the screening result is stored in the local memory of the node. NodeB then starts a join process, joins its local table2 data with the filtered table1 data transmitted from NodeA according to the join condition "table1.key = table2.key", and also stores the join result in local memory. Finally, NodeB starts an aggregation process, performs the "SUM(column2) GROUP BY column1" operation, combines the aggregation result with "column1" into the final result set, and returns the result set to the client through a network communication protocol (e.g., the custom high-efficiency communication protocol). Pushing the screening operator down to the target working nodes where the data is located thus avoids a large amount of unnecessary data transmission and reduces the network load. In addition, because the join and aggregation operate only on the small amount of data that remains after screening and transmission, query execution efficiency is significantly improved and query response time is reduced. At the same time, using the stronger computing power of NodeB for the join and aggregation avoids wasting node computing resources and improves the resource utilization of the whole distributed database system.
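The flow in this example can be simulated in a single process as below. The sample rows are hypothetical, and the sketch assumes that column1 belongs to table1, that column2 belongs to table2, and that NodeB also holds a table1 partition of its own; none of these details are stated in the application.

```python
from collections import defaultdict


def screen(table1_rows):
    """Screening operator pushed down to each node: table1.value > 5."""
    return [row for row in table1_rows if row["value"] > 5]


def join_and_aggregate(table1_rows, table2_rows):
    """Join on key, then SUM(column2) GROUP BY column1 (executed on NodeB)."""
    table2_by_key = defaultdict(list)
    for row in table2_rows:
        table2_by_key[row["key"]].append(row)
    totals = defaultdict(int)
    for left in table1_rows:
        for right in table2_by_key.get(left["key"], []):
            totals[left["column1"]] += right["column2"]
    return dict(totals)


# Hypothetical local partitions.
node_a_table1 = [{"key": 1, "value": 3, "column1": "x"},
                 {"key": 2, "value": 7, "column1": "x"}]
node_b_table1 = [{"key": 3, "value": 9, "column1": "y"},
                 {"key": 4, "value": 1, "column1": "y"}]
node_b_table2 = [{"key": 2, "column2": 10},
                 {"key": 3, "column2": 20},
                 {"key": 3, "column2": 5}]

shipped_from_a = screen(node_a_table1)   # only these rows cross the network
local_on_b = screen(node_b_table1)
result = join_and_aggregate(shipped_from_a + local_on_b, node_b_table2)
```

Only one of the two NodeA rows is transmitted here, which is the data reduction the push-down strategy aims for.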

In the embodiment of the application, a database management node can construct a query plan evaluation model based on data transmission delay between every two working nodes in a plurality of working nodes included in a distributed cluster, computing power resources of each working node and data storage layout information, under the condition of receiving a query request, the query plan evaluation model is called to evaluate a plurality of query plans generated based on the query request respectively, a target query plan is determined from the plurality of query plans based on evaluation results, the target query plan comprises a plurality of query operators, at least one target working node where data to be queried carried by the query request is located in the distributed cluster is determined, each query operator contained in the target query plan is pushed down to each target working node where the data to be queried is located based on a preset push-down strategy, so that each target working node executes each query operator based on the target query plan to obtain a data query result. 
Because the query plan evaluation model is built from the data transmission delay between every two working nodes in the distributed cluster, the computing power resources of each working node, and the data storage layout information, evaluating the plurality of query plans generated from a query request with this model allows a target query plan with a smaller data transmission delay to be selected. This reduces the impact of data transmission delay on query performance and improves data query efficiency. Pushing each query operator contained in that target query plan down to the target working nodes where the data to be queried is located, based on the preset push-down strategy, then avoids a large amount of unnecessary data transmission, lowers the network load, and further improves data query efficiency.

Referring to fig. 3, fig. 3 is a flowchart illustrating another data query method of a distributed database according to an embodiment of the present application. The difference from the data query method of the distributed database shown in fig. 2 is that the method shown in fig. 3 also illustrates how a plurality of query plans are generated. As shown in fig. 3, the data query method of the distributed database may include, but is not limited to, the following steps:

S301, constructing a query plan evaluation model based on data transmission delay between every two working nodes in a plurality of working nodes included in the distributed cluster, computing power resources of each working node and data storage layout information.

In an alternative embodiment, the description of step S301 may be referred to in the foregoing description of step S201, and will not be repeated here.

S302, under the condition that a query request is received, responding to the query request, and analyzing query sentences carried in the query request by utilizing a grammar analyzer to obtain a target abstract grammar tree.

In an alternative embodiment, the database management node responds to the query request and analyzes the query statement carried in the query request by using a grammar analyzer to obtain a target abstract grammar tree, and the method can comprise the steps of responding to the query request, performing lexical analysis on the query statement carried in the query request, decomposing the query statement based on a lexical analysis result to obtain a decomposed query statement, performing grammar analysis on the decomposed query statement, and constructing the target abstract grammar tree based on the grammar analysis result.

For example, a complex query statement such as "SELECT column1, SUM(column2) FROM table1 JOIN table2 ON table1.key = table2.key WHERE table1.value > 5 GROUP BY column1" may be gradually decomposed by the syntax analyzer into its syntax elements according to predetermined syntax rules, and a target abstract syntax tree is constructed. Each node of the tree represents a syntax structure, for example a clause such as SELECT, FROM or WHERE, and the leaf nodes are specific column names, table names, constant values, and the like.
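
The two stages described above (lexical analysis, then construction of a tree with one node per clause) can be sketched as follows. This is a minimal illustration only: the clause list `CLAUSES`, the token pattern, and the dictionary-based tree are assumptions made for the sketch, not the syntax analyzer of the application, and a real SQL grammar is far larger.

```python
import re

# Hypothetical clause keywords for this sketch only.
CLAUSES = ["SELECT", "FROM", "WHERE", "GROUP BY"]

def tokenize(sql):
    """Lexical analysis: split the statement into names, numbers and symbols."""
    return re.findall(r"[A-Za-z_][\w.]*|\d+|[<>=!]+|[(),*]", sql)

def build_ast(sql):
    """Clause-level syntax analysis: one child node per clause, tokens as leaves."""
    tokens = tokenize(sql)
    root = {"type": "QUERY", "children": []}
    i = 0
    while i < len(tokens):
        one = tokens[i].upper()
        two = one + " " + tokens[i + 1].upper() if i + 1 < len(tokens) else None
        if two in CLAUSES:            # two-word clause keyword such as GROUP BY
            root["children"].append({"type": two, "leaves": []})
            i += 2
        elif one in CLAUSES:          # one-word clause keyword
            root["children"].append({"type": one, "leaves": []})
            i += 1
        else:                         # any other token is a leaf of the latest clause
            if not root["children"]:
                raise ValueError("statement must start with a clause keyword")
            root["children"][-1]["leaves"].append(tokens[i])
            i += 1
    return root

ast = build_ast(
    "SELECT column1, SUM(column2) FROM table1 JOIN table2 "
    "ON table1.key = table2.key WHERE table1.value > 5 GROUP BY column1"
)
```

In this sketch the JOIN and ON tokens stay as leaves of the FROM node; a fuller parser would give them their own subtree.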

S303, extracting a plurality of query operators included in the query statement from the target abstract syntax tree, and generating a plurality of query plans based on the plurality of query operators included in the query statement.

Optionally, the plurality of query operators may include, but are not limited to, a filter operator, a join operator, an aggregation operator, a projection operator, and the like.

Illustratively, following the example in step S302, the "WHERE" keyword is typically used to introduce a filter condition and corresponds to the filter operator; the "JOIN" keyword is used for the table join operation and corresponds to the join operator; aggregate functions such as "SUM" and "COUNT", together with the "GROUP BY" keyword, are used for data aggregation and correspond to the aggregation operator; and the columns specified after the "SELECT" keyword correspond to the projection operator. In this way, by identifying the plurality of query operators in the query statement, the intent and the operation flow of the query can be clarified, which provides a basis for the subsequent determination of the target query plan.
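
The keyword-to-operator correspondence described above can be sketched as a small lookup. For brevity this sketch scans the statement text directly instead of walking the abstract syntax tree, and the names `OPERATOR_PATTERNS` and `extract_operators` are hypothetical.

```python
import re

# Keyword-to-operator mapping taken from the description; patterns are an assumption.
OPERATOR_PATTERNS = {
    "filter": r"\bWHERE\b",
    "join": r"\bJOIN\b",
    "aggregate": r"\b(SUM|COUNT)\s*\(|\bGROUP\s+BY\b",
    "projection": r"\bSELECT\b",
}

def extract_operators(sql):
    """Return, in sorted order, the operator kinds whose keywords occur in the statement."""
    return sorted(
        kind
        for kind, pattern in OPERATOR_PATTERNS.items()
        if re.search(pattern, sql, re.IGNORECASE)
    )

operators = extract_operators(
    "SELECT column1, SUM(column2) FROM table1 JOIN table2 "
    "ON table1.key = table2.key WHERE table1.value > 5 GROUP BY column1"
)
```

A text scan like this would misfire on keywords inside string literals, which is one reason to extract operators from the tree in practice.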

S304, a query plan evaluation model is called, a plurality of query plans are evaluated respectively, and a target query plan is determined from the plurality of query plans based on an evaluation result.

Wherein the target query plan contains a plurality of query operators.

In an alternative implementation, the database management node determining a target query plan from the plurality of query plans based on the evaluation results may include: determining, from the plurality of query plans and based on the evaluation results, a query plan that satisfies the following conditions as the target query plan: the cost corresponding to the execution order of the plurality of query operators is minimal; the data transmission amount corresponding to the plurality of target working nodes is minimal; and the computing power resources required by the query task allocated to each target working node are less than the computing power resources of that target working node.

Minimizing the cost corresponding to the execution order of the plurality of query operators reduces the cost of computation and data transmission. For example, consider a query statement that includes a filter operator (WHERE), a join operator (JOIN) and an aggregation operator (GROUP BY), such as "SELECT COUNT(*) FROM table1 JOIN table2 ON table1.key = table2.key WHERE table1.value > 10 GROUP BY table1.column1". If the filter "table1.value > 10" is executed first at the node where the data is located, and the join and aggregation operations are performed afterwards, the amount of data involved in the join and aggregation is greatly reduced compared with joining before filtering, so the computation and transmission costs are reduced.
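
The effect of operator order can be illustrated with a small in-memory experiment. The tables, their sizes and the function names below are made up for the sketch; the point is only that both orders return the same rows while the filter-first order feeds fewer rows into the join.

```python
# Hypothetical sample data: 1000 rows in table1, 500 matching keys in table2.
table1 = [{"key": i, "value": i % 20, "column1": i % 3} for i in range(1000)]
table2 = [{"key": i} for i in range(0, 1000, 2)]

def hash_join(left, right):
    """Equi-join on "key" using a hash index built over the right input."""
    index = {}
    for r in right:
        index.setdefault(r["key"], []).append(r)
    return [(l, r) for l in left for r in index.get(l["key"], [])]

def join_then_filter():
    """Join first: every table1 row enters the join."""
    joined = hash_join(table1, table2)
    return len(table1), [(l, r) for l, r in joined if l["value"] > 10]

def filter_then_join():
    """Filter first: only rows with value > 10 enter the join."""
    filtered = [l for l in table1 if l["value"] > 10]
    return len(filtered), hash_join(filtered, table2)

rows_in_a, result_a = join_then_filter()
rows_in_b, result_b = filter_then_join()
```

With this sample data the join-first order sends 1000 rows into the join and the filter-first order sends 450, for an identical result of 200 rows.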

Minimizing the data transmission amount corresponding to the plurality of target working nodes reduces the data transmission delay. For example, suppose the distributed database has three nodes, NodeA, NodeB and NodeC, the data of table1 is distributed across NodeA and NodeB, the data of table2 is mainly on NodeC, and the query involves a join of table1 and table2. If the network transmission delay between NodeA and NodeC is high and the network transmission delay between NodeB and NodeC is low, the data on NodeB, rather than the data on NodeA, is selected to be transmitted to NodeC for the join operation, which reduces the data transmission delay.

Keeping the computing power resources required by the query task allocated to each target working node below the computing power resources of that node improves computing efficiency. For example, for a query containing complex computation, such as an aggregation involving nested functions and a large amount of data, if NodeA has more computing power resources than NodeB, the database management node may allocate the complex computing tasks to NodeA, while NodeB performs relatively simple tasks such as data filtering, thereby improving overall computing efficiency.
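
The three selection conditions can be sketched as a feasibility check followed by a minimum search. The description does not say how the two "minimum" conditions are combined, so ordering by operator cost first and transfer volume second is an assumption of this sketch, as are the `Plan` fields and all numbers.

```python
from dataclasses import dataclass

@dataclass
class Plan:
    name: str
    operator_cost: float   # cost of the operator execution order
    transfer_bytes: int    # data moved between target working nodes
    required_power: dict   # node name -> computing power the plan needs there

def choose_plan(plans, node_power):
    """Pick the cheapest plan whose per-node demand stays below each node's capacity."""
    feasible = [
        p for p in plans
        if all(need < node_power[node] for node, need in p.required_power.items())
    ]
    if not feasible:
        raise ValueError("no plan fits the available computing power")
    # Assumed tie-breaking: operator cost first, then transferred data volume.
    return min(feasible, key=lambda p: (p.operator_cost, p.transfer_bytes))

node_power = {"NodeA": 10.0, "NodeB": 4.0}
plans = [
    Plan("join_first", 9.0, 5000, {"NodeA": 3.0, "NodeB": 3.0}),
    Plan("filter_first", 4.0, 1200, {"NodeA": 6.0, "NodeB": 1.0}),
    Plan("overloads_node_b", 1.0, 100, {"NodeA": 2.0, "NodeB": 8.0}),
]
best = choose_plan(plans, node_power)
```

The third plan has the lowest nominal cost but is rejected because it asks more of NodeB than NodeB can supply, which leaves "filter_first" as the selected plan.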

S305, determining at least one target working node where data to be queried carried by the query request is located in the distributed cluster.

S306, pushing down each query operator contained in the target query plan to each target working node where the data to be queried is located based on a preset pushing down strategy, so that each target working node executes each query operator based on the target query plan, and a data query result is obtained.

In an alternative embodiment, the descriptions of steps S304 to S306 may be referred to the descriptions of steps S202 to S204, respectively, and will not be repeated here.

In the embodiment of the application, when a query request is received, the database management node can, in response to the query request, parse the query statement carried in the query request by using a syntax analyzer to obtain a target abstract syntax tree, and identify, based on the target abstract syntax tree, the plurality of query operators contained in the query statement, thereby clarifying the query intent and determining a plurality of query plans. The plurality of query plans are evaluated respectively based on the query plan evaluation model, and a target query plan with a smaller data transmission delay is determined from them based on the evaluation results, which reduces the influence of data transmission delay on query performance and improves data query efficiency. Each query operator contained in the target query plan is then pushed down, based on a preset push-down strategy, to each target working node where the data to be queried is located, which avoids a large amount of unnecessary data transmission, reduces the network load, and further improves data query efficiency.

In an alternative implementation, in the data query methods of the distributed database shown in fig. 2 and fig. 3, the computing power resources of each working node may be determined by the database management node as follows: determining hardware configuration information of each of the plurality of working nodes, where the hardware configuration information includes the number of central processing unit (CPU) cores and the memory capacity; and weighting each item of information included in the hardware configuration information of each working node based on a preset computing power resource evaluation strategy, and obtaining the computing power resources of each working node based on the weighted items of information.

Optionally, when the database management node weights each item of information included in the hardware configuration information of each working node based on the preset computing power resource evaluation strategy and obtains the computing power resources of each working node based on the weighted items of information, the following formula may be used: computing power resources = number of CPU cores × main frequency coefficient + memory capacity × memory coefficient.
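
Reading the formula as a weighted sum (each hardware figure multiplied by its coefficient), a minimal sketch looks like this. The default coefficient values and the use of gigabytes for memory capacity are assumptions for illustration; the application does not give concrete values or units.

```python
def computing_power(cpu_cores, memory_gb, freq_coeff=1.0, mem_coeff=0.5):
    """Weighted sum: CPU cores * main frequency coefficient + memory * memory coefficient.

    freq_coeff and mem_coeff are hypothetical weights chosen for this sketch.
    """
    return cpu_cores * freq_coeff + memory_gb * mem_coeff

# Two hypothetical working nodes with different hardware configurations.
node_a = computing_power(cpu_cores=8, memory_gb=32)
node_b = computing_power(cpu_cores=4, memory_gb=16)
```

With these assumed weights, the first node scores 24.0 and the second 12.0, so the first would be the better candidate for computation-heavy operators.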

By adopting the embodiment, the computing power resource of each working node can be accurately determined based on the hardware configuration information of each working node.

In an alternative embodiment, in the data query methods of the distributed database shown in fig. 2 and fig. 3, the database management node may further detect the running state of each target working node in real time, and, if a faulty target working node is detected, allocate the query task corresponding to the faulty target working node to any other target working node, among the at least one target working node, except the faulty target working node.

Alternatively, the operational status of each target working node may include, but is not limited to, CPU usage, memory usage, network connection status, and the like.

With this implementation, the continuity of data queries can be ensured, the reliability and fault tolerance of data queries can be improved, and data loss and query failure can be avoided.
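
The reassignment step can be sketched as follows. The description allows the task to go to any other target working node; choosing the least-loaded survivor is an assumption of this sketch, and the health map stands in for whatever monitoring of CPU usage, memory usage and network connection status is actually used.

```python
def reassign_failed_tasks(assignments, healthy):
    """Move the tasks of faulty nodes onto the remaining healthy target nodes.

    assignments: node name -> list of query tasks
    healthy:     node name -> bool (result of the running-state check)
    """
    survivors = [node for node in assignments if healthy.get(node, False)]
    if not survivors:
        raise RuntimeError("no healthy target working node left")
    result = {node: list(assignments[node]) for node in survivors}
    for node, tasks in assignments.items():
        if node in result:
            continue
        for task in tasks:
            # Assumed policy: send each orphaned task to the least-loaded survivor.
            target = min(survivors, key=lambda n: len(result[n]))
            result[target].append(task)
    return result

new_assignments = reassign_failed_tasks(
    {"NodeA": ["t1", "t2"], "NodeB": ["t3"], "NodeC": ["t4"]},
    {"NodeA": True, "NodeB": False, "NodeC": True},
)
```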

In an alternative embodiment, in the data query method of the distributed database shown in fig. 2 and 3, each target working node may execute the query operator in a multithreaded parallel processing manner when receiving the pushed query operator.

Optionally, each target working node executing the query operator by multithreaded parallel processing upon receiving the pushed-down query operator may include: decomposing the received query operator into a plurality of subtasks, associating each subtask with an independent thread, and running the plurality of threads in parallel to execute the plurality of subtasks.

Illustratively, when the filter operator is pushed down to a target working node (e.g., NodeA) for execution, and assuming that NodeA has 4 CPU cores, the filtering task is divided into 4 or more subtasks according to the number of data blocks or the data distribution (e.g., into 4 regions according to the distribution of the data blocks on the storage medium, each region corresponding to one subtask). Each subtask is handled by a separate thread, and the threads run in parallel on the CPU cores of NodeA. For example, for the filter condition "column1 > 10", thread 1 is responsible for filtering the data in data block 1, thread 2 is responsible for filtering the data in data block 2, and so on. The threads are coordinated through shared memory or other synchronization mechanisms (such as semaphores) to ensure the correctness and integrity of data processing.
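
The per-block decomposition can be sketched with a thread pool, one subtask per data block. The block contents and function names are hypothetical. Note that in standard CPython builds the global interpreter lock keeps pure-Python code from running on several cores at once, so this sketch shows the task structure rather than the speed-up a native engine would obtain.

```python
from concurrent.futures import ThreadPoolExecutor

def filter_block(block):
    """Subtask: apply the condition column1 > 10 to one data block."""
    return [row for row in block if row["column1"] > 10]

def parallel_filter(blocks, workers=4):
    """Run one filtering subtask per block on a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(filter_block, blocks)  # results come back in block order
        return [row for part in parts for row in part]

# Four hypothetical data blocks holding the values 0..19.
blocks = [[{"column1": v} for v in range(start, start + 5)] for start in (0, 5, 10, 15)]
filtered = parallel_filter(blocks)
```

Because each thread reads only its own block and returns a fresh list, no explicit locking is needed here; shared mutable state would require the synchronization mechanisms mentioned above.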

With this implementation, based on the principle of concurrent multithreaded execution in the operating system, multiple threads can share the CPU resources and process different data blocks in parallel, which reduces processing time and increases the local data processing speed. In a multi-core CPU environment, compared with single-threaded processing, the processing time can in the ideal case approach 1/(number of cores) of the original; in practice the gain is smaller because of thread creation and scheduling overhead. For example, on a node with a 4-core CPU, executing the filter operator with multithreaded parallel processing may increase the processing speed by 2 to 3 times, which speeds up the whole query. The gain is more obvious when large-scale data is processed, and the query performance of the distributed database system can be effectively improved.
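
One common way to relate the ideal 1/(number of cores) bound to a 2 to 3 times gain on 4 cores is Amdahl's law, which is brought in here as an illustration and is not part of the application. The parallel fractions used below are assumed values.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's law: speed-up when only part of the work can run in parallel."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

# Assumed parallel fractions for a filter operator on a 4-core node.
speedup_80 = amdahl_speedup(0.8, 4)
speedup_90 = amdahl_speedup(0.9, 4)
speedup_ideal = amdahl_speedup(1.0, 4)
```

Under these assumptions a workload that is 80% to 90% parallelizable gains roughly 2.5 to 3.1 times on 4 cores, and only a fully parallel workload reaches 4 times.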

In an alternative embodiment, in the data query method of the distributed database shown in fig. 3, the query statement carried in the query request may include a main query statement and a sub-query statement. In this case, the database management node parsing, in response to the query request, the query statement carried in the query request by using the syntax analyzer to obtain a target abstract syntax tree may include: performing lexical analysis, by using the syntax analyzer, on the main query statement and the sub-query statement included in the query statement respectively, to obtain a first lexical analysis result corresponding to the main query statement and a second lexical analysis result corresponding to the sub-query statement; decomposing the main query statement based on the first lexical analysis result to obtain a decomposed main query statement, and decomposing the sub-query statement based on the second lexical analysis result to obtain a decomposed sub-query statement; performing syntax analysis on the decomposed main query statement and constructing, based on the syntax analysis result, a first abstract syntax tree corresponding to the main query statement; performing syntax analysis on the decomposed sub-query statement and constructing, based on the syntax analysis result, a second abstract syntax tree corresponding to the sub-query statement; and obtaining the target abstract syntax tree based on the first abstract syntax tree and the second abstract syntax tree.

In the following, the overall process of the data query method of the distributed database provided by the embodiment of the present application is described, taking as an example the query statement "SELECT * FROM table3 WHERE column3 LIKE '%abc%' AND column4 IN (SELECT column5 FROM table4)" and a distributed cluster whose working nodes include NodeC, NodeD and NodeE. Here, "SELECT column5 FROM table4" is the sub-query statement, which is used to obtain an intermediate result set; further filtering and association operations are then performed in the main query statement according to this intermediate result set.

First, the database management node can construct a query plan evaluation model based on the data transmission delay between every two of NodeC, NodeD and NodeE, the computing power resources of each node, and the data storage layout (such as the storage format and the data compression mode). When a query request is received, the syntax analyzer is used to perform lexical analysis on the main query statement and the sub-query statement included in the query statement of the query request respectively, to obtain a first lexical analysis result corresponding to the main query statement and a second lexical analysis result corresponding to the sub-query statement. The main query statement is decomposed based on the first lexical analysis result to obtain a decomposed main query statement, and the sub-query statement is decomposed based on the second lexical analysis result to obtain a decomposed sub-query statement. Syntax analysis is performed on the decomposed main query statement, and a first abstract syntax tree corresponding to the main query statement is constructed based on the syntax analysis result; syntax analysis is performed on the decomposed sub-query statement, and a second abstract syntax tree corresponding to the sub-query statement is constructed based on the syntax analysis result. The target abstract syntax tree is then obtained based on the first abstract syntax tree and the second abstract syntax tree.

Second, the database management node may extract a main query operator and a sub-query operator included in the query statement from the target abstract syntax tree, and generate a plurality of query plans based on the main query operator and the sub-query operator. For example, LIKE and IN conditions are filter operators IN the main query operator, used for filtering data according to specific conditions IN the main query, and sub-query operators are used for embedding sub-queries IN the main query to obtain more complex query results.

Next, the database management node invokes the query plan evaluation model, evaluates the plurality of query plans respectively, and determines a target query plan from the plurality of query plans based on the evaluation results. For example, the target query plan is: first execute the sub-query and cache the result, then execute the filtering operation in the main query, and finally perform the join operation.

Then, the database management node can acquire the target working nodes where the table3 and the table4 are located from the data distribution information corresponding to the database metadata. For example, data for Table3 is distributed at NodeC and NodeD, and data for Table4 is distributed at NodeE.

Then, based on a preset push-down strategy, the database management node may push down the sub-query operator to NodeE, push down the filter operators in the main query operator to NodeC and NodeD, and push down the join operator in the main query operator to NodeD, so that NodeC, NodeD and NodeE execute each query operator based on the target query plan (i.e., first execute the sub-query and cache the result, then execute the filtering operation in the main query, and finally perform the join operation) to obtain the data query result. That is, NodeE first executes the sub-query "SELECT column5 FROM table4" and stores the result in its local cache. NodeC and NodeD then each start a filtering process, filter their local table3 data according to the condition "column3 LIKE '%abc%' AND column4 IN", and store the filtered result in local memory. Finally, NodeD starts the join process, obtains the sub-query result from NodeE, and joins it with the locally filtered results to obtain the data query result. Optionally, NodeE may also return the qualifying data query result set to the database management node over the network.
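
The execution order of this example (sub-query on NodeE, filtering on NodeC and NodeD, join on NodeD) can be simulated in memory as follows. The sample rows in `NODE_DATA` and all function names are made up for the sketch, network transfer is replaced by ordinary function calls, and the IN condition is applied in the join step as a simplification.

```python
# Hypothetical in-memory stand-ins for the data held by each working node.
NODE_DATA = {
    "NodeC": {"table3": [{"column3": "xxabcxx", "column4": 1},
                         {"column3": "zzz", "column4": 2}]},
    "NodeD": {"table3": [{"column3": "abc", "column4": 3},
                         {"column3": "1abc2", "column4": 9}]},
    "NodeE": {"table4": [{"column5": 1}, {"column5": 3}]},
}

def run_subquery(node):
    """Sub-query operator: SELECT column5 FROM table4, cached as a set."""
    return {row["column5"] for row in NODE_DATA[node]["table4"]}

def run_filter(node):
    """Filter operator: column3 LIKE '%abc%' on the node's local table3 rows."""
    return [row for row in NODE_DATA[node]["table3"] if "abc" in row["column3"]]

def run_join(filtered_parts, subquery_result):
    """Join operator: keep filtered rows whose column4 is in the sub-query result."""
    return [row for part in filtered_parts for row in part
            if row["column4"] in subquery_result]

def execute_plan():
    cached = run_subquery("NodeE")                         # step 1: sub-query, cache result
    filtered = [run_filter("NodeC"), run_filter("NodeD")]  # step 2: local filtering
    return run_join(filtered, cached)                      # step 3: join on NodeD

query_result = execute_plan()
```

Only the small sub-query result set and the already filtered rows cross node boundaries in this flow, which is the data-movement saving that operator push-down aims for.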

It should be understood that, although the steps in the flowcharts of the embodiments described above are shown in sequence as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the order of execution of the steps is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; these sub-steps or stages are also not necessarily performed sequentially, but may be performed in turn or alternately with at least some of the other steps or with sub-steps or stages of other steps.

Based on the same inventive concept, the embodiment of the application also provides a data query device of the distributed database for implementing the data query method of the distributed database described above. The solution provided by the device is implemented in a manner similar to that described for the method, so for the specific limitations in the one or more embodiments of the data query device of the distributed database provided below, reference may be made to the limitations of the data query method of the distributed database above, which will not be repeated here.

Referring to fig. 4, fig. 4 is a schematic structural diagram of a data query device of a distributed database according to an embodiment of the present application. As shown in fig. 4, the apparatus is applied to a database management node in a distributed cluster deployed with a distributed database, and may include, but is not limited to:

a construction module 401, configured to construct a query plan evaluation model based on a data transmission delay between each two working nodes in the plurality of working nodes included in the distributed cluster, a computing power resource of each working node, and data storage layout information;

The evaluation module 402 is configured to invoke a query plan evaluation model, evaluate a plurality of query plans generated based on the query request, and determine a target query plan from the plurality of query plans based on an evaluation result, where the target query plan includes a plurality of query operators;

a determining module 403, configured to determine at least one target working node where data to be queried carried by the query request is located in the distributed cluster;

And the processing module 404 is configured to push down each query operator included in the target query plan to each target working node where the data to be queried is located based on a preset push-down policy, so that each target working node executes each query operator based on the target query plan, and a data query result is obtained.

In one embodiment, when determining the target query plan from the plurality of query plans based on the evaluation results, the evaluation module 402 is specifically configured to determine, from the plurality of query plans and based on the evaluation results, a query plan that satisfies the following conditions as the target query plan: the cost corresponding to the execution order of the plurality of query operators is minimal; the data transmission amount corresponding to the plurality of target working nodes is minimal; and the computing power resources required by the query task allocated to each target working node are less than the computing power resources of that target working node.

In one embodiment, the determining module 403 is further configured to determine hardware configuration information of each of the plurality of working nodes, where the hardware configuration information includes the number of central processing unit (CPU) cores and the memory capacity, weight each item of information included in the hardware configuration information of each working node based on a preset computing power resource evaluation strategy, and obtain the computing power resources of each working node based on the weighted items of information.

In one embodiment, the device further comprises a detection module, which is configured to detect the running state of each target working node in real time and, if a faulty target working node is detected, allocate the query task corresponding to the faulty target working node to any other target working node, among the at least one target working node, except the faulty target working node.

In one embodiment, the apparatus further comprises a parsing module, an extracting module, and a generating module. The parsing module is configured to, in response to the query request, parse the query statement carried in the query request by using a syntax analyzer to obtain a target abstract syntax tree; the extracting module is configured to extract a plurality of query operators included in the query statement from the target abstract syntax tree; and the generating module is configured to generate a plurality of query plans based on the plurality of query operators included in the query statement.

In one embodiment, when parsing, in response to the query request, the query statement carried in the query request to obtain a target abstract syntax tree, the parsing module is specifically configured to: in response to the query request, perform lexical analysis on the query statement carried in the query request, and decompose the query statement based on the lexical analysis result to obtain a decomposed query statement; and perform syntax analysis on the decomposed query statement, and construct the target abstract syntax tree based on the syntax analysis result.

The modules in the data query device of the distributed database may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor in the terminal device in hardware form, or may be stored in a memory in the terminal device in software form, so that the processor can call them and execute the operations corresponding to the above modules.

In an exemplary embodiment, an embodiment of the present application provides a computer device, which may be a server, and whose internal structure may be as shown in fig. 5. The computer device includes a processor, a memory, an input/output interface, a communication interface, and an input device. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface, the display unit and the input device are connected to the system bus through the input/output interface. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The input/output interface of the computer device is used to exchange information between the processor and external devices. The communication interface of the computer device is used for wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, a mobile cellular network, Near Field Communication (NFC) or other technologies. The computer program, when executed by the processor, implements a data query method of a distributed database.

It will be appreciated by those skilled in the art that the structure shown in FIG. 5 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.

In one exemplary embodiment, the application provides a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps in the data query method of the distributed database described above when the computer program is executed.

In one exemplary embodiment, the application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the data query method of a distributed database described above.

In an exemplary embodiment, the application provides a computer program product comprising a computer program which, when executed by a processor, implements the steps in the data query method of a distributed database described above.

It should be noted that, the data related to the present application (including, but not limited to, the data transmission delay between every two working nodes, the computing power resource and data storage layout information of each working node, the query request, the target query plan, etc.) are all information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data are required to meet the related regulations.

Those skilled in the art will appreciate that all or part of the above described methods may be implemented by a computer program stored on a non-transitory computer readable storage medium, and the computer program, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile memory and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory can include Random Access Memory (RAM) or external cache memory, and the like. By way of illustration and not limitation, RAM can take various forms such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computation, an Artificial Intelligence (AI) processor, or the like, but is not limited thereto.

The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered to be within the scope of this description.

The foregoing examples illustrate only a few embodiments of the application and are described specifically and in detail, but they are not to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, and these all fall within the scope of the application. Accordingly, the scope of the application should be determined by the appended claims.

Claims

1. The data query method of the distributed database is characterized by being applied to a database management node in a distributed cluster where the distributed database is deployed, and comprises the following steps of:

constructing a query plan evaluation model based on data transmission delay between every two working nodes in a plurality of working nodes included in the distributed cluster, computing power resources of each working node, and data storage layout information;

Under the condition of receiving a query request, calling the query plan evaluation model, respectively evaluating a plurality of query plans generated based on the query request, and determining a target query plan from the plurality of query plans based on evaluation results, wherein the target query plan comprises a plurality of query operators;

determining at least one target working node, in the distributed cluster, where the data to be queried carried by the query request is located;

Based on a preset push-down strategy, pushing down each query operator contained in the target query plan to each target working node where the data to be queried is located, so that each target working node executes each query operator based on the target query plan, and a data query result is obtained.

2. The method of claim 1, wherein the determining a target query plan from a plurality of the query plans based on the evaluation result comprises:

based on the evaluation result, determining a query plan satisfying the following conditions from a plurality of query plans as a target query plan:

the cost corresponding to the execution sequence of the plurality of query operators is minimum;

the data transmission quantity corresponding to the target working nodes is minimum;

the computational power resources required for the query task assigned to each of the target working nodes are less than the computational power resources of the target working node.

3. The method of claim 1, wherein the computational power resources of each of the working nodes are determined by:

Determining hardware configuration information of each working node in a plurality of working nodes, wherein the hardware configuration information comprises the number of central processing unit cores and the memory capacity;

And carrying out weighting processing on each item of information included in the hardware configuration information of each working node based on a preset computing power resource evaluation strategy, and obtaining the computing power resource of each working node based on each item of weighted information.

4. The method according to claim 1, wherein the method further comprises:

detecting the running state of each target working node in real time;

when a faulty target working node is detected, assigning the query task corresponding to the faulty target working node to any target working node, among the at least one target working node, other than the faulty target working node.
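The reassignment step can be sketched as follows. The claim allows the task to go to any other target working node; choosing the least-loaded healthy node is an assumption made here for illustration, and `reassign_failed_tasks` and its arguments are hypothetical names.

```python
from typing import Dict, List


def reassign_failed_tasks(
    assignments: Dict[str, List[str]],
    healthy: Dict[str, bool],
) -> Dict[str, List[str]]:
    """Move the tasks of faulty target nodes onto the remaining healthy ones.

    assignments: node id -> query tasks currently assigned to that node.
    healthy: node id -> latest detected running state (True means running).
    """
    survivors = [node for node in assignments if healthy.get(node, False)]
    if not survivors:
        raise RuntimeError("no healthy target working node is available")
    result = {node: list(assignments[node]) for node in survivors}
    for node, tasks in assignments.items():
        if node in result:
            continue  # healthy node, nothing to move
        for task in tasks:
            # Assumption: pick the survivor that currently holds the fewest tasks.
            target = min(survivors, key=lambda n: len(result[n]))
            result[target].append(task)
    return result
```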

5. The method according to claim 1, wherein the method further comprises:

in response to the query request, parsing the query statement carried in the query request by using a syntax parser to obtain a target abstract syntax tree;

extracting a plurality of query operators included in the query statement from the target abstract syntax tree;

generating the plurality of query plans based on the plurality of query operators included in the query statement.

6. The method of claim 5, wherein parsing, in response to the query request, the query statement carried in the query request to obtain the target abstract syntax tree comprises:

in response to the query request, performing lexical analysis on the query statement carried in the query request, and decomposing the query statement based on the lexical analysis result to obtain a decomposed query statement;

performing syntax analysis on the decomposed query statement, and constructing the target abstract syntax tree based on the syntax analysis result.
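The lexical analysis, syntax analysis, and operator extraction steps can be sketched for a deliberately tiny query form, `SELECT columns FROM table [WHERE column op value]`. The grammar, the dictionary-based tree, and the names `tokenize`, `parse`, and `extract_operators` are assumptions for illustration only; a real syntax parser would cover far more of the query language.

```python
import re
from typing import List

# One token: a word/number, a two-character comparison, or a single symbol.
TOKEN_RE = re.compile(r"\s*(\w+|<=|>=|!=|[=<>,*])")


def tokenize(sql: str) -> List[str]:
    """Lexical analysis: decompose the statement into tokens."""
    sql = sql.strip()
    tokens, pos = [], 0
    while pos < len(sql):
        match = TOKEN_RE.match(sql, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse(tokens: List[str]) -> dict:
    """Syntax analysis: build a small tree of project -> [filter] -> scan."""
    def expect(i: int, word: str) -> int:
        if i >= len(tokens) or tokens[i].upper() != word:
            raise ValueError(f"expected {word}")
        return i + 1

    i = expect(0, "SELECT")
    columns = []
    while i < len(tokens) and tokens[i].upper() != "FROM":
        if tokens[i] != ",":
            columns.append(tokens[i])
        i += 1
    i = expect(i, "FROM")
    if i >= len(tokens):
        raise ValueError("missing table name")
    node = {"op": "scan", "table": tokens[i]}
    i += 1
    if i < len(tokens):
        i = expect(i, "WHERE")
        if len(tokens) - i != 3:
            raise ValueError("expected: column operator value")
        node = {"op": "filter", "predicate": tuple(tokens[i:i + 3]), "child": node}
    return {"op": "project", "columns": columns, "child": node}


def extract_operators(ast: dict) -> List[str]:
    """Walk the tree from root to leaf and collect the operator names."""
    operators, node = [], ast
    while node is not None:
        operators.append(node["op"])
        node = node.get("child")
    return operators
```

The extracted operator list is the input from which alternative query plans, for example different operator orderings, could then be generated.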

7. A data query device for a distributed database, characterized in that the device is applied to a database management node in a distributed cluster having the distributed database, and comprises:

a construction module, configured to construct a query plan evaluation model based on the data transmission delay between every two working nodes among a plurality of working nodes included in the distributed cluster, the computing power resources of each working node, and data storage layout information;

an evaluation module, configured to call the query plan evaluation model when a query request is received, evaluate each of a plurality of query plans generated based on the query request, and determine a target query plan from the plurality of query plans based on the evaluation results, wherein the target query plan comprises a plurality of query operators;

a processing module, configured to push down, based on a preset push-down strategy, each query operator contained in the target query plan to each target working node where the data to be queried is located, so that each target working node executes each query operator based on the target query plan to obtain a data query result.

8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 6 when executing the computer program.

9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 6.

10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 6.