CN119415973B

Movatterモバイル変換

Info

Publication number: CN119415973B
Application number: CN202510009393.0A
Authority: CN
Inventors: 张功贯; 钱雁嫦; 马琪坤; 吴怡雯; 杨晓峰; 陈鹏; 蒋杰
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2025-01-03
Filing date: 2025-01-03
Publication date: 2025-08-01
Anticipated expiration: 2045-01-03
Also published as: CN119415973A

Abstract

The application discloses a data processing method, a device, a product and equipment, wherein the method comprises the steps of obtaining a candidate plan tree of an operation sentence and N sub-trees of the candidate plan tree, N is a positive integer, the candidate plan tree is used for executing the operation sentence, obtaining a data set, the data set comprises resource data consumed by M historical sub-trees in a historical execution process, the historical sub-trees are sub-trees of the historical plan tree, M is a positive integer, performing first matching processing on the resource data in the data set, wherein the first matching processing is used for screening matched resource data through the similarity between the N sub-trees and the M historical sub-trees to N sub-trees, determining reference resource data of the N sub-trees based on a first matching result obtained by the first matching processing, and estimating execution cost corresponding to the candidate plan tree by adopting the reference resource data of the N sub-trees. By adopting the method and the device, the accuracy of the estimated execution cost of the candidate planning tree can be improved.

Description

Data processing method, device, product and equipment

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a data processing method, apparatus, product, and device.

Background

The SQL statement (structured statement) is a statement for querying and operating data in a database, in practical application, the SQL statement is converted into a plurality of candidate plan trees, and then a candidate plan tree for executing the SQL statement is determined from the plurality of candidate plan trees through execution costs corresponding to the plurality of candidate plan trees. It can be seen that the execution cost corresponding to the candidate planning tree of the SQL sentence plays an extremely important role in the execution process of the SQL sentence, therefore, how to obtain the execution cost corresponding to the candidate planning tree of the SQL sentence more accurately is a hot spot problem.

Disclosure of Invention

The application provides a data processing method, a device, a product and equipment, which can improve the accuracy of estimated execution cost of a candidate planning tree.

In one aspect, the present application provides a data processing method, including:

acquiring a candidate plan tree of an operation sentence and N sub-trees of the candidate plan tree, wherein N is a positive integer, and the candidate plan tree is used for executing the operation sentence;

Acquiring a data set, wherein the data set comprises resource data consumed by M historical subtrees in the historical execution process, the historical subtrees are subtrees of a historical plan tree after historical execution, and M is a positive integer;

performing first matching processing of resource data on the N sub-trees in the data set, wherein the first matching processing is used for screening matched resource data for the N sub-trees through the similarity between the N sub-trees and the M historical sub-trees;

determining the reference resource data of N sub-trees based on a first matching result obtained by the first matching process;

And estimating the execution cost corresponding to the candidate plan tree by adopting the reference resource data of the N subtrees.

In one aspect, the present application provides a data processing apparatus, comprising:

the first acquisition module is used for acquiring a candidate planning tree of the operation sentence and N sub-trees of the candidate planning tree, wherein N is a positive integer, and the candidate planning tree is used for executing the operation sentence;

the second acquisition module is used for acquiring a data set, wherein the data set comprises resource data consumed by M historical subtrees in the historical execution process, the historical subtrees are subtrees of a historical plan tree which is executed in a historical manner, and M is a positive integer;

The matching module is used for carrying out first matching processing on the resource data of the N sub-trees in the data set, and the first matching processing is used for screening matched resource data for the N sub-trees through the similarity between the N sub-trees and the M historical sub-trees;

The determining module is used for determining the reference resource data of the N sub-trees based on a first matching result obtained by the first matching process;

And the estimating module is used for estimating the execution cost corresponding to the candidate plan tree by adopting the reference resource data of the N subtrees.

In one embodiment, the matching module performs a first matching process on the resource data for the N sub-trees in the dataset, including:

respectively carrying out standardized representation processing on each sub tree in the N sub trees to generate standard representation information of each sub tree;

Performing first matching processing of resource data on the N subtrees in the data set based on the standard representation information of the N subtrees;

the method comprises the steps of storing standard representation information of each historical subtree in a data set, wherein resource data consumed by each historical subtree in the data set has a mapping relation with the standard representation information of each historical subtree respectively, and the similarity between the standard representation information of N subtrees and the standard representation information of M historical subtrees is used for reflecting the similarity between N subtrees and M historical subtrees.

In one embodiment, any of the N subtrees is a first subtree, and the dataset includes M pieces of standard representation information for M historical subtrees;

The matching module performs a first matching process on the resource data of the N subtrees in the data set based on the standard representation information of the N subtrees, and includes:

Obtaining information similarity between standard representation information of a first subtree and M pieces of standard representation information respectively;

determining standard representation information with the maximum information similarity with standard representation information of a first subtree in the M pieces of standard representation information as undetermined representation information;

If the information similarity between the standard representation information and the undetermined representation information of the first subtree is greater than or equal to a set similarity threshold value, taking the resource data with a mapping relation between the data set and the undetermined representation information as the screened resource data matched with the first subtree;

The first matching result comprises the screened resource data matched with the first subtree.

In one embodiment, the matching module is further configured to:

If the information similarity between the standard representation information and the undetermined representation information of the first subtree is smaller than a similarity threshold, determining that the resource data matched with the first subtree is not screened in the data set;

wherein the first matching result includes resource data that is not screened out of the dataset for matches with the first subtree.

In one embodiment, the N sub-trees are each composed of one or more operator nodes of the candidate planning tree, and the determining module determines the manner of the reference resource data of the N sub-trees based on the first matching result obtained by the first matching process, and includes:

If the first matching result comprises the screened resource data matched with the first subtree, using the screened resource data matched with the first subtree as the reference resource data of the first subtree;

if the first matching result includes resource data which is not screened to match with the first subtree, determining the reference resource data of the first subtree based on the number of operator nodes contained in the first subtree.

In one embodiment, the determining module determines, based on the number of operator nodes included in the first sub-tree, a manner of referencing resource data of the first sub-tree, including:

When the first subtree contains one operator node of the candidate plan tree, taking the resource data consumed by the table data called by the first subtree as the reference resource data of the first subtree;

When the first subtree contains a plurality of operator nodes of the candidate plan tree, performing second matching processing on the resource data of the first subtree in the data set, and determining reference resource data of each operator node in the first subtree based on a second matching result obtained by the second matching processing, wherein the reference resource data of the first subtree comprises the reference resource data of each operator node in the first subtree, and the second matching processing is used for screening matched resource data for each operator node in the first subtree through similarity between each operator node in the first subtree and M historical subtrees.

In one embodiment, any operator node in the first subtree is a target operator node, and the determining module determines the reference resource data mode of each operator node in the first subtree based on a second matching result obtained by the second matching process, including:

if the second matching result comprises the screened resource data matched with the target operator node, using the screened resource data matched with the target operator node as the reference resource data of the target operator node;

And if the second matching result comprises resource data which is not screened and matched with the target operator node, taking the resource data consumed by the table data called by the target operator node as the reference resource data of the target operator node.

In one embodiment, any one of the N subtrees is a first subtree, and the matching module performs standardized representation processing on each subtree of the N subtrees to generate standard representation information of each subtree, and the method comprises the following steps:

Carrying out data analysis processing on the first subtree from K data fields to obtain the representation data of the first subtree in K data fields, wherein K is a positive integer;

Representing the representation data of the first subtree in K data fields according to the set standard format to generate standard representation information of the first subtree;

The K data fields comprise at least one operator field corresponding to an operator called by a subtree, an object field corresponding to a data object in table data called by the subtree, and a condition field corresponding to a condition contained in the operator called by the subtree.

In one embodiment, the first obtaining module obtains a flow of N subtrees of the candidate plan tree, including:

Identifying branch nodes in the candidate plan tree, and dividing each branch under the branch nodes into first type subtrees of the candidate plan tree respectively;

Respectively constructing a second type subtree of the candidate plan tree through each operator node except the first type subtree in the candidate plan tree, wherein one operator node except the first type subtree in the candidate plan tree is used for constructing and obtaining one second type subtree;

and taking the first type subtree and the second type subtree as N subtrees obtained by dividing the candidate planning tree.

In one embodiment, the plurality of candidate planning trees of the operation sentence is provided, and the data processing apparatus further includes an execution module for:

Selecting a candidate plan tree with the minimum corresponding execution cost from the plurality of candidate plan trees as a target plan tree of the operation sentence;

And executing the operation statement by adopting the target plan tree.

In one embodiment, the resource data consumed by each historical subtree in the dataset has a mapping relationship with the respective standard representation information of each historical subtree, and any subtree of the target plan tree is a second subtree;

After executing the target plan tree, the execution module is further configured to:

if the data set does not contain standard representation information of the second subtree, acquiring resource data consumed by the second subtree in the execution process after the second subtree is executed;

and constructing a mapping relation between the resource data consumed by the second subtree and the standard representation information of the second subtree, and adding the constructed mapping relation into the data set.

In one embodiment, the manner in which the execution module obtains the resource data consumed by the second subtree during execution includes:

acquiring input and output data of the operator node at the last position in the second subtree, and taking the input and output data of the operator node at the last position as the input and output data of the second subtree;

acquiring memory data consumed by each operator node in the second subtree, and taking the memory data consumed by the operator node in the second subtree at most as the memory data consumed by the second subtree;

Acquiring the processor data consumed by each operator node in the second subtree, and taking the processor data consumed by the operator node in the second subtree at most as the processor data consumed by the second subtree;

The resource data consumed by the second subtree comprises input and output data of the second subtree, memory data consumed by the second subtree and processor data consumed by the second subtree.

In one embodiment, the first obtaining module obtains a flow of a candidate planning tree of an operation sentence, including:

Performing statement analysis processing on the operation statement to generate a logic plan tree of the operation statement;

and performing conversion processing on the logic plan tree to generate a plurality of candidate plan trees of the operation sentences.

In one embodiment, the estimating module adopts reference resource data of N subtrees to estimate an execution cost corresponding to the candidate planning tree, and the method includes:

inputting the reference resource data of the N subtrees into a cost estimation model;

And calling a cost estimation model to estimate the execution cost corresponding to the candidate plan tree based on the reference resource data of the N subtrees.

In one aspect the application provides a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the method of one aspect of the application.

An aspect of the application provides a computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the method of the above aspect.

According to one aspect of the present application, there is provided a computer program product comprising a computer program stored in a computer readable storage medium. The processor of the computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program to cause the computer device to execute the method provided in various optional manners of the above aspect and the like.

The method and the device can acquire candidate plan trees of operation sentences and N sub-trees of the candidate plan trees, wherein N is a positive integer, the candidate plan trees are used for executing the operation sentences, a data set can be acquired, the data set comprises resource data consumed by M historical subtrees in the historical execution process, the historical subtrees are subtrees of the historical plan trees which are executed in the historical mode, M is a positive integer, first matching processing of the resource data can be conducted on the N sub-trees in the data set, the first matching processing is used for screening matched resource data for the N sub-trees through similarity between the N sub-trees and the M historical subtrees, accordingly, reference resource data of the N sub-trees can be determined based on first matching results obtained through the first matching processing, and execution costs corresponding to the candidate plan trees can be estimated through the reference resource data of the N sub-trees. Therefore, the method provided by the application can acquire the reference resource data of each subtree in the candidate plan tree by taking the subtree as granularity, and the reference resource data can be obtained by matching the resource data actually consumed by the history subtree in the actual history execution process, so that the accuracy and the reliability of the reference resource data acquired by each subtree are ensured, and the execution cost corresponding to the candidate plan tree can be accurately estimated through the accurate and reliable reference resource data of each subtree.

Drawings

In order to more clearly illustrate the application or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it being obvious that the drawings in the description below are only some embodiments of the application, and that other drawings can be obtained from them without inventive effort for a person skilled in the art.

FIG. 1 is a schematic diagram of a network architecture for data processing according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a scenario for obtaining corresponding execution costs of a candidate planning tree according to an embodiment of the present application;

FIG. 3 is a schematic flow chart of a data processing method according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a candidate planning tree according to an embodiment of the present application;

FIG. 5 is an interface schematic diagram of a diagnostic interface for operator nodes provided by an embodiment of the present application;

FIG. 6 is a schematic flow chart of a first matching process of resource data for N sub-trees in a dataset according to an embodiment of the present application;

FIG. 7 is a schematic diagram of a framework for collecting resource data consumed by a historic subtree according to an embodiment of the application;

FIG. 8 is a schematic diagram of a framework for executing an operation statement according to an embodiment of the present application;

FIG. 9 is a schematic diagram of a scenario for matching resource data according to an embodiment of the present application;

FIG. 10 is a flowchart of determining reference resource data of N subtrees according to a first matching result according to an embodiment of the present application;

FIG. 11 is a schematic diagram of a subtree structure of a first subtree according to an embodiment of the present application;

FIG. 12 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;

fig. 13 is a schematic structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

The following description of the embodiments of the present application will be made more apparent and fully hereinafter with reference to the accompanying drawings, in which some, but not all embodiments of the application are shown. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

All data collected by the method (such as related data of operation sentences, candidate plan trees, subtrees of the candidate plan trees, historical subtrees, resource data consumed by the historical subtrees and the like) are collected under the condition that a affiliated person (such as a user, an organization or an enterprise) of the data agrees and authorizes, and the collection, the use and the processing of the related data need to comply with related laws and regulations and standards of related areas.

SQL Structured Query Language, the structured query language, is a standard programming language for managing and operating relational databases. Where SQL statements are instructions for performing specific tasks or queries, including creation, reading, updating, deleting, etc., of data, these operations may generally be referred to simply as CRUD operations (basic operations in database management).

The relation algebra is a mathematical theory for describing and operating data in a database, and the theory abstracts a series of operation operators such as selection, projection, merging, intersection, difference, cartesian product, natural connection, division and the like for the operation of the data, and the combination of the operation operators can form complex data processing operation and provide a theoretical basis for the design of SQL language.

The abstract syntax tree is called AST for short, the lexical analyzer can decompose SQL source codes into individual lexical units (tokens), and the lexical units are organized into a tree structure according to the grammar rules by the syntax analyzer, and the tree structure can be called as the abstract syntax tree.

The logic execution plan is a plan generated in the SQL query optimization process and is used for describing the logic operation sequence and the execution mode of the SQL query, and the section translates the SQL logic into the execution sequence based on the operator composition in the relational algebra theory through AST without considering specific physical implementation details. Wherein, the logic execution plan can also be a tree structure, and the logic execution plan can be called as a logic plan tree.

The physical execution plan is a plan generated in the SQL query optimization process and is used for describing the specific physical operation sequence and execution mode of the SQL query, and details such as a bottom storage structure, an index and an execution algorithm are considered. The physical execution plan is the code logic that SQL actually runs. Wherein the physical execution plan may also be a tree structure, and the physical execution plan may be referred to as a physical plan tree.

The execution operator subtree (which can be called subtree for short) can be expressed as a directed acyclic graph, and the execution operator subtree is the local logic of the directed acyclic graph and can completely express the local segment of the data processing logic.

The CBO model is a cost-based optimization model (also called cost model), and is an optimization method for comprehensively considering various factors and selecting an optimal execution plan (such as an optimal physical execution plan) through cost calculation. The CBO model may be used to make predictions of corresponding execution costs for the physical planning tree.

Referring to fig. 1, fig. 1 is a schematic diagram of a network architecture for data processing according to an embodiment of the present application. As shown in fig. 1, the network architecture may include a server 200 and a cluster of terminal devices, which may include one or more terminal devices, the number of which will not be limited here. As shown in fig. 1, the plurality of terminal devices may specifically include a terminal device 1, a terminal device 2, a terminal device 3, a terminal device n, and as shown in fig. 1, the terminal device 2, the terminal device 3, the terminal device n may be connected to the server 200 through a network, so that each terminal device may interact with the server 200 through the network connection.

The server 200 shown in fig. 1 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (content delivery networks), and basic cloud computing services such as big data and artificial intelligence platforms. The terminal equipment can be intelligent terminals such as intelligent mobile phones, tablet personal computers, notebook computers, desktop computers, intelligent televisions, vehicle-mounted terminals, intelligent home and the like. A specific description of an embodiment of the present application will be made below taking communication between the terminal device 1 and the server 200 as an example.

Here, the present application takes the processed SQL statement as an example to perform related description, and in the practical application scenario, the processed SQL statement may be not only a statement for performing data query, but also a statement for performing other operations (such as deletion, modification, and addition) on data. Wherein the user may trigger a data query operation in the terminal device 1, the terminal device 1 may generate a data query request and send the data query request to the server 200. So that the server 200 can generate a data query statement (which is an SQL statement, which may be an operation statement described in the following procedure) for querying the data requested by the user according to the data query request.

The server 200 may generate a plurality of candidate plan trees (belonging to the physical plan tree) of the data query statement, and may estimate an execution cost corresponding to each candidate plan tree. Therefore, the data query statement can be finally executed through the estimated candidate plan tree with the minimum corresponding execution cost so as to query and acquire the data of the query requested by the user. The server 200 may return the queried and obtained data to the terminal device 1, so that the terminal device 1 may present the obtained data at the terminal interface for review by the user.

While the process of predicting the corresponding execution cost of each candidate planning tree by the server 200 may refer to fig. 2, fig. 2 is a schematic view of a scenario for obtaining the corresponding execution cost of the candidate planning tree according to the embodiment of the present application. As shown in fig. 2, the server 200 in fig. 1 may obtain a data set, where the data set may include resource data actually consumed by M history sub-trees in the history execution process, where M is a positive integer, where M is assumed to be equal to 5, and the M history sub-trees belong to sub-trees of the history plan tree actually executed by the history. And, the resource data consumed by each history subtree in the data set has a mapping relationship with the standard representation information of each history subtree, and the standard representation information of the history subtree is the representation information obtained after the standardized representation processing is performed on the history subtree, and the principle of the standardized representation processing can be referred to the related description in the corresponding embodiment of fig. 6 below.

The server 200 may also sub-tree the candidate plan tree to obtain N sub-trees of the candidate plan tree, where N is a positive integer, and where N is assumed to be equal to 4, the N sub-trees may include sub-tree H1, sub-tree H2, sub-tree H3, and sub-tree H4. The server 200 may further perform standardized representation processing on the N sub-trees to obtain standard representation information of each of the N sub-trees, including standard representation information S1 of the subtree H1, standard representation information S2 of the subtree H2, standard representation information S3 of the subtree H3, and standard representation information S4 of the subtree H4.

The server 200 may use the standard representation information of the N subtrees to perform a first matching process on the resource data in the dataset, where the first matching process is used to filter the matched resource data for the N subtrees to obtain a first matching result according to the similarity between the N subtrees and the M history subtrees, where the first matching result may include the reference resource data C1 matched for the subtree H1, the reference resource data C2 matched for the subtree H2, the reference resource data C3 matched for the subtree H3, and the reference resource data C4 matched for the subtree H4. The similarity between the N subtrees and the M history subtrees may be represented by the similarity between the standard representation information of the N subtrees and the standard representation information of the M history subtrees, so it may be understood that the server 200 may filter the resource data matched with the N subtrees in the data set to obtain the first matching result by the similarity between the standard representation information of the N subtrees and the standard representation information of the M history subtrees. The specific procedure for obtaining the first matching result through the first matching process may also be referred to as detailed description in the following embodiments.

The server 200 may determine the reference resource data of each of the N sub-trees according to the obtained first matching result, so that the execution cost corresponding to the candidate planning tree may be estimated by using the reference resource data of each of the N sub-trees. This particular procedure can also be seen in the relevant description of the embodiments described below.

The method provided by the application can acquire the reference resource data of the subtree in fine granularity, and the reference resource data can be obtained by matching the resource data actually consumed by the historical subtree in the data set in the actual execution process, so that the reference resource data determined by the subtree of the candidate plan tree can be used for accurately predicting the execution cost corresponding to the candidate plan tree and reducing the expenditure brought by executing SQL sentences.

Referring to fig. 3, fig. 3 is a flowchart of a data processing method according to an embodiment of the application. The execution body in the embodiment of the present application may be a data processing device (may be simply referred to as a processing device), where the processing device may be a computer device or a cluster of computer devices formed by multiple computer devices, and the computer device may be a server, a terminal device, or other devices. As shown in fig. 3, the method may include:

Step S101, a candidate planning tree of an operation sentence and N sub-trees of the candidate planning tree are obtained, wherein N is a positive integer, and the candidate planning tree is used for executing the operation sentence.

Specifically, the operation statement may be an SQL statement, which may be used to query or operate (e.g., add, delete, modify, etc.) data in the data table. The processing device can acquire the operation statement, the operation statement can be any SQL statement which needs to be executed currently, and the SQL statement can be determined according to the actual application scene.

The processing device may also obtain a candidate planning tree for the operation statement, where the candidate planning tree may be a physical planning tree for the operation statement, and the candidate planning tree may have a plurality of candidate planning trees. The process of obtaining the candidate planning tree of the operation sentence by the processing device may include that the processing device may perform syntax parsing processing on the operation sentence to generate an abstract syntax tree of the operation sentence, for example, the processing device may perform syntax parsing processing on the operation sentence by using a configured lexical analyzer and a syntax analyzer to generate the abstract syntax tree of the operation sentence.

The processing device may perform a logically optimized correlation process on the abstract syntax tree of the operation sentence to generate a logical plan tree of the operation sentence. For example, the processing device may optimize and convert the nodes in the abstract syntax tree into the nodes in the logic plan tree according to the configured processing logic, thereby obtaining the logic plan tree of the operation sentence. Thus, the processing apparatus can perform conversion processing on the logical plan tree of the operation sentence to obtain a plurality of physical plan trees of the operation sentence, each of which can be a candidate plan tree of the operation sentence in the present application. For example, the processing device may obtain a configured Query Optimizer (Query Optimizer) that may have a series of optimization rules and rewriters (Rewriter) by which to generate and derive a physical plan tree based on the structure and attributes of the logical plan tree, as well as the characteristics of the underlying storage system and execution engine.

Since the principles of acquiring the execution cost corresponding to each candidate planning tree are independent and the same, a process of acquiring the execution cost corresponding to one candidate planning tree is specifically described in the following process, that is, the number of candidate planning trees is temporarily de-emphasized in the following process, and is collectively referred to as the candidate planning tree, as described in the following description.

The processing device may further obtain N sub-trees of the candidate plan tree of the operation sentence, where N is a positive integer, and a specific value of N may be determined according to an actual application scenario, and the number of sub-trees divided by different candidate plan trees may be the same or different. I.e. the candidate plan tree may be divided into the N sub-trees. Wherein, the candidate planning tree may include at least one operator node (may be simply referred to as a node), where an operator node may be a basic building block of the candidate planning tree, and an operator node may represent a specific execution operation on the data table. Based on this, one sub-tree of the candidate plan tree may be formed by at least one operator node in the candidate plan tree, i.e. one sub-tree of the candidate plan tree may contain one or more operator nodes in the candidate plan tree.

In an exemplary embodiment, the process of the processing device obtaining the N sub-trees of the candidate plan tree may include that the processing device may identify branch nodes (also belonging to operator nodes of the candidate plan tree) in the candidate plan tree, where there may be one or more branch nodes, or in special cases, no branch nodes (indicating that there are no branch nodes in the candidate plan tree), and at least 2 branches may be connected under each branch node.

Thus, the processing device may divide each branch under the branch node into a sub-tree of the candidate plan tree, respectively, which may be referred to as a first type sub-tree. A branch under a branch node may be divided into a first type of subtrees, and thus the number of first type of subtrees may be equal to the number of all branches under the respective branch node.

And the processing device may further construct a sub-tree of the candidate plan tree, which may be referred to as a second type sub-tree, respectively, by each operator node of the candidate plan tree other than the first type sub-tree. One operator node in the candidate plan tree other than the first type of subtree may be used to construct a second type of subtree, i.e., the number of second type of subtrees may be equal to the number of operator nodes in the candidate plan tree other than the first type of subtree.

The processing device may divide the first type subtree and the second type subtree into N subtrees that are obtained by dividing the candidate planning tree, where the N subtrees may include each of the first type subtree and each of the second type subtrees obtained by the dividing.

The above manner of sub-tree division of the candidate plan tree is only described by way of example, and in actual application requirements, each sub-tree of the candidate plan tree may be further obtained by dividing in any other feasible manner, so that granularity of the sub-tree is used to obtain reference resource data of the candidate plan tree (may include reference resource data of each sub-tree of the candidate plan tree), and this specific process may be described in the following.

Step S102, a data set is obtained, wherein the data set comprises resource data consumed by M history subtrees in the history execution process, the history subtrees are subtrees of a history plan tree which is executed in a history manner, and M is a positive integer.

Specifically, the processing device may acquire a data set, where the data set may include resource data actually consumed by each of M historical subtrees in the historical execution process, M is a positive integer, a specific value of M may be determined according to an actual application scenario, where M may be a number of resource data included in the current data set when acquiring an execution cost corresponding to the candidate plan tree, and one historical subtree may have one resource data actually consumed.

The history subtree may be a subtree of a history planning tree actually executed by the history, the history planning tree may be a physical planning tree actually executed by the SQL statement executed by the history, and a division manner of the history subtree of the history planning tree may be the same as a division manner of the subtree of the candidate planning tree.

According to the data set, when the physical planning tree of each SQL statement is executed, the resource data actually consumed by each subtree in the physical planning tree in the execution process can be collected according to the granularity of the subtree, and the resource data are stored in the data set, so that the reference resource data of the physical planning tree to be executed subsequently can be accurately and reliably obtained through the data set, as described below.

Step S103, performing first matching processing of resource data on the N sub-trees in the data set, wherein the first matching processing is used for screening matched resource data for the N sub-trees through similarity between the N sub-trees and the M historical sub-trees.

Specifically, the processing device may perform a matching process of the resource data on the N sub-trees in the data set, and the matching process may be referred to as a first matching process, where the first matching process may be used to screen the matching resource data for the N sub-trees from the data set by using similarities between the N sub-trees and the N history sub-trees.

During this first matching process, the more similar the resource data of the historical sub-tree is between the dataset and the sub-tree of the candidate plan tree, the more the resource data of the historical sub-tree matches the sub-tree of the candidate plan tree. The first matching result may be obtained through the first matching process, where the first matching result may include resource data that is screened from the dataset and matches a subtree of the N subtrees, or may include resource data that is not screened from the dataset and matches a subtree of the N subtrees. The specific implementation of the first matching process may be referred to as the description of the corresponding embodiment of fig. 6 below.

Step S104, determining the reference resource data of the N sub-trees based on the first matching result obtained by the first matching process.

Specifically, the processing device may determine, through the first matching result obtained by the first matching process, respective reference resource data of each sub tree in the N sub trees, where one sub tree may have one reference resource data, and the reference resource data of one sub tree is the resource data that is required to be consumed by the referenced sub tree in the actual execution, that is, the reference resource data of one sub tree may be the resource data that is evaluated in advance for the sub tree and may be consumed by the sub tree in the actual execution.

The specific process of determining the reference resource data of the N subtrees of the candidate plan tree according to the first matching result obtained by the first matching process may be referred to as a specific description in the corresponding embodiment of fig. 10 below.

Step S105, the execution cost corresponding to the candidate plan tree is estimated by adopting the reference resource data of the N subtrees.

Specifically, the processing device may use the obtained reference resource data of the N subtrees to estimate and obtain execution costs corresponding to candidate planning trees to which the N subtrees belong.

The process of obtaining the execution cost corresponding to the candidate planning tree to which the N subtrees belong through the pre-estimation of the reference resource data may include that the processing device may obtain a cost pre-estimation model, which may be simply referred to as a cost model (may be a CBO model), and the cost pre-estimation model may be a trained model that may be used for performing cost pre-estimation on the physical planning tree. The processing device may input the reference resource data of the N sub-trees into the cost estimation model, so as to invoke the cost estimation model to estimate the execution cost corresponding to the candidate planning tree through the reference resource data of the N sub-trees.

In one embodiment, the cost estimation model can estimate the execution cost corresponding to the current operator node through the execution cost corresponding to the previous operator node of the current operator node and the reference resource data of the current operator node, and estimate the execution cost corresponding to each operator node of the candidate plan tree from bottom to top by adopting the principle, so that the corresponding execution cost estimated to the top operator node (i.e. the root node) in the candidate plan tree can be used as the estimated execution cost of the whole candidate plan tree.

In the application, the reference resource data of the subtree can be obtained by the granularity of the subtree, so the cost estimation model can also change the execution cost corresponding to the original estimated operator node by the granularity of the operator node into the execution cost corresponding to the subtree by the granularity of the subtree, and the principle can also be to estimate the execution cost corresponding to the current subtree by the execution cost corresponding to the last subtree of the current subtree and the reference resource data of the current subtree. The last estimated subtree in the method is also the root node of the candidate planning tree, so that the estimated execution cost of the root node can be used as the estimated execution cost of the whole candidate planning tree.

Referring to fig. 4, fig. 4 is a schematic structural diagram of a candidate planning tree according to an embodiment of the present application. As shown in fig. 4, the candidate plan tree may include operator nodes "Project (id, name, subject, score)" (a projection operator, where id, name, subject, score may be called data objects, or may be called fields in a data table), operator nodes "SortMergeJoin (a.id=b id)" (a join operator, which may be used to perform a table join operation), operator nodes "Project (id, name)" (a projection operator, id and name represent called fields or data objects), operator nodes "Filter (generator= MALE AND AGE > 18)" (a Filter operator, where "generator= MALE AND AGE >18" is a Filter operator, where Filter operator is a female whose Filter age is greater than 18), operator nodes "TableScan (student)" (a data scanner operator, which may be simply called data objects, where student nodes "Project (subject, subject) or" Filter operator (Filter operator) 54, and "Filter operator (Filter operator) 4 (Filter operator) represent called fields or" Filter operator (Filter operator) 4 ".

The operator node "Project (id, name, subject, score)" is the root node of the candidate planning tree. Operator node "SortMergeJoin (a.id=b.id)" is a branch node of the candidate planning tree under which two branches are connected, including branch 1 (i.e., left branch) and branch 2 (right branch) here. The branch 1 and the branch 2 comprise 3 operator nodes.

In the embodiments corresponding to fig. 6 and fig. 10 described below, if the resource data matched with the subtree 3 is not screened from the data set, the execution cost corresponding to each operator node in the subtree 3 may be estimated in the subtree 3 in a bottom-up manner according to the granularity of the operator nodes, and the estimated execution cost of the uppermost operator node "Project (id, name)" in the subtree 3 may be used as the execution cost corresponding to the subtree 3, for the subsequent estimation of the execution cost corresponding to the subtree 2. And, the subtree 4 is the same, or the subtree containing a plurality of operator nodes and matching resource data is not screened out.

Here, if the resource data consumed by screening a history subtree is matched with the subtree 3, that is, the history subtree and the subtree 3 can be regarded as the same, the resource data consumed by the history subtree in the dataset can be directly taken as the reference resource data of the subtree 3.

In the application, the reference resource data of the subtree can be obtained according to the granularity of the subtree, so that the execution cost corresponding to the subtree can be estimated according to the granularity of the subtree, thereby improving the estimation efficiency of the execution cost corresponding to the candidate planning tree.

The processing device may obtain an execution cost corresponding to each candidate planning tree of the operation sentence according to the above-described principle, where one candidate planning tree may have a corresponding execution cost, as the name implies, an execution cost corresponding to one candidate planning tree is a cost required to execute the candidate planning tree, and the smaller the execution cost is, the less overhead (such as overhead of computer resources, or time overhead) is required to execute the operation sentence. Therefore, the processing device may select, from the plurality of candidate plan trees of the operation sentence, a candidate plan tree having the smallest corresponding execution cost as a target plan tree that is the operation sentence, that is, a physical plan tree that is ultimately used to execute the operation sentence, and the target plan tree may be an optimal plan tree of the operation sentence. The processing device may use the target plan tree to execute the operation statement, and execute the target plan tree, that is, finally execute the operation statement, for example, by executing the target plan tree, a corresponding operation (such as querying, adding, deleting, modifying, etc.) of the operation statement on the data in the data table may be implemented.

The resource data consumed by each history subtree in the data set can have a mapping relation with the standard representation information of each history subtree, and the standard representation information of one history subtree is the information obtained by carrying out standardized representation processing on the history subtree. For a specific process of obtaining the standard representation information of the history subtree, reference may be made to the following description of obtaining the standard representation information of N subtrees in the corresponding embodiment of fig. 6.

Any sub-tree of the target plan tree may be referred to as a second sub-tree, and after the target plan tree is executed, if the data set does not include standard representation information of the second sub-tree (the standard representation information actually refers to standard representation information of a history sub-tree similar to the second sub-tree that has been executed historically), the second sub-tree may be regarded as a new history sub-tree, and resource data actually consumed by the second sub-tree during the actual execution may be newly added to the data set, as described below.

Therefore, after the target plan tree is executed, if the data set does not include standard representation information of the second subtree, which indicates that the data set does not include resource data actually consumed by a history subtree similar to the second subtree, which is executed historically, the processing device may acquire the resource data actually consumed by the second subtree in the executing process after the second subtree is executed. The processing device may construct a mapping relationship between the resource data consumed by the second subtree and the standard representation information of the second subtree, and may add the constructed mapping relationship to the dataset, so that the dataset may include the resource data consumed by the second subtree in the actual execution process.

If the data set includes standard representation information of the second subtree (the standard representation information is actually standard representation information of a history subtree similar to the second subtree after history execution), which indicates that the data set already includes resource data actually consumed by the history subtree similar to the second subtree after history execution, the processing device does not need to repeatedly add the resource data actually consumed by the second subtree to the data set again after executing the second subtree.

In one embodiment, the manner of determining whether the data set contains the standard representation information of the second subtree may include:

The dataset may contain M standard representation information for the M history sub-trees described above. The processing device may obtain the information similarity between the standard representation information of the second subtree and the M standard representation information, where the principle of obtaining the information similarity between the standard representation information of the second subtree and the M standard representation information is the same as the principle of obtaining the information similarity between the standard representation information of the first subtree and the M standard representation information in the embodiment corresponding to fig. 6 described below, and may be specifically referred to the related description in the embodiment corresponding to fig. 6 described below. Wherein the second subtree may be the same or different from the first subtree.

The processing device may acquire standard representation information having the greatest information similarity with standard representation information of the second subtree from the M pieces of standard representation information, and may refer to the standard representation information having the greatest information similarity with standard representation information of the second subtree as reference representation information. That is, the reference representation information is one of the M standard representation information having the greatest degree of similarity with the standard representation information of the second sub-tree.

If the information similarity between the standard representation information of the second sub-tree and the reference representation information is greater than or equal to the set similarity threshold, the standard representation information of the second sub-tree can be regarded as identical to the reference representation information, i.e. the second sub-tree is regarded as identical to the history sub-tree to which the reference representation information belongs, and the standard representation information of the second sub-tree can be considered as included in the data set. The similarity threshold may be a minimum information similarity set to evaluate that the standard representation information is the same, and the specific value of the similarity threshold may be determined according to the actual application scenario.

If the information similarity between the standard representation information of the second sub-tree and the reference representation information is smaller than the set similarity threshold, the standard representation information of the second sub-tree and the reference representation information can be regarded as different, i.e. the second sub-tree is regarded as different from the history sub-tree to which the reference representation information belongs, and the standard representation information of the second sub-tree is not included in the data set.

In one embodiment, the process of obtaining the resource data actually consumed by the second subtree in the execution process by the processing device may include that after the second subtree is executed, the processing device may obtain input/output data (i.e. IO data) of the last operator node in the second subtree, and may use the input/output data of the last operator node as the input/output data of the second subtree, where the input/output data may include the input data amount and the output data amount of the last operator node, or may further include the logic interaction times (such as the logic data interaction times) of the operator node. The last operator node may be the lowest operator node in the second sub-tree. In one embodiment, the input/output data of the subtree may include input/output data of the subtree for a disk, and input/output data of the subtree for a network (Net), where the input/output data of the subtree for the disk may represent a logical interaction number of the subtree on the disk, and the input/output data of the subtree for the network may represent a logical interaction number of the subtree on the network and a data transmission amount (e.g., may include an input data amount and an output data amount).

The processing device may further obtain memory data consumed by each operator node in the second subtree in the running process, where the memory data consumed by one operator node may be memory data occupied by the operator node in the running process, where the memory data may be a memory size occupied by the operator node in the running process, for example, a unit of the memory data may be GB (gigabytes), etc. The processing device may use the memory data consumed by the most operator nodes in the second subtree as the memory data consumed by the second subtree, where the memory data consumed by the most operator nodes in the second subtree may be the memory data consumed by the most operator nodes in the second subtree, that is, the memory data is the peak value of the memory data consumed by each operator node in the second subtree.

And the processing device may obtain processor data (may be referred to as CPU data) consumed by each operator node in the second subtree, where the processor data consumed by one operator node may be a duration occupied by the operator node for a CPU (processor) in the running process. The processing device may use the processor data consumed by the most significant processor nodes in the second subtree as the processor data consumed by the second subtree, where the processor data consumed by the most significant processor nodes in the second subtree may be the processor data consumed by the most significant (i.e., longest) operator nodes in the second subtree, and the processor data may be a peak value of the processor data consumed by each operator node in the second subtree.

The processing device may use the obtained input/output data of the second subtree, the memory data consumed by the second subtree, and the processor data consumed by the second subtree as the resource data consumed by the second subtree, that is, the resource data of the second subtree may include the input/output data of the second subtree, the memory data consumed by the second subtree, and the processor data consumed by the second subtree.

More, the operator nodes contained in the operation statement can be diagnosed, for example, the operator nodes can be diagnosed by the input data quantity, the input record quantity, the output data quantity and the output record quantity of the operator nodes, so as to judge whether the operator nodes have unreasonable or abnormal positions in arrangement, for example, if the difference between the input data quantity and the output data quantity of the operator nodes is too large (for example, the difference is larger than a certain data quantity threshold value), the operator nodes can be considered to have unreasonable positions, diagnostic information aiming at the operator nodes can be generated, and the diagnostic information can also comprise the direction and the suggestion for optimizing the operator nodes. Referring to fig. 5, fig. 5 is an interface schematic diagram of a diagnostic interface for operator nodes according to an embodiment of the present application. As shown in fig. 5, operator name "SortMergeJoin" of an operator node, input data amount "57245 bytes (Byte)", input record number "30995894 lines", output data amount "57245 bytes (Byte)", output record number "5 lines", and diagnostic information may be contained in the diagnostic interface. The diagnosis information may include a large table front-end "the Join phase (table connection phase), a large right table input, a left table input of 5 rows and a right table input of 30995889 rows, which are suitable for the front-end as a Base table (Base table). "and MapJoin diagnostics" left table is suitable for use with MapJoin (an optimization technique) ".

The specific logic (may be referred to as diagnostic logic or detection logic) for diagnosing the operator node may be flexibly set according to the actual application requirement.

According to the application, through the similar matching mechanism among subtrees, matched resource data is screened for the subtrees in the data set, and decisions can be provided for root cause analysis during cost prediction of the physical planning tree, so that accurate and rapid cost prediction of the physical planning tree is realized. In addition, in the application, the sub-tree is matched through the physical planning tree produced by the logic planning tree, so that the information pull-through between the logic planning tree and the physical planning tree can be realized, and the sub-tree matching of the physical planning tree can be understood as the matching of the logic planning tree.

Referring to fig. 6, fig. 6 is a flowchart of a first matching process for performing resource data on N sub-trees in a data set according to an embodiment of the present application. The embodiment of the application specifically describes a process of carrying out first matching processing on resource data on N sub-trees in a data set to obtain a first matching result. As shown in fig. 6, the process may include:

In step S201, standardized representation processing is performed on each of the N subtrees, and standard representation information of each subtree is generated.

Specifically, the processing device may perform a normalized representation process (may also be referred to as a normalization process) on each of the N subtrees of the candidate planning tree to generate normalized representation information of each subtree, and one subtree may have one normalized representation information.

Any one of the N sub-trees may be referred to as a first sub-tree, and the following description will specifically take, as an example, a normalized representation process performed on the first sub-tree to generate normalized representation information of the first sub-tree. It should be noted that, the principle of performing the standardized representation processing on each of the N subtrees to generate the respective standard representation information of each subtree is independent and the same.

The processing device may perform data parsing processing on the first subtree from K data fields to obtain representation data of the first subtree in the K data fields, where K is a positive integer, and the K data fields are preset fields for performing standardized representation on the subtree. If the first sub-tree does not parse out the corresponding representation information in a certain data field of the K data fields, the representation information of the first sub-tree in the data field may be represented as null (null).

The processing device may perform a representation process on the representation data of the first subtree under the K data fields according to the set standard format to generate standard representation information of the first subtree. The standard representation information of the first subtree can include data obtained by representing the representation data of the first subtree under the K data fields according to the standard format, and the standard representation information of the first subtree can be understood as being obtained by representing the first subtree according to the unified standard.

The K data fields can comprise at least one operator field corresponding to an operator called by a subtree, an object field corresponding to a data object in table data called by the subtree, and a condition field corresponding to a condition contained in the operator called by the subtree. I.e. the standardized representation processing of the subtrees may contain abstractions over the K data fields.

The operators called by the subtree can be operators contained in the subtree, such as TableScan operators (table scanning operators), filter operators (filtering operators), project operators (projection operators) and the like. The full amount of each operator under the operator domain may belong to each element under the operator domain.

The table data called by the subtree in the data table may be the data called by the subtree in the data table, the table data may be the whole data table or may be one or more partitions in the data table, the table data may contain one or more data objects (i.e. objects), and a data Object may be an entity in the data table. For example, a data object may be an employee, a commodity, a customer, etc., and a data object may generally correspond to a row of data in a data table in which the object properties of the data object may be recorded. The full amount of individual data objects under an object domain may belong to individual elements under the object domain.

The conditions contained by the operator called by the subtree can be the conditions set in the operator. For example, for a Filter operator, filter (generator= 'rule' and age > 18), the condition included in the Filter operator is "generator= 'rule' and age >18", which indicates gender and age greater than 18. The full amount of each condition under a condition field may belong to each element under the condition field.

Thus, the representation data of the first subtree under the operator domain may include the respective operators invoked in the first subtree, which may be referred to as respective elements of the representation data of the first subtree under the operator domain, which each belong to an element under the operator domain. The representation data of the first subtree under the object domain may include respective data objects in the table data called by the first subtree, and similarly, the respective data objects may be referred to as respective elements of the representation data of the first subtree under the object domain, where the respective elements belong to the elements under the object domain. The representation information of the first subtree under the condition field may include various conditions contained by the operator invoked in the first subtree, which may also be referred to as various elements of the representation data of the first subtree under the condition field, which are all elements under the condition field.

The standard format may include an arrangement order set by a standard for each of the elements in each data field (which may refer to a total number of the elements in each data field), a writing format (e.g., a uniform case, a uniform font type text, etc.), and the like. That is, each element in the same data field can have a preset and uniform arrangement sequence and a uniform writing format.

Therefore, the above-mentioned representing the representing data of the first subtree under K data fields according to the set standard format may mean that the representing data of the first subtree under each data field is represented according to the respective unified arrangement order and unified writing format of the elements under each data field, and the representing data under each data field may also have a fixed representing order. For example, elements of the first subtree under each data field may be sequentially represented according to the order of the computing field, the object field and the condition field, and the standard representation information of the first subtree may be generated and obtained by respectively and uniformly arranging the elements and writing the elements.

The processing device may obtain the standard representation information of each of the N subtrees according to the principle of obtaining the standard representation information of the first subtree. The normalized representation process may be referred to as normalization process, and is a representation for performing a unified format on each subtree, and the standard representation information obtained by performing the normalized representation process on the subtree may be referred to as a signature of the subtree, and is used to identify the subtree.

More, in one embodiment, since resource data in the dataset may be collected by multiple SQL calculation engines, and the expression forms of the same operator in different SQL calculation engines may be different, before the standardized representation processing is performed on the subtree, the standardized representation processing (i.e., normalization processing) may be performed on each operator node in the subtree, for example, physical execution operators (i.e., operators in the physical planning tree) in different SQL calculation engines may be unified to relational algebra operators (may be referred to as standard format operators), that is, normalization may be performed by using the theory of "relational algebra" as a standard. For example, the operator nodes such as BroadCastJoin (broadcast connection operator), sortMergeJoin (sequence merging connection operator), hashJoin (hash connection operator) and the like can be unified and normalized into Join (connection operator), and the Scan operator (a scanning operator) of various SQL calculation engines can be unified and classified into projection operator and the like.

In other words, the N subtrees of the present application may be obtained by uniformly representing each operator node in N initial subtrees of the candidate plan tree divided into relational algebra operators, and the present application may be obtained by uniformly representing the operator nodes in the subtrees of the physical plan tree divided into relational algebra operators, and then performing standardized representation processing to generate standard representation information of the subtrees. Therefore, the accuracy of standardized representation processing of the subtree can be ensured, and more accurate standard representation information of the subtree is obtained, so that the accuracy of matching the resource data in the data set is ensured.

In the application, by representing each subtree as the standard representation information of each subtree, the same subtree can be removed, and the problem that the subtree is misused due to different representation modes can be solved, so that the processing equipment can evaluate the similarity among the subtrees more accurately later.

Referring to fig. 7, fig. 7 is a schematic diagram of a framework for collecting resource data consumed by a history sub-tree according to an embodiment of the present application. As shown in FIG. 7, the present application can collect resource data consumed by the historical subtree by a collector (which may also be referred to as a data collector, and may belong to an SDK or a software package), where the resource database, that is, the data set, may include resource data collected from the historical subtree. In the application, the collector can analyze and acquire the resource data actually consumed by each subtree in the physical planning tree currently executed (with small delay and real time) by the SQL calculation engine (one or more SQL calculation engines can be distributed) through the real-time operation entrance in a streaming mode. The executed physical planning tree can be used as the history planning tree, the subtree of the history planning tree is the history subtree, and the collector can store the mapping relation between the collected resource data actually consumed by each subtree in the physical planning tree and the standard representation information of each subtree into the resource database.

And, the collector can analyze and acquire the resource data actually consumed by each sub-tree in the physical plan tree executed before (such as a long time before the current time, such as a specified time interval) through the offline operation entrance in a batch mode. The executed physical planning tree may be the history planning tree, and the subtree of the history planning tree is a history subtree, and the collector may store the mapping relationship between the collected resource data actually consumed by each subtree in the physical planning tree and the standard representation information of each subtree in the resource database.

In one embodiment, the collector may obtain, in real time or offline, metadata (index data for measuring or analyzing data) of the SQL statement during operation, where the metadata may include resource data consumed by each operator node in a physical planning tree of the SQL statement during execution, and the collector may obtain, via resource data consumed by each operator node during execution, resource data consumed by each sub tree of the physical planning tree. By way of example, SQL compute engines may include Spark (a fast, general purpose big data processing engine), prest (a distributed SQL query engine), starrocks (a distributed SQL database that faces online analytics processing), and so forth.

Through the process, the resource data of each subtree of the physical planning tree executed by each SQL calculation engine can be synchronously acquired in parallel through two acquisition modes of real-time acquisition and off-line acquisition, wherein the real-time acquisition can be used as a main acquisition mode, the off-line acquisition can be used as an auxiliary acquisition mode, and the problem that the resource data consumed by a large number of historical subtrees is not acquired in real-time acquisition can be solved through the off-line acquisition mode. The method and the device can efficiently and fully acquire the resource data actually consumed by each subtree of the historically executed physical planning tree by adopting two acquisition modes of real-time acquisition and off-line acquisition in parallel, and are used for carrying out the estimation of the corresponding execution cost of other subsequent physical planning trees.

Step S202, based on the standard representation information of the N subtrees, performing first matching processing of resource data on the N subtrees in the data set.

Specifically, the processing device may perform the first matching processing of the resource data on the N subtrees in the data set through the standard representation information acquired on the N subtrees, so as to obtain a first matching result, as described below.

The resource data consumed by each history subtree in the data set may have a mapping relationship with the standard representation information of each history subtree, that is, the data set may store a mapping relationship between the M history subtrees and the standard representation information of each history subtree, the data set may not directly include the history subtree, but may refer to the history subtree through the standard representation information of the history subtree, and the data set may include the standard representation information of each history subtree.

The standard representation information of the subtree is obtained by carrying out standardized representation on the subtree, and the arithmetic logic actually contained in the subtree is not modified, so that the standard representation information of the subtree can well refer to and represent the subtree. Therefore, the similarity between the standard representation information of the N subtrees and the standard representation information of the M historical subtrees may be used to reflect (i.e. represent) the similarity between the N subtrees and the M historical subtrees, and in short, the similarity between the standard representation information of the subtrees may be used to reflect the similarity between the subtrees.

The first matching process may include a matching process of the resource data for each of the N subtrees in the data set, and principles of the matching process of the resource data for each of the subtrees are independent and the same. Therefore, any one of the N sub-trees may be the first sub-tree, and a process of performing the matching processing of the resource data on the first sub-tree in the data set will be specifically described below as an example.

The dataset may include M standard representation information for the M history sub-trees described above, one history sub-tree having one standard representation information. The processing device may acquire similarities (may be referred to as information similarities) between the standard representation information of the first subtree and the M standard representation information, respectively, and the standard representation information of the first subtree may have an information similarity with one standard representation information of the M standard representation information.

The manner in which the processing device converts the standard representation information into the corresponding feature vectors may include, for example, the processing device may obtain a feature embedding model, which may be text information, and thus, a text feature embedding model, which may be a trained model that may be used to generate feature vectors for text. Therefore, the processing device may call the text feature embedding model to perform feature embedding processing (may also be referred to as feature extraction processing) on the standard representation information, and may generate a feature vector corresponding to the standard representation information.

The processing device may acquire standard representation information with the maximum information similarity with the standard representation information of the first subtree from the M standard representation information, and call the standard representation information with the maximum information similarity with the standard representation information of the first subtree as the undetermined representation information. That is, the undetermined representation information is one of the M standard representation information having the greatest information similarity with the standard representation information of the first sub-tree.

If the information similarity between the standard representation information of the first subtree and the undetermined representation information is greater than or equal to the set similarity threshold, the standard representation information of the first subtree and the undetermined representation information can be regarded as the same, namely the first subtree and the history subtree to which the undetermined representation information belongs are regarded as the same, and at the moment, the processing equipment can use the resource data with the mapping relation between the data set and the undetermined representation information as the screened resource data matched with the first subtree. The first match result may include the filtered resource data that matches the first subtree. The similarity threshold may be a minimum information similarity set to evaluate that the standard representation information is the same, and the specific value of the similarity threshold may be determined according to the actual application scenario.

The processing device may perform the matching processing of the resource data on the N sub-trees according to the same principle of the matching processing of the resource data on the first sub-tree, so as to obtain a matching result for each sub-tree, where the matching result of one sub-tree may be the resource data that is screened out and matched with the sub-tree, or the matching result of one sub-tree may not be screened out and the first matching result may include the respective matching result for each sub-tree in the N sub-trees.

Referring to fig. 8, fig. 8 is a schematic diagram of a framework for executing an operation sentence according to an embodiment of the present application. As shown in fig. 8, the data collector may collect subtree operation data, which is resource data actually consumed by the history subtree in the history execution process, and may be included in the data set. The data collector may further obtain metadata and statistical information of the data table, where the metadata of the data table may define each partition of the data table, and the statistical information of the data table may include a data amount of the data table and a data amount of each partition in the data table.

And if the derivative operator is actually executed subsequently, the resource data consumed by the derivative operator in the actual execution process can be supplemented to the data table, and the data table already contains the input and output data of the derivative operator, so that the supplemented resource data can comprise the memory data and the processor data actually consumed by the derivative operator in the execution process.

Referring to fig. 8, the SQL calculation engine may include an SQL engine connector (for connecting with a calculation engine), a parser (for parsing an SQL statement), an optimizer (for optimizing an SQL statement), e.g., join (a table connection) may be optimized to Mapjoin (an optimization technique), an executor (for executing an SQL statement), and an algorithm service connector, which may be added in the present application, and is used for connecting to an algorithm service (which may be an algorithm provided in the present application for filtering resource data matching a subtree, the principle may be as described in the above description of the embodiment of the present application) to filter resource data matching a subtree of a physical plan tree of an SQL statement that is currently required to be executed from a dataset through the algorithm service.

Referring to fig. 9 again, fig. 9 is a schematic diagram of a scenario for performing resource data matching according to an embodiment of the present application. As shown in fig. 9, the data collector may read the resource data actually consumed by the executed history planning tree in the execution process from the SQL calculation engine, and may obtain the resource data consumed by each operator node in the history planning tree through the resource data, so that the resource data consumed by each history subtree of the history planning tree may be obtained through the resource data consumed by each operator node, and the data collector may map the resource data consumed by each history subtree into the data set through the standard representation information of each history subtree.

Subsequently, the data collector may acquire a new subtree signature (i.e., standard representation information of the subtree), where the new subtree signature may be the standard representation information of the N subtrees, and the data collector may perform matching and recall (e.g., the first matching process) of the subtrees in the dataset through a similarity algorithm (e.g., a similarity algorithm between standard representation information) by using the new subtree signature, so as to filter resource data matched with the N subtrees to perform estimation of corresponding execution costs of the candidate planning tree.

The method provided by the application converts the subtrees of the physical planning tree of the SQL statement into standard representation information for representation, opens up the correlation between the SQL statements, and pulls through the subtree data (such as the resource data consumed by the subtree) among different SQL statements, and supports the cost optimization and the prediction of the SQL in the CBO model through the subtree data, thereby realizing the accurate evaluation of the similarity among the subtrees, and further accurately screening the resource data matched with the subtrees from the data set. In addition, by adopting the method, mapjoin operators are supported in CBO optimization of internal SuperSQL (an SQL computing platform), and in test verification of prest (an open-source distributed SQL query engine) and TPC-DS (a decision support system benchmark test), 99 SQL sentences covering a standard test set (for example, matching resource data are screened by subtrees of a physical planning tree of the 99 SQL sentences) are obtained, so that excellent test effects are obtained.

Referring to fig. 10, fig. 10 is a flowchart illustrating a method for determining reference resource data of N subtrees according to a first matching result according to an embodiment of the present application. Since the principle of acquiring the reference resource data of each of the N subtrees through the first matching result is the same, the following will specifically describe a process of acquiring the reference resource data of the first subtree. As shown in fig. 10, the process may include:

in step S301, if the first matching result includes the screened resource data matched with the first subtree, the screened resource data matched with the first subtree is used as the reference resource data of the first subtree.

Specifically, if the first matching result includes the screened resource data matched with the first subtree, the processing device may directly use the screened resource data matched with the first subtree as the reference resource data of the first subtree.

In step S302, if the first matching result includes not screening out the resource data matching the first subtree, the reference resource data of the first subtree is determined based on the number of operator nodes included in the first subtree.

Specifically, each of the N sub-trees may be formed by one or more operator nodes of the candidate planning tree, so if the first matching result includes resource data that is not screened to match the first sub-tree, the processing device may determine, by using the number of operator nodes included in the first sub-tree, reference resource data of the first sub-tree, as described below.

When the first sub-tree contains one operator node of the candidate plan tree (i.e., when the first sub-tree contains one operator node), the processing device may obtain resource data consumed by the table data called by the first sub-tree, and may use the resource data consumed by the table data called by the first sub-tree as the reference resource data of the first sub-tree. The table data called by the first subtree can be the whole data table or at least one partition in the data table, and the resource data consumed by the data table or the resource data consumed by each partition in the data table can be obtained by respectively setting the data table and each partition in the data table in advance. The resource data consumed by the partitions in the data table may be obtained by analyzing and counting data attributes such as data scale, data distribution condition, field type and the like of the data contained in the partitions in the data table/data table in advance, and the specific analysis and counting business logic may be configured correspondingly by a developer according to an actual application scenario.

And when the first subtree contains a plurality of operator nodes of the candidate plan tree (i.e., when the first subtree contains a plurality of operator nodes), the processing device may perform a second matching process on the resource data for the first subtree in the data set, and a matching result obtained by the second matching process may be referred to as a second matching result. The processing device may determine the reference resource data of each operator node in the first subtree by means of the second matching result obtained by the second matching process, in which case the reference resource data of the first subtree may comprise the reference resource data of each operator node in the first subtree. The second matching process is used to screen the data set for each operator node in the first subtree for matching resource data by similarity between each operator node in the first subtree and the M historical subtrees.

The principle of the second matching process is substantially the same as that of the first matching process, that is, the principle of screening matched resource data for each operator node in the first subtree in the dataset by the similarity between each operator node in the first subtree and M history subtrees is the same as that of screening matched resource data for the N subtrees in the dataset by the similarity between the N subtrees and the M history subtrees, except that the process of the second matching process replaces the N subtrees with the respective operator nodes in the first subtree compared with the process of the first matching process. Each operator node in the first subtree can be respectively used as a corresponding subtree, however, only one operator node can be contained in one subtree, namely, one operator node in the first subtree can be regarded as the subtree containing only the operator node. Therefore, according to the same principle that the matched resource data is screened for the N subtrees in the data set through the similarity between the N subtrees and the M historical subtrees, the matched resource data is screened for the plurality of subtrees in the data set through the similarity between the plurality of subtrees (namely the plurality of operator nodes contained in the first subtree) of the first subtree and the M historical subtrees, so as to obtain a second matching result.

The second matching result may also include a matching result for each operator node in the first subtree, where the matching result for an operator node may be filtered out resource data that matches the operator node, or may be unfiltered resource data that matches the operator node.

If the second matching result includes the screened resource data matched with the target operator node, the screened resource data matched with the target operator node can be directly used as the reference resource data of the target operator node. In this case, it can be regarded that the history subtree to which the resource data matched with the target operator node by the filtering belongs is identical to the target operator node.

If the second matching result includes not screening the resource data matched with the target operator node, the processing device may acquire the resource data consumed by the table data called by the target operator node, and may use the resource data consumed by the table data called by the target operator node as the reference resource data of the target operator node. Similarly, the table data called by the target operator node may be the whole data table or the partition in the data table, and the resource data consumed by the data table or the partition in the data table may be preset.

When the first subtree includes a plurality of operator nodes of the candidate plan tree, the reference resource data of the first subtree includes the reference resource data of each operator node in the first subtree, the cost estimation model may also estimate, according to the original principle, the execution cost corresponding to the current operator node in the first subtree through the execution cost corresponding to the last operator node of the current operator node and the reference resource data of the current operator node, so that the estimated execution cost of the topmost operator node in the first subtree is used as the execution cost of the first subtree, so as to estimate the corresponding execution cost of the subsequent subtree (such as the next subtree of the first subtree).

Referring to fig. 11, fig. 11 is a schematic structural diagram of a subtree structure of a first subtree according to an embodiment of the application. It is assumed that the first subtree contains here operator node 1 "project" (a projection operator), operator node 2 "filter" (a filter operator) and operator node 3 "tablescan" (a scanning operator). Because the first subtree is not screened for the matched resource data, the execution cost can be estimated in the first subtree or at the granularity of the operator nodes. If the processing device can estimate the execution cost corresponding to the operator node 3 through the reference resource data of the operator node 3, can estimate the execution cost corresponding to the operator node 2 through the execution cost corresponding to the operator node 3 and the reference resource data of the operator node 2, and can estimate the execution cost corresponding to the operator node 1 through the execution cost corresponding to the operator node 2 and the reference resource data of the operator node 1. Therefore, the execution cost corresponding to the operator node 1 can be used as the estimated execution cost for the first subtree.

According to the method provided by the application, the resource data of the subtree can be accurately acquired and acquired according to the granularity of the subtree (including the granularity of the operator nodes, belonging to small granularity), but not according to the granularity of the data table or the partition of the data table (belonging to large granularity), so that the execution cost corresponding to the subtree can be accurately estimated according to the granularity of the subtree, the estimation of the execution cost corresponding to the whole candidate planning tree is finally realized, and the accuracy of the estimation of the execution cost corresponding to the candidate planning tree is improved. And the reference resource data of the subtree is acquired through the actually consumed resource data acquired by the history subtree, so that the convenience and reliability of acquiring the reference resource data of the subtree are also ensured, and the performance of executing the SQL sentence is greatly improved.

Referring to fig. 12, fig. 12 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. As shown in fig. 12, the data processing apparatus 120 may include a first acquisition module 1201, a second acquisition module 1202, a matching module 1203, a determination module 1204, and an estimation module 1205.

A first obtaining module 1201, configured to obtain a candidate planning tree of an operation sentence and N sub-trees of the candidate planning tree, where N is a positive integer, and the candidate planning tree is used to execute the operation sentence;

A second obtaining module 1202, configured to obtain a dataset, where the dataset includes resource data consumed by M history subtrees in a history execution process, the history subtrees are subtrees of a history plan tree that has been executed historically, and M is a positive integer;

The matching module 1203 is configured to perform a first matching process of resource data on the N sub-trees in the dataset, where the first matching process is configured to screen matched resource data for the N sub-trees through similarities between the N sub-trees and the M history sub-trees;

a determining module 1204, configured to determine reference resource data of the N sub-trees based on a first matching result obtained by the first matching process;

And the estimating module 1205 is configured to estimate an execution cost corresponding to the candidate planning tree by using the reference resource data of the N subtrees.

In one embodiment, the matching module 1203 performs a first matching process of the resource data on the N sub-trees in the data set, including:

The matching module 1203 performs a first matching process of resource data on N subtrees in the data set based on the standard representation information of the N subtrees, including:

In one embodiment, the matching module 1203 is further configured to:

In one embodiment, the N sub-trees are each composed of one or more operator nodes of the candidate planning tree, and the determining module 1204 determines the manner of the reference resource data of the N sub-trees based on the first matching result obtained by the first matching process, including:

In one embodiment, the determining module 1204 determines, based on the number of operator nodes included in the first sub-tree, a manner of referencing resource data of the first sub-tree, including:

In one embodiment, any operator node in the first subtree is a target operator node, and the determining module 1204 determines a manner of referencing resource data of each operator node in the first subtree based on a second matching result obtained by the second matching process, including:

In one embodiment, any one of the N subtrees is a first subtree, and the matching module 1203 performs standardized representation processing on each of the N subtrees to generate standardized representation information of each subtree, including:

In one embodiment, the first obtaining module 1201 obtains a flow of N subtrees of the candidate plan tree, including:

In one embodiment, the plurality of candidate planning trees of the operation sentence is provided, and the data processing apparatus further includes an execution module 1206, where the execution module 1206 is configured to:

And executing the operation statement by adopting the target plan tree.

after executing the target plan tree, the execution module 1206 is further configured to:

In one embodiment, the manner in which the execution module 1206 obtains the resource data consumed by the second sub-tree during execution includes:

In one embodiment, the first obtaining module 1201 obtains a flow of a candidate planning tree of an operation sentence, including:

In one embodiment, the predicting module 1205 adopts reference resource data of N subtrees to predict an execution cost corresponding to the candidate plan tree, including:

According to one embodiment of the present application, the steps involved in the data processing method shown in fig. 3 may be performed by respective modules in the data processing apparatus 120 shown in fig. 12. For example, step S101 shown in FIG. 3 may be performed by the first acquisition module 1201 in FIG. 12, step S102 shown in FIG. 3 may be performed by the second acquisition module 1202 in FIG. 12, step S103 shown in FIG. 3 may be performed by the matching module 1203 in FIG. 12, step S104 shown in FIG. 3 may be performed by the determination module 1204 in FIG. 12, and step S105 shown in FIG. 3 may be performed by the estimation module 1205 in FIG. 12.

The method and the device can acquire candidate plan trees of operation sentences and N sub-trees of the candidate plan trees, wherein N is a positive integer, the candidate plan trees are used for executing the operation sentences, a data set can be acquired, the data set comprises resource data consumed by M historical subtrees in the historical execution process, the historical subtrees are subtrees of the historical plan trees which are executed in the historical mode, M is a positive integer, first matching processing of the resource data can be conducted on the N sub-trees in the data set, the first matching processing is used for screening matched resource data for the N sub-trees through similarity between the N sub-trees and the M historical subtrees, accordingly, reference resource data of the N sub-trees can be determined based on first matching results obtained through the first matching processing, and execution costs corresponding to the candidate plan trees can be estimated through the reference resource data of the N sub-trees. Therefore, the device provided by the application can acquire the reference resource data of each subtree in the candidate plan tree by taking the subtree as granularity, and the reference resource data can be obtained by matching the resource data actually consumed by the history subtree in the actual history execution process, so that the accuracy and the reliability of the reference resource data acquired by each subtree are ensured, and the execution cost corresponding to the candidate plan tree can be accurately estimated through the accurate and reliable reference resource data of each subtree.

According to an embodiment of the present application, each module in the data processing apparatus 120 shown in fig. 12 may be separately or completely combined into one or several units to form a structure, or some (some) of the units may be further split into a plurality of sub-units with smaller functions, so that the same operation may be implemented without affecting the implementation of the technical effects of the embodiment of the present application. The above modules are divided based on logic functions, and in practical applications, the functions of one module may be implemented by a plurality of units, or the functions of a plurality of modules may be implemented by one unit. In other embodiments of the present application, the data processing device 120 may also include other units, and in practical applications, these functions may also be implemented with assistance from other units, and may be implemented by cooperation of a plurality of units.

In the present embodiment, the term "module" or "unit" refers to a computer program or a part of a computer program having a predetermined function and working together with other relevant parts to achieve a predetermined object, and may be implemented in whole or in part by using software, hardware (such as a processing circuit or a memory), or a combination thereof. Also, a processor (or multiple processors or memories) may be used to implement one or more modules or units. Furthermore, each module or unit may be part of an overall module or unit that incorporates the functionality of the module or unit.

According to one embodiment of the present application, a computer program capable of executing the steps involved in the respective methods shown in the embodiments of the present application may be run on a general-purpose computer device, which may contain a processing element and a storage element such as a Central Processing Unit (CPU), a random access storage medium (RAM), a read only storage medium (ROM), etc., to construct a data processing apparatus 120 as shown in fig. 12. The above-described computer program may be recorded on, for example, a computer-readable recording medium, and may be loaded into and executed in the above-described computer apparatus through the computer-readable recording medium.

Referring to fig. 13, fig. 13 is a schematic structural diagram of a computer device according to an embodiment of the application. As shown in fig. 13, the computer device 1000 may include a processor 1001, a network interface 1004, and a memory 1005, and in some embodiments, the computer device 1000 may also include a user interface 1003, and at least one communication bus 1002. Wherein the communication bus 1002 is used to enable connected communication between these components. The user interface 1003 may include a Display (Display), a Keyboard (Keyboard), and the optional user interface 1003 may further include a standard wired interface, a wireless interface, among others. The network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (non-volatile memory), such as at least one disk memory. The memory 1005 may also optionally be at least one storage device located remotely from the processor 1001. As shown in fig. 13, an operating system, a network communication module, a user interface module, and a device control application program may be included in the memory 1005, which is one type of computer storage medium.

In the computer device 1000 shown in fig. 13, the network interface 1004 may provide network communication functions, while the user interface 1003 is mainly used as an interface for providing input to a user, and the processor 1001 may be used to invoke a device control application program stored in the memory 1005 to realize:

It should be understood that the computer device 1000 described in the embodiments of the present application may perform the description of the data processing method in the embodiments of the present application, and may also perform the description of the data processing apparatus 120 in the embodiment corresponding to fig. 12, which is not repeated herein. In addition, the description of the beneficial effects of the same method is omitted.

In addition, it should be noted that the present application further provides a computer readable storage medium, and the computer readable storage medium stores a computer program, which when executed by a processor, can perform the description of the data processing method in each embodiment of the present application, and thus, a detailed description thereof will not be provided herein. In addition, the description of the beneficial effects of the same method is omitted. For technical details not disclosed in the embodiments of the computer storage medium according to the present application, please refer to the description of the method embodiments of the present application.

As an example, the above-described computer program may be deployed to be executed on one computer device or on multiple computer devices that are located at one site, or on multiple computer devices that are distributed across multiple sites and interconnected by a communication network, where the multiple computer devices that are distributed across multiple sites and interconnected by a communication network may constitute a blockchain network.

The computer readable storage medium may be an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart memory card (SMART MEDIA CARD, SMC), a Secure Digital (SD) card, a flash memory card (FLASH CARD), etc. that are provided on the computer device. Further, the computer-readable storage medium may also include both internal storage units and external storage devices of the computer device. The computer-readable storage medium is used to store the computer program and other programs and data required by the computer device. The computer-readable storage medium may also be used to temporarily store data that has been output or is to be output.

The present application provides a computer program product comprising a computer program stored in a computer readable storage medium. The processor of the computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program, so that the computer device performs the description of the data processing method in the embodiments of the present application, and thus, a detailed description thereof will not be provided herein. In addition, the description of the beneficial effects of the same method is omitted. For technical details not disclosed in the embodiments of the computer-readable storage medium according to the present application, please refer to the description of the method embodiments of the present application.

The terms first, second and the like in the description and in the claims and drawings of embodiments of the application are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the term "include" and any variations thereof is intended to cover a non-exclusive inclusion. For example, a process, method, apparatus, article, or device that comprises a list of steps or elements is not limited to the list of steps or modules but may, in the alternative, include other steps or modules not listed or inherent to such process, method, apparatus, article, or device.

Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The foregoing disclosure is illustrative of the present application and is not to be construed as limiting the scope of the application, which is defined by the appended claims.

Claims

1. A method of data processing, the method comprising:

Acquiring a candidate plan tree of an operation sentence and N sub-trees of the candidate plan tree, wherein the candidate plan tree is composed of the N sub-trees, any sub-tree comprises one or more operator nodes in the candidate plan tree, the N sub-trees are not overlapped, N is a positive integer, and the candidate plan tree is used for executing the operation sentence;

Acquiring a data set, wherein the data set comprises resource data consumed by M historical subtrees in a historical execution process, the resource data consumed by any historical subtree comprises input and output data, consumed memory data and consumed processor data of any historical subtree, the historical subtree is a subtree of a historical plan tree which is executed in a historical manner, and M is a positive integer;

Performing first matching processing on the resource data of the N sub-trees in the data set, wherein the first matching processing is used for screening matched resource data for the N sub-trees through the similarity between the N sub-trees and the M historical sub-trees;

Determining the reference resource data of the N subtrees based on a first matching result obtained by the first matching process;

2. The method of claim 1, wherein said performing a first matching process of resource data on said N sub-trees in said dataset comprises:

respectively carrying out standardized representation processing on each subtree in the N subtrees to generate standard representation information of each subtree;

performing the first matching processing of resource data on the N subtrees in the dataset based on standard representation information of the N subtrees;

The method comprises the steps of collecting the standard representation information of each historical subtree in a data set, wherein the resource data consumed by each historical subtree in the data set respectively has a mapping relation with the standard representation information of each historical subtree, and the similarity between the standard representation information of N subtrees and the standard representation information of M historical subtrees is used for reflecting the similarity between the N subtrees and the M historical subtrees.

3. The method of claim 2, wherein any of the N subtrees is a first subtree, the dataset comprising M standard representation information for the M historical subtrees;

the performing, based on the standard representation information of the N subtrees, the first matching processing of resource data on the N subtrees in the dataset includes:

obtaining information similarity between standard representation information of the first subtree and the M pieces of standard representation information respectively;

Determining standard representation information with the maximum information similarity between the standard representation information of the M pieces of standard representation information and the standard representation information of the first subtree as undetermined representation information;

if the information similarity between the standard representation information and the undetermined representation information of the first subtree is greater than or equal to a set similarity threshold value, using the resource data with a mapping relation between the data set and the undetermined representation information as the screened resource data matched with the first subtree;

4. A method as claimed in claim 3, wherein the method further comprises:

If the information similarity between the standard representation information and the undetermined representation information of the first subtree is smaller than the similarity threshold, determining that the resource data matched with the first subtree is not screened in the data set;

5. The method of claim 4, wherein the determining the reference resource data for the N sub-trees based on the first matching result from the first matching process comprises:

And if the first matching result comprises resource data which is not screened to be matched with the first subtree, determining the reference resource data of the first subtree based on the number of operator nodes contained in the first subtree.

6. The method of claim 5, wherein the determining the reference resource data for the first sub-tree based on the number of operator nodes that the first sub-tree comprises:

when the first subtree contains one operator node of the candidate plan tree, taking resource data consumed by table data called by the first subtree as reference resource data of the first subtree;

And when the first subtree contains a plurality of operator nodes of the candidate plan tree, performing second matching processing on the resource data of the first subtree in the data set, and determining reference resource data of each operator node in the first subtree based on a second matching result obtained by the second matching processing, wherein the reference resource data of the first subtree comprises the reference resource data of each operator node in the first subtree, and the second matching processing is used for screening matched resource data for each operator node in the first subtree through similarity between each operator node in the first subtree and the M historical subtrees.

7. The method of claim 6, wherein any operator node in the first sub-tree is a target operator node, wherein determining the reference resource data for each operator node in the first sub-tree based on the second matching result from the second matching process comprises:

if the second matching result comprises the screened resource data matched with the target operator node, the screened resource data matched with the target operator node is used as the reference resource data of the target operator node;

And if the second matching result comprises resource data which is not screened to be matched with the target operator node, taking the resource data consumed by the table data called by the target operator node as the reference resource data of the target operator node.

8. The method of claim 2, wherein any one of the N sub-trees is a first sub-tree, wherein the performing standardized representation processing on each sub-tree of the N sub-trees to generate the standardized representation information of each sub-tree comprises:

Carrying out data analysis processing on the first subtree from K data fields to obtain the representation data of the first subtree under the K data fields, wherein K is a positive integer;

Performing representation processing on the representation data of the first subtree in the K data fields according to a set standard format to generate standard representation information of the first subtree;

9. The method of claim 1, wherein the process of obtaining the N sub-trees of the candidate plan tree comprises:

and taking the first type subtree and the second type subtree as the N subtrees obtained by dividing the candidate planning tree.

10. The method of claim 1, wherein there are a plurality of candidate plan trees for the operation statement, the method further comprising:

And executing the operation statement by adopting the target plan tree.

11. The method of claim 10, wherein the resource data consumed by each historical subtree in the dataset has a mapping relationship with the respective standard representation information of each historical subtree, and any subtree of the target plan tree is a second subtree;

after executing the target planning tree, the method further comprises:

12. The method of claim 11, wherein the obtaining the resource data consumed by the second sub-tree during execution comprises:

13. The method of claim 1, wherein obtaining a flow of the candidate plan tree for the operation statement comprises:

And performing conversion processing on the logic plan tree to generate a plurality of candidate plan trees of the operation statement.

14. The method of claim 1, wherein the estimating the execution cost corresponding to the candidate planning tree using the reference resource data of the N subtrees comprises:

And calling the cost estimation model to estimate the execution cost corresponding to the candidate plan tree based on the reference resource data of the N subtrees.

15. A data processing apparatus, the apparatus comprising:

The first acquisition module is used for acquiring a candidate plan tree of an operation sentence and N sub-trees of the candidate plan tree, wherein the candidate plan tree is composed of the N sub-trees, any sub-tree comprises one or more operator nodes in the candidate plan tree, the N sub-trees are not overlapped, N is a positive integer, and the candidate plan tree is used for executing the operation sentence;

The second acquisition module is used for acquiring a data set, wherein the data set comprises resource data consumed by M historical subtrees in the historical execution process, the resource data consumed by any historical subtree comprises input and output data, consumed memory data and consumed processor data of any historical subtree, the historical subtree is a subtree of a historical plan tree which is executed in a historical manner, and M is a positive integer;

The matching module is used for carrying out first matching processing on the resource data of the N subtrees in the data set, and the first matching processing is used for screening matched resource data for the N subtrees through the similarity between the N subtrees and the M historical subtrees;

The determining module is used for determining the reference resource data of the N subtrees based on a first matching result obtained by the first matching process;

16. A computer program product comprising a computer program which, when executed by a processor, implements the steps of the method of any of claims 1-14.

17. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method of any of claims 1-14.

18. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program adapted to be loaded by a processor and to perform the steps of the method according to any of claims 1-14.