CN119847736B - Heterogeneous computing power integration and dynamic optimization allocation method and system - Google Patents

Heterogeneous computing power integration and dynamic optimization allocation method and system

Info

Publication number
CN119847736B
CN119847736B (Application CN202411880942.3A)
Authority
CN
China
Prior art keywords
data
node
task
computing power
heartbeat packet
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411880942.3A
Other languages
Chinese (zh)
Other versions
CN119847736A (en)
Inventor
郝放
郭沙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Xingsheng Digital Technology Co ltd
Original Assignee
Shenzhen Xingsheng Digital Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Xingsheng Digital Technology Co ltd
Priority to CN202411880942.3A
Publication of CN119847736A
Application granted
Publication of CN119847736B
Legal status: Active
Anticipated expiration: pending

Abstract

The invention discloses a heterogeneous computing power fusion and dynamic optimization allocation method and system, relates to the technical field of computing power allocation, and aims to solve the problem that received tasks cannot be well fused with heterogeneous computing power. The invention analyzes real-time data against the monitoring indexes of each node, which supports more refined management of computing power resources. By analyzing the data, problems such as unbalanced resource use and excessive node load can be discovered, and accurate mapping between data is achieved by explicitly defining the key fields in the computing power task allocation data and the computing node connection data. This helps ensure that tasks are assigned to nodes with the appropriate resources and status. After the data mapping is completed, network layer fusion is performed, which supports more efficient task execution and resource utilization in computing power allocation.

Description

Heterogeneous computing power fusion and dynamic optimization allocation method and system
Technical Field
The invention relates to the technical field of computing power allocation, in particular to a heterogeneous computing power fusion and dynamic optimization allocation method and system.
Background
Computing power allocation refers to the process of reasonably allocating the processing power required for a computing task to the available computing resources.
Chinese patent publication No. CN116862126A discloses a computing power allocation method and device under the fusion of an energy network and a computing power network, which mainly uses the green-electricity proportion and the remaining computing power of each computing power node during allocation. In that method, a computing demand instruction comprises a task to be computed and computing power requirement data; the currently allocatable nodes are determined according to the green-electricity proportion and the remaining computing power of each computing power node, the computing demand instruction is distributed to an allocatable node, and that node computes according to the instruction. That patent document addresses computing power allocation but still has the following problems:
1. No targeted load policy adjustment is performed according to the specific situation of the actual requested task, which reduces the load capacity of the data.
2. The received task is not fused with the current network layer through targeted nodes, so the computing power data fusion effect is poor.
3. The implementation process is not monitored in real time, so abnormal data arising during implementation cannot be warned of quickly.
Disclosure of Invention
The invention aims to provide a heterogeneous computing power fusion and dynamic optimization allocation method and system that perform real-time data analysis on the monitoring indexes of each node, which supports more refined computing power resource management. By analyzing the data, problems such as unbalanced resource use and excessive node load can be discovered, and accurate mapping between data is achieved by explicitly defining the key fields in the computing power task allocation data and the computing node connection data. This helps ensure that tasks are assigned to nodes with the appropriate resources and status. After the data mapping is completed, network layer fusion is carried out, which supports more efficient task execution and resource utilization in computing power allocation and can solve the problems in the prior art.
In order to achieve the above purpose, the present invention provides the following technical solutions:
The heterogeneous computing power fusion and dynamic optimization allocation method comprises the following steps:
S1, central computing capacity confirmation, namely collecting hardware configuration information from an acquisition interface, processing the collected hardware configuration information, extracting key parameters, performing a benchmark test according to the extracted key parameters, and obtaining real-time computing capacity data after the benchmark test;
S2, load balancing strategy formulation, namely analyzing load conditions according to the real-time computing capacity data and the standard computing capacity data, and performing load balancing strategy formulation according to the load conditions, wherein the real-time computing capacity strategy data is obtained after the load balancing strategy formulation is completed;
S3, node task allocation, namely performing task demand analysis on the received task request, performing task allocation of the nodes on the analyzed task by utilizing real-time computing capacity strategy data, and obtaining computing power task allocation data after the task allocation of the nodes is completed;
S4, node information cooperative processing, namely establishing a node communication mechanism for each node in the computing power task allocation data, and obtaining computing power node connection data after the establishment of the communication mechanism is completed;
S5, task network fusion management processing, namely performing data mapping on the computing power task allocation data and the computing power node connection data, and performing network layer fusion after the data mapping is completed, so as to obtain standard computing power task allocation data after the network layer fusion.
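As a minimal, hypothetical sketch of steps S1 to S5 (the patent specifies steps rather than code, so every function name below is an illustrative placeholder injected as a parameter):

```python
# Illustrative outline of the S1-S5 flow; the concrete behaviors are passed
# in as callables because the patent does not prescribe an implementation.

def run_pipeline(nodes, tasks, collect, benchmark, plan_load, assign, connect, fuse):
    capacity = [benchmark(collect(n)) for n in nodes]   # S1: capacity confirmation
    policy = plan_load(capacity)                        # S2: load balancing policy
    allocation = assign(tasks, policy)                  # S3: node task allocation
    connections = connect(allocation)                   # S4: node communication
    return fuse(allocation, connections)                # S5: network layer fusion
```

Each later section of the disclosure fills in one of these callables with more detail.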
Preferably, for the step S1, collecting the hardware configuration information in the acquisition interface, processing the collected hardware configuration information, extracting key parameters, and performing a benchmark test according to the extracted key parameters, including:
The hardware configuration information in the acquisition interface comprises processor information, memory information, storage equipment information, network interface information, a computing unit and resource use information;
After the hardware configuration information is collected, extracting key parameters of the hardware configuration information;
The key parameters of the processor information comprise a core number, a thread number, a main frequency, a maximum acceleration frequency and a cache size, the key parameters of the memory information comprise a total capacity, a type, a speed and a time sequence, the key parameters of the storage equipment information comprise a type, a capacity, a read-write speed and an interface type, the key parameters of the network interface information comprise a type, a speed, an MAC address and an IP address, the key parameters of the computing unit comprise a model number, a CUDA core number, a video memory size, a video memory type and a computing capacity, and the key parameters of the resource use information comprise a CPU use rate, a memory use rate, a disk frequency, a disk speed and a network flow;
extracting key parameters of the hardware configuration information and then performing a benchmark test;
And confirming the test result according to the reference test result, and generating real-time computing capacity data of the confirmed test result.
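A minimal sketch of the key-parameter extraction in S1. The category and field names follow the lists above, but the exact schema (dictionary keys, units) is an assumption:

```python
# Hypothetical key-parameter schema for S1; categories follow the patent
# text, field names and shapes are illustrative assumptions.

KEY_PARAMS = {
    "processor": ["cores", "threads", "base_ghz", "boost_ghz", "cache_mb"],
    "memory":    ["total_gb", "type", "speed_mhz", "timings"],
    "storage":   ["type", "capacity_gb", "rw_speed_mbps", "interface"],
    "network":   ["type", "speed_gbps", "mac", "ip"],
    "gpu":       ["model", "cuda_cores", "vram_gb", "vram_type", "compute_cap"],
    "usage":     ["cpu_pct", "mem_pct", "disk_iops", "disk_mbps", "net_mbps"],
}

def extract_key_params(raw_info):
    """Keep only the key parameters from raw hardware configuration info."""
    return {cat: {k: raw_info.get(cat, {}).get(k) for k in keys}
            for cat, keys in KEY_PARAMS.items()}
```

The benchmark test would then run against the extracted parameters, with missing categories surfacing as `None` values.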
Preferably, for the load situation analysis according to the real-time computing capability data and the standard computing capability data in the step S2, the load balancing policy formulation according to the load situation includes:
retrieving standard computing capacity data from a database, wherein the standard computing capacity data is a standard expected performance index or a statistical average value of historical performance data;
performing data comparison on all performance indexes with the same attribute in the real-time computing capacity data and the standard computing capacity data, and calculating a difference value of each performance index according to a data comparison result;
carrying out load condition analysis according to the difference value;
the load condition is divided into normal load, light load and heavy load, wherein the normal load is the case when the percentage of the difference value is in the range of 80% -100%, the light load is the case when the percentage of the difference value is less than 80%, and the heavy load is the case when the percentage of the difference value is greater than 100%;
Constructing a load analysis matrix according to the analyzed load condition, and marking load balancing operation according to the constructed load matrix;
making a load balancing strategy according to the marked load balancing operation in the load analysis matrix, wherein the load balancing strategy comprises task migration, flow regulation and resource redistribution;
And after the load balancing strategy is formulated, generating the real-time computing capacity strategy data.
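A sketch of the S2 load classification, interpreting "the percentage of the difference value" as the real-time figure relative to the standard figure. The 80%-100% bands come from the text; the band-to-strategy mapping is an assumed example:

```python
# Load-condition classification per the 80%-100% bands above; the mapping
# from load band to balancing strategy is an illustrative assumption.

def classify_load(realtime, standard):
    pct = realtime / standard * 100
    if pct < 80:
        return "light"
    if pct > 100:
        return "heavy"
    return "normal"

def balancing_action(load):
    # heavy nodes shed work, light nodes absorb it, normal nodes keep flow steady
    return {"heavy": "task migration",
            "light": "resource reallocation",
            "normal": "flow regulation"}[load]
```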
Preferably, task demand analysis is performed on the task request received in the step S3, and the task to be analyzed is subjected to task allocation of the node by using the real-time computing capability policy data, including:
after the computing power network receives the task request of the upper layer application, analyzing the task requirement of the task request;
The parsed task demands include task type, resource demand, execution time, priority, and data dependency;
matching calculation is carried out on the task demand data by using a task matching calculation method, and a matching plan of the task demand data is obtained after the matching calculation is completed;
sorting matching data in the real-time computing capacity policy data, wherein the matching data comprises node IDs, available resources, current loads and node states;
Distributing the matching plan to the node with the minimum load in the matching data by using a greedy algorithm;
When the nodes in the matching data are overloaded, the dynamic load balancing is utilized to redistribute part of tasks to the nodes with lower loads;
And obtaining the calculation task allocation data after the matching plan is allocated to the matching data.
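The greedy least-load assignment and the dynamic rebalancing step can be sketched as follows. Data shapes and the 80% overload threshold are illustrative assumptions, not values from the disclosure:

```python
# Greedy least-load assignment (S3) with a simple rebalancing pass that
# moves tasks off overloaded nodes; threshold and shapes are assumptions.

def assign(tasks, nodes, overload=0.8):
    # nodes: {node_id: available_resources}; tasks: {task_id: demand}
    load = {nid: 0.0 for nid in nodes}
    placement = {}
    for tid, demand in sorted(tasks.items(), key=lambda kv: -kv[1]):
        nid = min(load, key=lambda n: load[n] / nodes[n])  # least relative load
        load[nid] += demand
        placement[tid] = nid
    # dynamic load balancing: redistribute tasks from overloaded nodes
    for tid, nid in list(placement.items()):
        if load[nid] / nodes[nid] > overload:
            target = min(load, key=lambda n: load[n] / nodes[n])
            if target != nid and (load[target] + tasks[tid]) / nodes[target] <= overload:
                load[nid] -= tasks[tid]
                load[target] += tasks[tid]
                placement[tid] = target
    return placement, load
```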
Preferably, the establishing of the node communication mechanism is performed for each node in the computing power task allocation data in S4, including:
identifying the node ID of each node in the calculation task allocation data;
selecting a communication protocol according to the node ID, wherein the communication protocol comprises TCP, UDP and MPI;
Confirming parameters of the selected communication protocol after the communication protocol is selected, wherein the parameters comprise port numbers, communication modes and data transmission formats;
after the parameter confirmation of the communication protocol is completed, establishing communication connection between nodes, wherein the communication connection comprises heartbeat mechanism connection or bidirectional connection;
the heartbeat mechanism connection sends heartbeat packets to each node at fixed intervals, and the survival state of each node is confirmed according to the heartbeat packet parameters;
and after the communication connection confirmation is completed, obtaining connection data of the receiving task and the computing power node in the computing network.
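A minimal TCP heartbeat exchange between two "nodes" (S4), using local threads in place of separate machines. The protocol choice and message format are illustrative; the patent also allows UDP or MPI:

```python
# Minimal TCP heartbeat sketch: one node answers heartbeats, another sends
# them and treats the ALIVE reply as confirmation of the survival state.
import socket
import threading
import queue

def node_server(portq, seen):
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))        # ephemeral port, published via the queue
    srv.listen(1)
    portq.put(srv.getsockname()[1])
    conn, _ = srv.accept()
    seen.append(conn.recv(16))        # receive one heartbeat packet
    conn.sendall(b"ALIVE")            # confirm survival state
    conn.close()
    srv.close()

def send_heartbeat(port):
    c = socket.create_connection(("127.0.0.1", port))
    c.sendall(b"HEARTBEAT")
    reply = c.recv(16)
    c.close()
    return reply == b"ALIVE"
```

A real deployment would add the protocol parameters named above (port numbers, communication mode, data transmission format) to a per-node configuration.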
Preferably, setting a heartbeat packet sending time interval corresponding to the heartbeat mechanism includes:
Extracting the total number of nodes;
Extracting the data transmission duration of a data packet of unit data volume for each node, wherein the unit data volume ranges from 10 MB to 100 MB;
extracting the data transmission rate corresponding to each node;
Under the current data transmission environment corresponding to the node, acquiring the data transmission time length of the data packet of the unit data quantity of the node and the data transmission rate corresponding to each node, and acquiring the heartbeat packet transmission time interval coefficient corresponding to each node;
The heartbeat packet sending time interval coefficient corresponding to each node is obtained through the following formula:
wherein F represents the heartbeat packet sending time interval coefficient of each node; n represents the number of data transmissions of each node; Bi represents the data transmission rate of the i-th data transmission of each node; Be represents the data transmission rate of a data packet of unit data volume for each node; Ti represents the data transmission duration of the i-th data transmission of each node; Te represents the data transmission duration of a data packet of unit data volume for the node; Xi represents the number of unit data volumes contained in the i-th data transmission of each node; and Ts represents the theoretical data transmission duration of a data packet of unit data volume at the maximum data transmission rate;
acquiring heartbeat packet transmission time intervals corresponding to all nodes according to the heartbeat packet transmission time interval coefficients;
the heartbeat packet sending time interval is obtained through the following formula:
wherein G represents the heartbeat packet sending time interval; G0 represents the preset initial heartbeat packet sending time interval; m represents the total number of nodes; Fi represents the heartbeat packet sending time interval coefficient of the i-th node; Fz represents the intermediate value of the heartbeat packet sending time interval coefficients of the m nodes; Fcmax represents the maximum difference in heartbeat packet sending time interval coefficients between every two nodes with a data transmission interaction connection among the m nodes; Fb represents the standard deviation of the heartbeat packet sending time interval coefficients of the m nodes; and Fcb represents the standard deviation of the differences in heartbeat packet sending time interval coefficients between every two nodes with a data transmission interaction connection;
Sending heartbeat packets to each node according to the heartbeat packet sending time interval;
And monitoring the node survival rate and the node revival rate corresponding to each heartbeat packet transmission in real time, and adjusting the heartbeat packet transmission time interval according to the node survival rate and the node revival rate.
Preferably, the node survival rate and the node revival rate corresponding to each heartbeat packet sending are monitored in real time, and the heartbeat packet sending time interval is adjusted according to the node survival rate and the node revival rate, including:
Extracting node survival rate and node revival rate which are correspondingly obtained by sending heartbeat packets each time;
comparing the node survival rate with a preset node survival rate threshold;
When the node survival rate is lower than a preset node survival rate threshold value, the node survival rate and the node revival rate are utilized to adjust the heartbeat packet sending time interval, and the adjusted heartbeat packet sending time interval is obtained;
the adjusted heartbeat packet sending time interval is obtained through the following formula:
wherein Gt represents the adjusted heartbeat packet sending time interval; G0 represents the preset initial heartbeat packet sending time interval; k represents the number of heartbeat packet transmissions before the node survival rate fell below the preset node survival rate threshold; Pc represents the node survival rate at the moment it fell below the preset threshold; Pf represents the node revival rate at that moment; Pci represents the node survival rate of the i-th heartbeat packet transmission; Pfi represents the node revival rate of the i-th heartbeat packet transmission; and Pcy represents the preset node survival rate threshold;
and sending the heartbeat packet to each node according to the adjusted heartbeat packet sending time interval.
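The interval formulas in the original filing are given as figures that do not survive in this text, so the following is only a simplified stand-in for the adjustment idea: probe faster when survival drops below the threshold, softened by how many nodes revive:

```python
# Simplified stand-in for the heartbeat interval adjustment; NOT the
# patent's formula (which is not reproduced in this text).

def adjust_interval(g0, survival, revival, threshold=0.9):
    """Return the next heartbeat interval given current survival/revival rates."""
    if survival >= threshold:
        return g0                     # healthy: keep the configured interval
    # below threshold: shrink proportionally to the drop in survival,
    # relaxed by the fraction of the lost nodes that revived
    factor = max(survival + revival * (1 - survival), 0.1)
    return g0 * factor
```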
Preferably, in S5, the data mapping is performed on the computing task allocation data and the computing node connection data, and the network layer fusion is performed after the data mapping is completed, including:
Confirming key fields in the computing power task allocation data and the computing power node connection data, wherein the key fields of the computing power task allocation data comprise task IDs, allocated node IDs, task types, required resources, priorities and data dependencies;
Invoking a mapping rule from a database, and carrying out data mapping on the computing task allocation data and the computing node connection data according to the mapping rule;
the data mapping flow comprises extracting task information from the computing power task allocation data according to the task ID and the allocated node ID, finding out corresponding node data from the computing power node connection data, and comparing;
When the comparison result meets the mapping rule, mapping the task and the node, and obtaining task mapping data after the mapping is completed;
and when the comparison result does not meet the mapping rule, retrying the mapping task or reevaluating the task allocation strategy from the standby node until the comparison result meets the mapping rule.
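A sketch of the key-field mapping between task allocation data and node connection data (S5). Record layouts and the concrete mapping rule (node exists, is alive, has the required resources) are assumptions consistent with the key fields listed above:

```python
# Data mapping sketch for S5: join tasks to nodes on the assigned node ID
# and check a simple mapping rule; failures go to a retry list.

def map_tasks(task_alloc, node_conn):
    nodes = {n["node_id"]: n for n in node_conn}
    mapped, retry = [], []
    for task in task_alloc:
        node = nodes.get(task["assigned_node_id"])
        # mapping rule: node must exist, be alive, and have the resources
        if node and node["alive"] and node["free"] >= task["required"]:
            mapped.append({"task_id": task["task_id"], "node_id": node["node_id"]})
        else:
            retry.append(task["task_id"])   # retry on a standby node / re-plan
    return mapped, retry
```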
Preferably, in S5, the data mapping is performed on the computing task allocation data and the computing node connection data, and network layer fusion is performed after the data mapping is completed, which further includes:
The task mapping data are subjected to network layer fusion. Before the network layer fusion, the task mapping data are organized into a fusion data set, which includes extracting the relevant information of each task from the mapped task mapping data and extracting the latest state and resource utilization of each node from the computing power node connection data;
Updating node data according to task execution states in the fusion data set, wherein the node data comprises the task number, the current load and available resources allocated to each node, and when the load of the node exceeds a preset range, the node data is dynamically adjusted;
and carrying out data set integration on the data after node data updating or dynamic adjustment, and obtaining standard calculation task allocation data after the data set integration.
A heterogeneous computing power fusion and dynamic optimization allocation system, comprising:
A task execution monitoring unit for:
when the standard calculation task allocation data is used for carrying out task implementation, the standard calculation task allocation data is monitored in real time;
Before real-time monitoring is carried out on the standard calculation task allocation data, monitoring indexes are confirmed, wherein the monitoring indexes comprise task states, execution progress, resource use conditions, node loads and abnormal events;
In the process of carrying out tasks by standard calculation task allocation data, monitoring the monitoring index of each node in real time;
real-time data analysis of standard calculation task allocation data is carried out according to the monitoring index of each node;
performing early warning prompt according to the analyzed real-time data condition, wherein the early warning prompt is that the threshold value of the real-time data exceeds a preset range, and marking the real-time data exceeding the preset range as early warning data;
the early warning data is generated by adjusting reports according to the abnormality degree;
and finally, transmitting the early warning data and the adjustment report to a display terminal for visual display.
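The threshold-based early-warning step of the monitoring unit can be sketched as below. The index names follow the text; the thresholds and the two-level "abnormality degree" grading are illustrative assumptions:

```python
# Early-warning sketch: metrics outside their preset range become warning
# data, graded by how far past the threshold they are.

THRESHOLDS = {"cpu_pct": 90, "mem_pct": 85, "node_load": 0.8}

def check_metrics(sample):
    """Mark any monitored metric outside its preset range as early-warning data."""
    warnings = {k: v for k, v in sample.items()
                if k in THRESHOLDS and v > THRESHOLDS[k]}
    # grade by abnormality degree: >20% past the threshold counts as severe
    report = {k: "severe" if v > THRESHOLDS[k] * 1.2 else "moderate"
              for k, v in warnings.items()}
    return warnings, report
```

The warnings and the graded report would then be pushed to the display terminal for visualization.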
Compared with the prior art, the invention has the following beneficial effects:
1. In the heterogeneous computing power fusion and dynamic optimization allocation method and system provided by the invention, the selected communication protocol has a reliable transmission mechanism and error recovery capability, ensuring the integrity and accuracy of data during transmission. Security measures such as encryption and authentication further strengthen the security of data transmission and protect data privacy. By adjusting the parameters and configuration of the communication protocol, the method adapts conveniently to computing power networks of different scales and types. Dynamic load balancing redistributes part of the tasks to nodes with lower loads; this dynamic adjustment mechanism helps keep the computing power network stable and efficient and avoids performance bottlenecks caused by single-point overload.
2. According to the heterogeneous computing power fusion and dynamic optimization distribution method and system, accurate mapping between data is achieved through defining key fields in computing power task distribution data and computing power node connection data. This helps ensure that tasks are properly assigned to nodes with the proper resources and status. After the data mapping is completed, network layer fusion is performed, which helps to achieve more efficient task execution and resource utilization in computing power distribution.
3. According to the heterogeneous computing power fusion and dynamic optimization distribution method and system, real-time data analysis is carried out on the monitoring index of each node, so that finer computing power resource management is facilitated. By analyzing the data, the problems of unbalanced resource use, excessive node load and the like can be found, so that powerful support is provided for optimizing calculation power distribution and improving resource utilization rate.
Drawings
FIG. 1 is a schematic diagram of the heterogeneous computing force fusion and dynamic optimization distribution steps of the present invention;
FIG. 2 is a schematic diagram of the heterogeneous computing force fusion and dynamic optimization distribution flow of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In order to solve the problem that in the prior art, no targeted load policy adjustment is performed according to the specific situation of the actual request task, thereby resulting in a decrease in the load capacity of data, referring to fig. 1 and 2, the present embodiment provides the following technical solutions:
The heterogeneous computing power fusion and dynamic optimization allocation method comprises the following steps:
s1, confirming central computing capacity, namely collecting hardware configuration information in an acquisition interface, processing the collected hardware configuration information, extracting key parameters, performing benchmark test according to the extracted key parameters, and obtaining real-time computing capacity data after the benchmark test;
the method comprises the steps of collecting hardware configuration information in detail and performing benchmark test, so that potential performance problems or hardware faults can be found and solved in time;
S2, load balancing strategy formulation, namely analyzing load conditions according to the real-time computing capacity data and the standard computing capacity data, and performing load balancing strategy formulation according to the load conditions, wherein the real-time computing capacity strategy data is obtained after the load balancing strategy formulation is completed;
The computing power resources can be reasonably allocated and scheduled through the formulation and execution of the load balancing strategy, so that the waste and bottleneck of the resources are avoided;
S3, node task allocation, namely performing task demand analysis on the received task request, performing task allocation of the nodes on the analyzed task by utilizing real-time computing capacity strategy data, and obtaining computing power task allocation data after the task allocation of the nodes is completed;
the dynamic adjustment mechanism is helpful for maintaining the stability and the high efficiency of the power network and avoiding the performance bottleneck caused by single-point overload;
S4, node information cooperative processing, namely establishing a node communication mechanism for each node in the computing power task allocation data, and obtaining computing power node connection data after the establishment of the communication mechanism is completed;
The power calculation network can be conveniently adapted to different scales and types by adjusting the parameters and the configuration of the communication protocol, and the power calculation resource can be intelligently scheduled according to the performance and the load condition of the node;
S5, processing task network fusion management, namely performing data mapping on the computing task distribution data and computing node connection data, and performing network layer fusion after the data mapping is completed, wherein standard computing task distribution data is obtained after the network layer fusion;
the adjustability of the mapping rules enables the system to be customized according to actual requirements, and therefore flexibility and accuracy of task allocation are improved.
Aiming at the hardware configuration information in the acquisition interface in the step S1, the collected hardware configuration information is processed and key parameters are extracted, and the benchmark test is carried out according to the extracted key parameters, comprising the following steps:
The hardware configuration information in the acquisition interface comprises processor information, memory information, storage equipment information, network interface information, a computing unit and resource use information;
After the hardware configuration information is collected, extracting key parameters of the hardware configuration information;
The key parameters of the processor information comprise a core number, a thread number, a main frequency, a maximum acceleration frequency and a cache size, the key parameters of the memory information comprise a total capacity, a type, a speed and a time sequence, the key parameters of the storage equipment information comprise a type, a capacity, a read-write speed and an interface type, the key parameters of the network interface information comprise a type, a speed, an MAC address and an IP address, the key parameters of the computing unit comprise a model number, a CUDA core number, a video memory size, a video memory type and a computing capacity, and the key parameters of the resource use information comprise a CPU use rate, a memory use rate, a disk frequency, a disk speed and a network flow;
extracting key parameters of the hardware configuration information and then performing a benchmark test;
And confirming the test result according to the reference test result, and generating real-time computing capacity data of the confirmed test result.
Specifically, the hardware configuration information is collected in detail through the acquisition interface and its key parameters are extracted, so that the specific configuration and capability of each computing node can be accurately known. This enables more refined resource allocation during computing power distribution, ensuring that tasks are dispatched to the most appropriate computing nodes and thereby improving the utilization of computing resources and the efficiency of the whole computing power network. A benchmark test is then run against the key parameters of the hardware configuration information to evaluate the actual performance of each computing node. The benchmark results serve as an important basis for computing power allocation: computing power resources can be dynamically adjusted and optimized according to task requirements, meeting AI application demands of different scales and types. Through the detailed collection of hardware configuration information and the benchmark test, potential performance problems or hardware faults can be discovered and resolved in time, ensuring that tasks are completed successfully, avoiding computing power interruptions or task failures caused by hardware problems, and guaranteeing the reliability of the computing power network, which in turn promotes the continuous development of computing power network technology and raises the overall computing power level.
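The S1 collection-and-benchmark step can be sketched as follows. This is a minimal illustration using only the Python standard library; the field names (`cpu_count`, `machine`) and the toy arithmetic workload are assumptions for demonstration, not the patent's acquisition interface or Benchmark suite.

```python
import os
import platform
import time

def collect_hardware_info():
    """Collect a few key hardware parameters; field names are illustrative."""
    return {
        "cpu_count": os.cpu_count(),
        "machine": platform.machine(),
        "python_impl": platform.python_implementation(),
    }

def benchmark_cpu(iterations=200_000):
    """A toy CPU benchmark: time a fixed arithmetic workload and derive a rate."""
    start = time.perf_counter()
    total = 0
    for i in range(iterations):
        total += i * i
    elapsed = time.perf_counter() - start
    return {
        "iterations": iterations,
        "seconds": elapsed,
        "ops_per_second": iterations / elapsed if elapsed > 0 else float("inf"),
    }

info = collect_hardware_info()      # hardware configuration information
result = benchmark_cpu()            # real-time computing capacity data
```

A real deployment would replace the toy loop with workload-representative benchmarks per device class (CPU, GPU, storage, network).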
Aiming at the load situation analysis according to the real-time computing capacity data and the standard computing capacity data in the step S2, the load balancing strategy formulation according to the load situation comprises the following steps:
retrieving standard computing capacity data from a database, wherein the standard computing capacity data is a standard expected performance index or a statistical average value of historical performance data;
performing data comparison on all performance indexes with the same attribute in the real-time computing capacity data and the standard computing capacity data, and calculating a difference value of each performance index according to a data comparison result;
carrying out load condition analysis according to the difference value;
the load condition is divided into normal load, light load and heavy load, wherein the normal load is the case when the percentage of the difference value is in the range of 80% -100%, the light load is the case when the percentage of the difference value is less than 80%, and the heavy load is the case when the percentage of the difference value is greater than 100%;
Constructing a load analysis matrix according to the analyzed load condition, and marking load balancing operation according to the constructed load matrix;
making a load balancing strategy according to the marked load balancing operation in the load analysis matrix, wherein the load balancing strategy comprises task migration, flow regulation and resource redistribution;
And after the load balancing strategy is formulated, generating the real-time computing capacity strategy data.
Specifically, the current load condition can be rapidly identified through the comparison of the real-time computing capacity data and the standard computing capacity data, so that the adjustment can be timely made. The establishment and execution of the load balancing strategy are based on accurate data analysis and judgment, so that efficient utilization of resources is ensured, statistical average values or standard expected performance indexes are used as standard computing capacity data, and a reliable reference is provided for comparison of the real-time computing capacity data. By calculating the percentage of the difference value, the load conditions (normal load, light load and heavy load) can be accurately divided, accurate information is provided for the subsequent load balancing strategy formulation, and various load balancing strategies including task migration, flow regulation and resource redistribution are formulated according to different load conditions, so that the method can adapt to various complex network environments and service demands. The flexibility of the load balancing strategy is also embodied in that the strategy can be continuously adjusted and optimized according to the generation of the real-time computing capacity strategy data so as to adapt to the dynamic change of the network environment, and the computing power resources can be reasonably allocated and scheduled through the formulation and execution of the load balancing strategy, so that the waste and bottleneck of the resources are avoided. And under heavy load, the pressure can be relieved through resource redistribution, the normal operation of the service is ensured, and the intelligent decision support is provided for the subsequent load balancing strategy formulation by constructing a load analysis matrix and carrying out load balancing operation marking.
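The load classification above can be sketched as a direct comparison of same-attribute indicators. This assumes one plausible reading of the "percentage of the difference value", namely the real-time value expressed as a percentage of the standard value; the dictionary layout is illustrative.

```python
def classify_load(real_time, standard):
    """Classify each shared performance indicator into the 80%-100% bands:
    normal (80%-100%), light (<80%), heavy (>100%)."""
    report = {}
    for key in real_time.keys() & standard.keys():
        pct = real_time[key] / standard[key] * 100  # real-time as % of standard
        if pct > 100:
            report[key] = "heavy"
        elif pct < 80:
            report[key] = "light"
        else:
            report[key] = "normal"
    return report

loads = classify_load(
    {"cpu_util": 95, "mem_util": 60, "net_util": 120},
    {"cpu_util": 100, "mem_util": 100, "net_util": 100},
)
# cpu_util -> normal, mem_util -> light, net_util -> heavy
```

Each per-indicator label would then populate one cell of the load analysis matrix before the balancing operations (task migration, flow regulation, resource redistribution) are marked.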
Performing task demand analysis on the task request received in the step S3, performing task allocation on nodes on the analyzed task by utilizing the real-time computing capacity policy data, and comprising the following steps:
after the computing power network receives the task request of the upper layer application, analyzing the task requirement of the task request;
The parsed task demands include task type, resource demand, execution time, priority, and data dependency;
matching calculation is carried out on the task demand data by using a task matching calculation method, and a matching plan of the task demand data is obtained after the matching calculation is completed;
sorting matching data in the real-time computing capacity policy data, wherein the matching data comprises node IDs, available resources, current loads and node states;
Distributing the matching plan to the node with the minimum load in the matching data by using a greedy algorithm;
When the nodes in the matching data are overloaded, the dynamic load balancing is utilized to redistribute part of tasks to the nodes with lower loads;
And obtaining the calculation task allocation data after the matching plan is allocated to the matching data.
Specifically, the detailed task demand analysis is performed on the received task request, and the detailed task demand analysis includes key information such as task type, resource demand, execution time, priority, data dependency and the like. The comprehensive analysis is helpful to ensure more accurate and efficient subsequent task allocation, the scheme can quickly generate a matching plan according to task demand data through a task matching calculation method, then, the matching plan is allocated to a node with the smallest load by using a greedy algorithm, the utilization of computational power resources is facilitated to be optimized, waiting time and calculation cost are reduced, and when overload conditions occur to the node in the matching data, the scheme can redistribute partial tasks to the node with lower load by using a dynamic load balancing technology. The dynamic adjustment mechanism is beneficial to maintaining the stability and the high efficiency of the computational power network, avoiding the performance bottleneck caused by single-point overload, and remarkably improving the utilization rate of computational power resources through accurate task matching and distribution and dynamic load balancing adjustment. The method is beneficial to reducing the computational effort cost, improving the overall calculation efficiency, and dynamically adjusting the distribution of computational effort resources according to the change of task demands, thereby enhancing the flexibility and the expandability of the system. This enables the power network to better accommodate AI application requirements of different scales and types.
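The greedy assignment described above — each matching plan goes to the feasible node with the smallest current load, with infeasible tasks left for re-evaluation — can be sketched as below. The node record shape (`free`, `load`) is an assumption for illustration.

```python
def assign_tasks(tasks, nodes):
    """Greedy task allocation: each task goes to the feasible node with the
    lowest current load. tasks: list of (task_id, resource_demand);
    nodes: node_id -> {"free": available resources, "load": current load}."""
    plan = {}
    for task_id, demand in tasks:
        candidates = [n for n, s in nodes.items() if s["free"] >= demand]
        if not candidates:
            plan[task_id] = None  # no feasible node: re-evaluate the allocation strategy
            continue
        best = min(candidates, key=lambda n: nodes[n]["load"])
        nodes[best]["free"] -= demand
        nodes[best]["load"] += demand
        plan[task_id] = best
    return plan

nodes = {"n1": {"free": 10, "load": 2}, "n2": {"free": 10, "load": 5}}
plan = assign_tasks([("t1", 4), ("t2", 4)], nodes)
# t1 -> n1 (lower load); updated loads then steer t2 -> n2
```

Dynamic load balancing would periodically re-run a similar pass over already-placed tasks, migrating work off any node whose load exceeds the preset range.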
In order to solve the problem that in the prior art, the received task is not fused with the current network layer by the targeted node, so that the effect of computing force data fusion is poor, referring to fig. 1 and 2, the present embodiment provides the following technical scheme:
Establishing a node communication mechanism for each node in the computing task allocation data in the S4, wherein the node communication mechanism comprises the following steps:
identifying the node ID of each node in the calculation task allocation data;
selecting a communication protocol according to the node ID, wherein the communication protocol comprises TCP, UDP and MPI;
Confirming parameters of the selected communication protocol after the communication protocol is selected, wherein the parameters comprise port numbers, communication modes and data transmission formats;
after the parameter confirmation of the communication protocol is completed, establishing communication connection between nodes, wherein the communication connection comprises heartbeat mechanism connection or bidirectional connection;
the heartbeat mechanism is connected to send heartbeat packets to each node at fixed time, and the survival state of the nodes is confirmed according to the heartbeat packet parameters;
and after the communication connection confirmation is completed, obtaining connection data of the receiving task and the computing power node in the computing network.
Specifically, by identifying the node ID of each node in the computing task allocation data, the system can accurately identify and manage every node and ensure that computing resources are used efficiently. Communication protocols (such as TCP, UDP and MPI) are selected according to the node IDs, and the protocol parameters (such as port numbers, communication modes and data transmission formats) are confirmed, so the system can flexibly adapt to different application scenarios and computing requirements. Establishing a heartbeat mechanism connection lets the system send heartbeat packets to each node at regular intervals and confirm node survival in time, guaranteeing the reliability and stability of computing task allocation. Bidirectional connections enable real-time two-way transmission of tasks and data between nodes, improving the response speed and flexibility of computing task allocation, while the heartbeat connection allows node faults to be discovered and handled promptly, ensuring the stability and availability of the computing power network.
When a node fails, the system can quickly adjust the computing power allocation strategy and schedule tasks onto other available nodes, avoiding task interruption and data loss. Selecting a communication protocol with a reliable transmission mechanism and error-recovery capability ensures the integrity and accuracy of data during transmission, and security measures such as encryption and authentication further strengthen transmission security, protecting user privacy and data safety. By adjusting the parameters and configuration of the communication protocol, the mechanism can conveniently adapt to computing power networks of different scales and types, and computing power resources can be scheduled intelligently according to node performance and load so that tasks are processed efficiently and evenly.
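The survival-state check driven by the heartbeat mechanism can be sketched as a monitor that records the last heartbeat timestamp per node and treats a node as alive while that timestamp is fresh. The timeout policy and class shape are illustrative assumptions, not the patent's implementation.

```python
import time

class HeartbeatMonitor:
    """Track node liveness from heartbeat timestamps: a node counts as alive
    if its last heartbeat arrived within `timeout` seconds."""

    def __init__(self, timeout=3.0):
        self.timeout = timeout
        self.last_seen = {}  # node_id -> monotonic timestamp of last heartbeat

    def record(self, node_id, ts=None):
        """Register a heartbeat packet received from node_id."""
        self.last_seen[node_id] = time.monotonic() if ts is None else ts

    def alive_nodes(self, now=None):
        """Return the set of nodes whose heartbeat is still fresh."""
        now = time.monotonic() if now is None else now
        return {n for n, t in self.last_seen.items() if now - t <= self.timeout}

mon = HeartbeatMonitor(timeout=3.0)
mon.record("n1", ts=100.0)   # recent heartbeat
mon.record("n2", ts=95.0)    # stale heartbeat
alive = mon.alive_nodes(now=101.0)
# only "n1" is within the 3-second window
```

Nodes absent from `alive_nodes()` would trigger the fault-handling path: rescheduling their tasks to other available nodes.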
Specifically, setting a heartbeat packet sending time interval corresponding to the heartbeat mechanism includes:
Extracting the total number of nodes;
Extracting the data transmission time length of a data packet of a unit data volume corresponding to each node, wherein the value range of the unit data volume is 10MB-100MB;
extracting the data transmission rate corresponding to each node;
Under the current data transmission environment corresponding to the node, acquiring the data transmission time length of the data packet of the unit data quantity of the node and the data transmission rate corresponding to each node, and acquiring the heartbeat packet transmission time interval coefficient corresponding to each node;
The heartbeat packet sending time interval coefficient corresponding to each node is obtained through the following formula:
wherein F represents a heartbeat packet transmission time interval coefficient corresponding to each node, n represents the number of data transmission times corresponding to each node, Bi represents the data transmission rate corresponding to the ith data transmission corresponding to each node, Be represents the data transmission rate of the data packet of the unit data amount corresponding to each node, Ti represents the data transmission time length of the ith data transmission corresponding to each node, Te represents the data transmission time length of the data packet of the unit data amount corresponding to the node, xi represents the number of the unit data amount contained in the ith data transmission corresponding to each node, and Ts represents the theoretical data transmission time length of the data packet of the unit data amount corresponding to the maximum data transmission rate;
acquiring heartbeat packet transmission time intervals corresponding to all nodes according to the heartbeat packet transmission time interval coefficients;
the heartbeat packet sending time interval is obtained through the following formula:
Wherein G represents the heartbeat packet sending time interval, G0 represents a preset initial heartbeat packet sending time interval, m represents the total number of nodes, Fi represents the heartbeat packet sending time interval coefficient corresponding to the ith node, Fz represents the intermediate value of the heartbeat packet sending time interval coefficients corresponding to the m nodes, Fcmax represents the maximum difference between the heartbeat packet sending time interval coefficients of every two nodes with a data transmission interactive connection among the m nodes, Fb represents the standard deviation of the heartbeat packet sending time interval coefficients corresponding to the m nodes, and Fcb represents the standard deviation of the differences between the heartbeat packet sending time interval coefficients of every two nodes with a data transmission interactive connection;
Sending heartbeat packets to each node according to the heartbeat packet sending time interval;
And monitoring the node survival rate and the node revival rate corresponding to each heartbeat packet transmission in real time, and adjusting the heartbeat packet transmission time interval according to the node survival rate and the node revival rate.
The technical effect of the technical scheme is that the transmission time interval of the heartbeat packet is dynamically adjusted by comprehensively considering the factors such as the number of nodes, the data transmission rate, the data transmission time length and the like, so that the network can be ensured to be in a connection state, unnecessary heartbeat packet transmission is avoided, and the network communication efficiency is improved. The primary purpose of the heartbeat mechanism is to monitor the survival status of the node. According to the scheme, the node survival rate and the node revival rate are monitored in real time, and the heartbeat packet sending time interval is dynamically adjusted according to the information, so that node faults can be found and processed in time, and the stability and reliability of a network are enhanced. Frequent transmission of heartbeat packets may occupy network resources. According to the scheme, the sending time interval of the heartbeat packet is dynamically adjusted, so that the sending frequency of the heartbeat packet can be reduced on the premise of ensuring the network stability, and the utilization of network resources is optimized. The proposal considers the data transmission rate and the data transmission time length of different nodes, and can dynamically adjust the heartbeat packet sending time interval according to the actual condition of each node. This enables the mechanism to adapt to different network environments, improving the flexibility and adaptability of network communications. By optimizing the sending strategy of the heartbeat packet, network delay and bandwidth occupation can be reduced, thereby improving user experience. Such optimization is particularly important in application scenarios where real-time requirements are high.
In summary, the technical scheme realizes optimization in the aspects of network communication efficiency, stability, resource utilization, network adaptability, user experience and the like by dynamically adjusting the heartbeat packet sending time interval. This helps to improve the overall performance of network communications, meeting the ever-increasing demands of network communications.
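The patented interval-coefficient and interval formulas themselves are given as figures and are not reproduced here; the sketch below only illustrates the shape of the computation — a per-node coefficient derived from observed transfer rates and durations relative to the node's unit-data-volume reference transfer, and a shared interval scaled from those coefficients. The specific weighting is an assumption, not the patented formula.

```python
def interval_coefficient(transfers, base_rate, base_time):
    """Illustrative per-node coefficient (NOT the patented formula): the mean,
    over observed transfers, of the rate and duration ratios against the
    node's unit-data-volume reference transfer.
    transfers: list of (observed_rate, observed_duration)."""
    ratios = [(rate / base_rate) * (base_time / duration)
              for rate, duration in transfers]
    return sum(ratios) / len(ratios)

def heartbeat_interval(g0, coefficients):
    """Illustrative shared interval: the preset initial interval g0 scaled by
    the mean coefficient over all m nodes."""
    mean_f = sum(coefficients) / len(coefficients)
    return g0 * mean_f

# A node twice as fast as its reference transfer in both rate and time
f1 = interval_coefficient([(200.0, 0.5), (100.0, 1.0)],
                          base_rate=100.0, base_time=1.0)
g = heartbeat_interval(5.0, [f1, 1.0])
```

The intent matches the text: faster, more stable nodes earn a larger coefficient and thus a longer interval, reducing unnecessary heartbeat traffic.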
Specifically, the method for monitoring the node survival rate and the node revival rate corresponding to each heartbeat packet sending in real time, and adjusting the heartbeat packet sending time interval according to the node survival rate and the node revival rate includes:
Extracting node survival rate and node revival rate which are correspondingly obtained by sending heartbeat packets each time;
comparing the node survival rate with a preset node survival rate threshold;
When the node survival rate is lower than a preset node survival rate threshold value, the node survival rate and the node revival rate are utilized to adjust the heartbeat packet sending time interval, and the adjusted heartbeat packet sending time interval is obtained;
the adjusted heartbeat packet sending time interval is obtained through the following formula:
Wherein Gt represents the adjusted heartbeat packet sending time interval, G0 represents the preset initial heartbeat packet sending time interval, k represents the number of heartbeat packet transmissions before the node survival rate fell below the preset node survival rate threshold, Pc represents the node survival rate at the transmission where it falls below the preset threshold, Pf represents the node revival rate corresponding to that transmission, Pci represents the node survival rate corresponding to the ith heartbeat packet transmission, Pfi represents the node revival rate corresponding to the ith heartbeat packet transmission, and Pcy represents the preset node survival rate threshold;
and sending the heartbeat packet to each node according to the adjusted heartbeat packet sending time interval.
The technical effect of the technical scheme is that the abnormal conditions in the network, such as node faults or unstable network, can be timely found through monitoring the node survival rate and the node reviving rate in real time. When the node survival rate is lower than a preset threshold value, the node state can be monitored more frequently by adjusting the heartbeat packet sending time interval, so that measures are taken in time to recover the network connectivity, and the reliability of network communication is improved. Under the conditions of stable network and higher node survival rate, the network resource can be saved by reducing the sending frequency of the heartbeat packet. When the node survival rate is reduced, the node fault can be detected more quickly by increasing the sending frequency of the heartbeat packet, and the resource waste on invalid communication is avoided. Such dynamic adjustment mechanisms help optimize the utilization of network resources. According to the scheme, the heartbeat packet sending time interval is dynamically adjusted according to the actual survival rate and the revival rate of the nodes, so that the self-adaptability of the system is embodied. This adaptation allows the system to automatically adjust policies to changes in the network environment, thereby providing more flexibility in coping with various network conditions. By timely detecting and processing node faults, the scheme is beneficial to reducing network interruption and delay and improving user experience. Such optimization is particularly important in application scenarios with high real-time requirements, such as online games, video conferences, etc. By automatically adjusting the heartbeat packet sending time interval, the scheme can reduce the frequency of manual intervention and reduce the cost of network maintenance. 
Meanwhile, the network problems can be found and processed in time, so that downtime and loss caused by network faults can be reduced.
In summary, by dynamically adjusting the heartbeat packet sending time interval, the technical scheme improves the reliability of network communication, optimizes the resource utilization, enhances the self-adaptability of the system, improves the user experience and reduces the maintenance cost. The technical effects enable the scheme to have wide application prospects in the field of network communication.
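The adjustment loop described above — keep the interval while survival is at or above the threshold, shorten it when survival drops, and let a high revival rate soften the shrink — can be sketched as follows. The patented adjustment formula is a figure and is not reproduced; the scaling here is an illustrative assumption that preserves only the described behavior.

```python
def adjust_interval(current, survival_rate, revival_rate,
                    threshold=0.9, floor=0.5):
    """Illustrative interval adjustment (NOT the patented formula):
    below the survival threshold, probe more often; nodes reviving on their
    own soften the shrink so the network is not flooded with heartbeats."""
    if survival_rate >= threshold:
        return current                        # stable network: keep the interval
    shrink = survival_rate / threshold        # < 1: lower survival -> shorter interval
    soften = 1.0 + 0.5 * revival_rate         # recovering nodes need gentler probing
    return max(floor, current * shrink * soften)

g1 = adjust_interval(10.0, survival_rate=0.95, revival_rate=0.1)  # unchanged
g2 = adjust_interval(10.0, survival_rate=0.6, revival_rate=0.0)   # shortened
```

The `floor` bound reflects the text's concern that frequent heartbeats consume network resources: the interval never collapses to zero even under severe failures.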
Aiming at the data mapping of the power calculation task allocation data and the power calculation node connection data in the S5, the network layer fusion is carried out after the data mapping is completed, and the method comprises the following steps:
Confirming key fields in the computing power task allocation data and the computing power node connection data, wherein the key fields of the computing power task allocation data comprise task IDs, allocated node IDs, task types, required resources, priorities and data dependencies;
Invoking a mapping rule from a database, and carrying out data mapping on the computing task allocation data and the computing node connection data according to the mapping rule;
the data mapping flow comprises extracting task information from the computing power task allocation data according to the task ID and the allocated node ID, finding out corresponding node data from the computing power node connection data, and comparing;
When the comparison result meets the mapping rule, mapping the task and the node, and obtaining task mapping data after the mapping is completed;
and when the comparison result does not meet the mapping rule, retrying the mapping task or reevaluating the task allocation strategy from the standby node until the comparison result meets the mapping rule.
The task mapping data are subjected to network layer fusion, before the network layer fusion is carried out, the task mapping data are subjected to fusion data set arrangement, and the fusion data set comprises the steps of extracting relevant information of each task from the mapped task mapping data, and extracting the latest state and resource utilization condition of each node from the computing node connection data;
Updating node data according to task execution states in the fusion data set, wherein the node data comprises the task number, the current load and available resources allocated to each node, and when the load of the node exceeds a preset range, the node data is dynamically adjusted;
and carrying out data set integration on the data after node data updating or dynamic adjustment, and obtaining standard calculation task allocation data after the data set integration.
Specifically, by defining key fields in the computing task allocation data and the computing node connection data, accurate mapping between the data is realized. This helps ensure that tasks are properly assigned to nodes with the proper resources and status. After the data mapping is completed, network layer fusion is carried out, which is helpful for realizing more efficient task execution and resource utilization in calculation power distribution, and the adjustability of the mapping rule enables the system to be customized according to actual requirements, thereby improving the flexibility and accuracy of task distribution. When the comparison result does not meet the mapping rule, the system can re-try the mapping task or re-evaluate the task allocation strategy from the standby node, so that the adaptability and the robustness of the system are enhanced, the fusion data set contains the task information and the latest data of the node state, and the system is helped to grasp the network state in real time, so that a more intelligent decision is made. When the node load exceeds the preset range, the system dynamically adjusts, so that overload and performance degradation are prevented, the resource utilization rate is improved, and standard calculation task allocation data can be generated by integrating the task mapping data and the node data, so that more efficient calculation allocation is realized. The optimized calculation force distribution can reduce resource waste and improve task execution efficiency, thereby meeting the calculation force demands of more users.
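The mapping flow above — extract the task by task ID and assigned node ID, compare against the node's connection data, map on success, and fall back to a standby node otherwise — can be sketched as below. The rule (status "up" plus sufficient available resources) and field names are illustrative assumptions standing in for the database-stored mapping rules.

```python
def map_tasks(task_alloc, node_conn, standby):
    """Map each task to its assigned node when the node satisfies the mapping
    rule; otherwise retry from the first feasible standby node. A None entry
    means the task allocation strategy must be re-evaluated."""

    def rule_ok(task, node):
        # Illustrative mapping rule: node is up and has the required resources.
        return node["status"] == "up" and node["available"] >= task["required"]

    mapping = {}
    for task in task_alloc:
        node = node_conn.get(task["node_id"])
        if node and rule_ok(task, node):
            mapping[task["task_id"]] = task["node_id"]
            continue
        fallback = next((n for n in standby if rule_ok(task, node_conn[n])), None)
        mapping[task["task_id"]] = fallback
    return mapping

nodes = {"n1": {"status": "up", "available": 8},
         "n2": {"status": "down", "available": 8},
         "n3": {"status": "up", "available": 4}}
mapping = map_tasks(
    [{"task_id": "t1", "node_id": "n1", "required": 4},
     {"task_id": "t2", "node_id": "n2", "required": 4}],
    nodes, standby=["n3"])
# t1 maps to its assigned node; t2's node is down, so t2 falls back to n3
```

Network layer fusion would then merge this mapping with the latest node state and resource usage into the standard computing task allocation data.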
In order to solve the problem that in the prior art, when a receiving task performs task implementation, real-time monitoring is not performed on an implementation process, so that abnormal data in the implementation process cannot be quickly pre-warned, referring to fig. 1 and 2, the embodiment provides the following technical scheme:
A heterogeneous computing force fusion and dynamic optimization distribution system comprising:
A task execution monitoring unit for:
when the standard calculation task allocation data is used for carrying out task implementation, the standard calculation task allocation data is monitored in real time;
Before real-time monitoring is carried out on the standard calculation task allocation data, monitoring indexes are confirmed, wherein the monitoring indexes comprise task states, execution progress, resource use conditions, node loads and abnormal events;
In the process of carrying out tasks by standard calculation task allocation data, monitoring the monitoring index of each node in real time;
real-time data analysis of standard calculation task allocation data is carried out according to the monitoring index of each node;
performing early warning prompt according to the analyzed real-time data condition, wherein the early warning prompt is that the threshold value of the real-time data exceeds a preset range, and marking the real-time data exceeding the preset range as early warning data;
the early warning data is generated by adjusting reports according to the abnormality degree;
and finally, transmitting the early warning data and the adjustment report to a display terminal for visual display.
Specifically, by monitoring the standard calculation task distribution data in real time, the system can acquire key information such as task state, execution progress, resource use condition and the like in real time, so that effective management and scheduling of calculation resources are realized. The early warning mechanism is beneficial to identifying and coping with possible risks in advance, preventing the problem from being enlarged, generating early warning data and adjustment reports, providing timely decision support for management staff, helping the management staff to take measures quickly, reducing the risks, and enabling the early warning data and the adjustment reports to be visually displayed through the display terminal, so that the management staff can intuitively know the running state of the system and the existing problems. The visual display mode not only improves the readability of information, but also reduces the understanding and operation difficulty, so that a manager can make decisions faster, and can conduct real-time data analysis on the monitoring index of each node, which is beneficial to realizing finer calculation power resource management. By analyzing the data, the problems of unbalanced resource use, excessive node load and the like can be found, so that powerful support is provided for optimizing calculation power distribution and improving resource utilization rate, and a real-time monitoring and early warning mechanism is helpful for timely finding and repairing faults and anomalies in the system, so that the stability and reliability of the system are improved. The mechanism can also reduce the loss caused by system breakdown or data loss and the like, and ensure the continuity of service and the integrity of data.
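The early-warning step of the task execution monitoring unit — flag any monitored indicator whose real-time value leaves its preset range — can be sketched as below. The `(low, high)` range format and severity labels are illustrative assumptions.

```python
def mark_warnings(metrics, limits):
    """Flag every monitored indicator whose value falls outside its preset
    (low, high) range; the flagged entries form the early-warning data."""
    warnings = {}
    for name, value in metrics.items():
        low, high = limits[name]
        if not (low <= value <= high):
            warnings[name] = {
                "value": value,
                "range": (low, high),
                "severity": "high" if value > high else "low",
            }
    return warnings

alerts = mark_warnings(
    {"node_load": 0.97, "progress": 0.4, "mem_use": 0.1},
    {"node_load": (0.0, 0.9), "progress": (0.0, 1.0), "mem_use": (0.2, 0.9)},
)
# node_load exceeds its ceiling, mem_use is below its floor; progress is normal
```

The resulting early-warning data, together with an adjustment report graded by the degree of abnormality, would be pushed to the display terminal for visualization.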
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made hereto without departing from the spirit and principles of the present invention.

Claims (9)

1.异构算力融合与动态优化分配方法,其特征在于,包括如下步骤:1. A method for integrating heterogeneous computing power and dynamically optimizing its allocation, characterized by comprising the following steps:S1:中心计算能力确认:将采集接口中的硬件配置信息进行收集,并将收集的硬件配置信息进行处理和关键参数提取,根据提取的关键参数进行基准测试,基准测试后得到实时计算能力数据;S1: Confirmation of central computing power: Collect hardware configuration information from the acquisition interface, process the collected hardware configuration information and extract key parameters, perform benchmark testing based on the extracted key parameters, and obtain real-time computing power data after the benchmark testing;S2:负载均衡策略制定:根据实时计算能力数据与标准计算能力数据进行负载情况分析,根据负载情况进行负载均衡策略制定,负载均衡策略制定完成后得到实时计算能力策略数据;S2: Load balancing strategy formulation: Load analysis is performed based on real-time computing power data and standard computing power data. A load balancing strategy is formulated based on the load situation. After the load balancing strategy is formulated, the real-time computing power strategy data is obtained.S3:节点任务分配:将接收的任务请求进行任务需求分析,利用实时计算能力策略数据将分析的任务进行节点的任务分配,节点的任务分配完成后得到算力任务分配数据;S3: Node task allocation: Analyze the task requirements of the received task requests and use the real-time computing power strategy data to allocate the analyzed tasks to the nodes. After the node task allocation is completed, the computing power task allocation data is obtained.S4:节点信息协同处理:将算力任务分配数据中的每个节点进行节点通信机制建立,通信机制建立完成后得到算力节点连接数据;S4: Node information collaborative processing: Establish a node communication mechanism for each node in the computing power task allocation data. After the communication mechanism is established, the computing power node connection data is obtained;S5:处理任务网络融合管理:将算力任务分配数据与算力节点连接数据进行数据映射,数据映射完成后进行网络层融合,网络层融合后得到标准算力任务分配数据;S5: Processing task network fusion management: Map the computing power task allocation data with the computing power node connection data. 
After the data mapping is completed, perform network layer fusion to obtain the standard computing power task allocation data.针对S5中将算力任务分配数据与算力节点连接数据进行数据映射,数据映射完成后进行网络层融合,包括:In S5, the computing task allocation data and computing node connection data are mapped. After the data mapping is completed, network layer fusion is performed, including:将算力任务分配数据与算力节点连接数据中的关键字段进行确认,其中,算力任务分配数据的关键字段包括任务ID、分配的节点ID、任务类型、所需资源、优先级和数据依赖性;算力节点连接数据的关键字段包括节点ID、可用资源数据、当前负载状态、节点状态和通信信息;Verify the key fields in the computing task allocation data and computing node connection data. The key fields in the computing task allocation data include task ID, assigned node ID, task type, required resources, priority, and data dependencies; the key fields in the computing node connection data include node ID, available resource data, current load status, node status, and communication information.从数据库中调取映射规则,根据映射规则进行算力任务分配数据与算力节点连接数据的数据映射;Retrieve mapping rules from the database and perform data mapping between computing task allocation data and computing node connection data according to the mapping rules;数据映射流程包括根据任务ID和分配的节点ID从算力任务分配数据中提取出任务信息,再从算力节点连接数据中找到对应的节点数据,并进行对比;The data mapping process involves extracting task information from the computing task allocation data based on the task ID and the assigned node ID, then finding the corresponding node data from the computing node connection data and comparing them.当比对结果满足映射规则后,将任务与节点进行映射,映射完成后得到任务映射数据;When the comparison result meets the mapping rules, the task is mapped to the node, and the task mapping data is obtained after the mapping is completed;当比对结果未满足映射规则后,从备用节点中重新尝试映射任务或重新评估任务分配策略,直至比对结果满足映射规则。When the comparison result does not meet the mapping rules, the mapping task is retried from the standby node or the task allocation strategy is re-evaluated until the comparison result meets the mapping rules.2.根据权利要求1所述的异构算力融合与动态优化分配方法,其特征在于,针对S1步骤中将采集接口中的硬件配置信息进行收集,并将收集的硬件配置信息进行处理和关键参数提取,根据提取的关键参数进行基准测试,包括:2. 
The heterogeneous computing power fusion and dynamic optimization allocation method according to claim 1, characterized in that, in step S1, collecting the hardware configuration information from the acquisition interface, processing the collected hardware configuration information and extracting key parameters, and running benchmark tests on the extracted key parameters comprises:
the hardware configuration information in the acquisition interface comprises processor information, memory information, storage device information, network interface information, computing unit information and resource usage information;
after the hardware configuration information is collected, the key parameters of the hardware configuration information are extracted;
wherein the key parameters of the processor information comprise core count, thread count, base frequency, maximum boost frequency and cache size; the key parameters of the memory information comprise total capacity, type, speed and timings; the key parameters of the storage device information comprise type, capacity, read/write speed and interface type; the key parameters of the network interface information comprise type, speed, MAC address and IP address; the key parameters of the computing unit comprise model, CUDA core count, video memory size, video memory type and compute capability; the key parameters of the resource usage information comprise CPU usage, memory usage, disk frequency, disk speed and network traffic;
benchmark tests are run after the key parameters of the hardware configuration information are extracted;
the test results are confirmed from the benchmark results, and the confirmed test results are used to generate the real-time computing capability data.
3. The heterogeneous computing power fusion and dynamic optimization allocation method according to claim 2, characterized in that, in step S2, analyzing the load situation from the real-time computing capability data and the standard computing capability data, and formulating a load balancing strategy according to the load situation, comprises:
retrieving the standard computing capability data from the database, the standard computing capability data being standard expected performance indicators or statistical averages of historical performance data;
comparing the real-time computing capability data against the performance indicators with the same attributes in the standard computing capability data, and calculating a difference value for each performance indicator from the comparison result;
analyzing the load situation according to the difference values;
wherein the load situation is divided into normal load, light load and heavy load: a difference-value percentage within 80%-100% indicates normal load, a difference-value percentage below 80% indicates light load, and a difference-value percentage above 100% indicates heavy load;
constructing a load analysis matrix from the analyzed load situation, and marking load balancing operations according to the constructed load matrix;
formulating the load balancing strategy based on the load balancing operations marked in the load analysis matrix.
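As an illustrative, non-authoritative sketch of the load classification described in claim 3 above, the 80%/100% thresholds can be expressed directly in code; the sketch assumes the "difference-value percentage" is the real-time indicator expressed as a percentage of its standard value, which the claims do not spell out:

```python
# Illustrative sketch only (not part of the claims): classify one performance
# indicator into the three load levels of claim 3. Assumption: the
# "difference-value percentage" is the real-time value expressed as a
# percentage of the standard value.

def load_level(realtime: float, standard: float) -> str:
    """Return 'light', 'normal' or 'heavy' per the 80%/100% thresholds."""
    pct = realtime / standard * 100.0
    if pct < 80.0:
        return "light"
    if pct <= 100.0:
        return "normal"
    return "heavy"


if __name__ == "__main__":
    print(load_level(72, 100))  # 72% of standard -> light
```

A per-indicator result of this kind is what would populate one cell of the load analysis matrix mentioned in the claim.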
The load balancing strategies comprise task migration, traffic regulation and resource reallocation;
after the load balancing strategy is formulated, the real-time computing capability strategy data is generated.
4. The heterogeneous computing power fusion and dynamic optimization allocation method according to claim 3, characterized in that, in step S3, analyzing the task requirements of the received task requests and allocating the analyzed tasks to nodes using the real-time computing capability strategy data comprises:
when the computing power network receives a task request from an upper-layer application, parsing the task requirements of the task request;
the parsed task requirements comprise task type, resource requirements, execution time, priority and data dependencies;
performing matching computation on the task requirement data with a task matching algorithm, and obtaining a matching plan for the task requirement data after the computation is completed;
organizing the matching data in the real-time computing capability strategy data, the matching data comprising node ID, available resources, current load and node status;
assigning the matching plan to the least-loaded node in the matching data with a greedy algorithm;
when a node in the matching data is overloaded, redistributing part of its tasks to nodes with lower load through dynamic load balancing;
obtaining the computing power task allocation data after the matching plan is assigned to the matching data.
5.
The heterogeneous computing power fusion and dynamic optimization allocation method according to claim 4, characterized in that establishing a node communication mechanism for each node in the computing power task allocation data in S4 comprises:
identifying the node ID of each node in the computing power task allocation data;
selecting a communication protocol according to the node ID, the communication protocols comprising TCP, UDP and MPI;
after the communication protocol is selected, confirming the parameters of the selected protocol, the parameters comprising port number, communication mode and data transmission format;
after the protocol parameters are confirmed, establishing communication connections between the nodes, the connections comprising heartbeat-mechanism connections or bidirectional connections;
a heartbeat-mechanism connection sends heartbeat packets to each node at fixed intervals and confirms the liveness of each node from the heartbeat packet parameters; a bidirectional connection establishes two-way real-time transfer of tasks and data between nodes;
after the communication connections are confirmed, the connection data between the received tasks and the computing power nodes in the computing network is obtained.
6.
The heterogeneous computing power fusion and dynamic optimization allocation method according to claim 5, characterized in that setting the heartbeat packet sending interval corresponding to the heartbeat mechanism comprises:
extracting the total number of nodes;
extracting, for each node, the data transmission duration of a data packet of unit data volume, wherein the unit data volume ranges from 10MB to 100MB;
extracting the data transmission rate corresponding to each node;
under the node's current data transmission environment, obtaining each node's heartbeat packet sending interval coefficient from the node's data transmission duration for a unit-data-volume packet and the node's data transmission rate;
wherein the heartbeat packet sending interval coefficient of each node is obtained by the following formula, wherein F denotes the heartbeat packet sending interval coefficient of each node; n denotes the number of data transmissions of each node; Bi denotes the data transmission rate of the node's i-th data transmission; Be denotes the data transmission rate of a unit-data-volume packet received by the node; Ti denotes the data transmission duration of the node's i-th data transmission; Te denotes the data transmission duration of a unit-data-volume packet at the node; xi denotes the number of unit data volumes contained in the data volume of the node's i-th data transmission; Ts denotes the theoretical transmission duration of a unit-data-volume packet at the maximum data transmission speed;
obtaining the heartbeat packet sending interval of all nodes from the heartbeat packet sending interval coefficients;
wherein the heartbeat packet sending interval is obtained by the following formula, wherein G denotes the heartbeat packet sending interval; G0 denotes the preset initial heartbeat packet sending interval; m denotes the total number of nodes; Fi denotes the heartbeat packet sending interval coefficient of the i-th node; Fz denotes the median of the interval coefficients of the m nodes; Fcmax denotes the maximum difference between the interval coefficients of every two data-transmission-interconnected nodes among the m nodes; Fb denotes the standard deviation of the interval coefficients of the m nodes; Fcb denotes the standard deviation of the interval-coefficient differences between every two data-transmission-interconnected nodes;
sending a heartbeat packet to each node at the heartbeat packet sending interval;
monitoring in real time the node survival rate and node revival rate of each heartbeat packet transmission, and adjusting the heartbeat packet sending interval according to the node survival rate and node revival rate.
7. The heterogeneous computing power fusion and dynamic optimization allocation method according to claim 6, characterized in that monitoring in real time the node survival rate and node revival rate of each heartbeat packet transmission, and adjusting the heartbeat packet sending interval according to them, comprises:
extracting the node survival rate and node revival rate obtained for each heartbeat packet transmission;
comparing the node survival rate with a preset node survival rate threshold;
when the node survival rate is below the preset node survival rate threshold, adjusting the heartbeat packet sending interval using the node survival rate and node revival rate to obtain the adjusted heartbeat packet sending interval;
wherein the adjusted heartbeat packet sending interval is obtained by the following formula, wherein Gt denotes the adjusted heartbeat packet sending interval; G denotes the heartbeat packet sending interval; G0 denotes the preset initial heartbeat packet sending interval; k denotes the number of heartbeat packet transmissions before the node survival rate fell below the preset threshold; Pc denotes the node survival rate that fell below the preset threshold; Pf denotes the node revival rate when the node survival rate fell below the preset threshold; Pci denotes the node survival rate of the i-th heartbeat packet transmission; Pfi denotes the node revival rate of the i-th heartbeat packet transmission; Pcy denotes the preset node survival rate threshold;
sending a heartbeat packet to each node at the adjusted heartbeat packet sending interval.
8. The heterogeneous computing power fusion and dynamic optimization allocation method according to claim 7, characterized in that mapping the computing power task allocation data to the computing power node connection data in S5, and performing network layer fusion after the mapping is completed, further comprises:
performing network layer fusion on the task mapping data, and before the fusion, organizing the task mapping data into a fusion data set, the fusion data set comprising the relevant information of each task extracted from the mapped task mapping data, and the latest status and resource utilization of each node extracted from the computing power node connection data;
updating the node data according to the task execution status in the fusion data set, the node data comprising the number of tasks assigned to each node, the current load and the available resources, wherein dynamic adjustment is performed when a node's load exceeds the preset range;
integrating the updated or dynamically adjusted node data into a data set, and obtaining the standard computing power task allocation data after the integration.
9.
A heterogeneous computing power fusion and dynamic optimization allocation system, applied in the heterogeneous computing power fusion and dynamic optimization allocation method according to any one of claims 1-8, characterized by comprising:
a task execution monitoring unit, configured to:
monitor the standard computing power task allocation data in real time while the tasks in the standard computing power task allocation data are executed;
before the real-time monitoring, confirm the monitoring indicators, the monitoring indicators comprising task status, execution progress, resource usage, node load and abnormal events;
during execution of the tasks in the standard computing power task allocation data, monitor the monitoring indicators of each node in real time;
perform real-time data analysis of the standard computing power task allocation data according to the monitoring indicators of each node;
issue early-warning prompts from the analyzed real-time data, wherein when a real-time data value exceeds its preset threshold range, the out-of-range real-time data is marked as warning data;
generate an adjustment report from the warning data according to the degree of abnormality;
finally, transmit the warning data and the adjustment report to the display terminal for visual display.
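The greedy placement recited in claim 4 (assign each matching plan to the least-loaded node, then redistribute work away from overloaded nodes) can be sketched as follows; the `Node` record, integer demand units, the 0.9 default overload ratio and the move-last-task rebalancing policy are illustrative assumptions rather than the patented implementation:

```python
# Illustrative sketch only (not part of the claims): greedy least-loaded
# task placement with simple dynamic rebalancing, per claim 4.
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    capacity: int                 # total resource units (assumed model)
    load: int = 0                 # resource units currently in use
    tasks: list = field(default_factory=list)  # (task_id, demand) pairs


def assign(task_id: str, demand: int, nodes: list) -> Node:
    """Greedily place a task on the node with the smallest current load."""
    target = min(nodes, key=lambda n: n.load)
    target.load += demand
    target.tasks.append((task_id, demand))
    return target


def rebalance(nodes: list, overload_ratio: float = 0.9) -> None:
    """Move tasks off any node whose load exceeds the overload ratio."""
    for node in nodes:
        while node.tasks and node.load > node.capacity * overload_ratio:
            task_id, demand = node.tasks.pop()
            node.load -= demand
            # Re-place the task among the remaining nodes.
            assign(task_id, demand, [n for n in nodes if n is not node])
```

On a tie, `min` picks the first node in list order; a production scheduler would also filter candidates by the node status and available-resource fields named in the claim before calling `assign`.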
CN202411880942.3A | 2024-12-19 | 2024-12-19 | Heterogeneous computing power integration and dynamic optimization allocation method and system | Active | CN119847736B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202411880942.3A | CN119847736B (en) | 2024-12-19 | 2024-12-19 | Heterogeneous computing power integration and dynamic optimization allocation method and system

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202411880942.3A | CN119847736B (en) | 2024-12-19 | 2024-12-19 | Heterogeneous computing power integration and dynamic optimization allocation method and system

Publications (2)

Publication Number | Publication Date
CN119847736A (en) | 2025-04-18
CN119847736B (en) | 2025-08-26

Family

ID=95372033

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202411880942.3A | Active | CN119847736B (en) | 2024-12-19 | 2024-12-19 | Heterogeneous computing power integration and dynamic optimization allocation method and system

Country Status (1)

CountryLink
CN (1) | CN119847736B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN120407654A (en) * | 2025-04-24 | 2025-08-01 | 广西计算中心有限责任公司 | A real-time collection and fusion method for multi-source heterogeneous toll collection data on highways

Citations (2)

Publication number | Priority date | Publication date | Assignee | Title
CN103227838A (en) * | 2013-05-10 | 2013-07-31 | 中国工商银行股份有限公司 | Multi-load equalization processing device and method
CN112231075A (en) * | 2020-09-07 | 2021-01-15 | 武汉市九格合众科技有限责任公司 | Server cluster load balancing control method and system based on cloud service

Family Cites Families (3)

Publication number | Priority date | Publication date | Assignee | Title
US9372854B2 (en) * | 2010-11-08 | 2016-06-21 | Hewlett Packard Enterprise Development Lp | Load balancing backup jobs in a virtualized storage system having a plurality of physical nodes
CN118656181A (en) * | 2023-11-21 | 2024-09-17 | 中科南京信息高铁研究院 | An efficient high-throughput computing task scheduling method
CN117749730A (en) * | 2023-12-26 | 2024-03-22 | 中国移动通信集团广东有限公司 | Distributed computing power network scheduling method and system


Also Published As

Publication number | Publication date
CN119847736A (en) | 2025-04-18

Similar Documents

Publication | Publication Date | Title
CN119847736B (en) | Heterogeneous computing power integration and dynamic optimization allocation method and system
CN102404399B (en) | Fuzzy dynamic allocation method for cloud storage resource
CN111600967B (en) | Access pressure-based load balancing system among block chain nodes
CN107644340A (en) | Risk Identification Method, client device and risk recognition system
CN104410662A (en) | Parallel mass data transmitting middleware of Internet of things and working method thereof
CN106201754A (en) | Mission bit stream analyzes method and device
EP3361703B1 (en) | Load balancing method, related device and system
CN114070707A (en) | Internet performance monitoring method and system
CN117370138A (en) | High capacity distributed storage system
CN117978806B (en) | A method for implementing cross-cluster resource scheduling
CN119030930A (en) | A computer network management method
CN117221088A (en) | Computer network intensity detection system and device
CN107168853A (en) | A kind of server performance information acquisition method, system and substrate control manager
CN120321136B (en) | Intelligent optimization and resource dynamic reallocation method and system based on graph neural network
CN119276943A (en) | Edge node scheduling method and system based on sinking distributed network
CN118233522A (en) | Service request optimization system and method based on digital twin
CN119361055A (en) | A cloud-based postpartum health data management system
CN117478582A (en) | Oil field production equipment state monitoring and optimizing method based on intelligent link management system
CN110365537A (en) | Middleware business fault handling method and system
CN111092827A (en) | A kind of power communication network resource allocation method and device
CN112783637B (en) | Resource regulation and control method and device
CN117729204A (en) | K8S container scheduling method and system based on monitoring perception
CN112073223B (en) | A system and method for controlling the operation of cloud computing terminals and cloud servers
CN116208562B (en) | Link resource allocation method and device, electronic equipment and storage medium
CN119847856A (en) | Computer cluster maintenance task management method and system

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
