
Task scheduling method and related system

Info

Publication number
CN119862000A
Authority
CN
China
Prior art keywords
operator
operators
scheduling
group
groups
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311370341.3A
Other languages
Chinese (zh)
Inventor
厍斌
胡智恒
徐晓忻
王均松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Huawei Technology Co Ltd
Original Assignee
Chengdu Huawei Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Huawei Technology Co Ltd
Priority to CN202311370341.3A
Publication of CN119862000A
Legal status: Pending

Abstract

The present application provides a task scheduling method applied to the field of artificial intelligence (AI) technology. The method comprises: obtaining a computation graph, where the computation graph includes a plurality of operators and each operator represents a task; sorting the plurality of operators according to their dependency relationships in the computation graph to obtain an ordered operator sequence, where the ordered operator sequence includes a first operator and a second operator that depends on the first operator, the first operator being arranged before the second operator; grouping the plurality of operators according to the ordered operator sequence to obtain a plurality of operator groups; and sequentially scheduling the plurality of operator groups to a device according to the priorities of the plurality of operator groups. By issuing tasks at operator-group granularity over the ordered operator groups, the method achieves effective scheduling under limited concurrency of the computation graph, avoids the reduction of memory multiplexing probability and cache hit rate caused by excessive concurrency, and improves the computing efficiency and performance of the AI model.

Description

Task scheduling method and related system
Technical Field
The present application relates to the field of artificial intelligence (AI) technology, and in particular, to a task scheduling method, system, computing device cluster, chip, computer-readable storage medium, and computer program product.
Background
With the rapid development of artificial intelligence (AI) technology, AI models are increasingly used in a variety of fields, such as image recognition, intelligent question answering, and recommendation systems. To ensure the accuracy and efficiency of AI model predictions, an AI model typically needs to be trained with a large amount of data to optimize its parameters. In addition, the inference process of an AI model requires fast and accurate processing and parsing of new input data to produce prediction results or make decisions. How to optimize the performance of AI models and improve the efficiency and accuracy of training or inference has therefore become a major concern in the industry.
Currently, the mainstream solution for optimizing AI model performance is to optimize task scheduling over the computational graph. The computational graph, as the primary data structure of an AI model, describes the dependencies between tasks: each node in the computational graph represents a task, and each edge represents a data dependency between tasks. The execution order of the tasks can be arranged reasonably according to these data dependencies, thereby optimizing the performance of the AI model.
However, existing task scheduling methods have difficulty meeting the high demands of AI models for computational efficiency and performance.
Disclosure of Invention
The present application provides a task scheduling method in which a plurality of operators in a computational graph are first sorted according to their dependency relationships and then grouped, and tasks are issued at operator-group granularity, so that effective scheduling under limited concurrency of the computational graph is achieved, the reduction of memory multiplexing probability and cache hit rate caused by excessive concurrency is avoided, and the computing efficiency and performance of the AI model are improved. The present application also provides a task scheduling system, a computing device cluster, a chip, a computer-readable storage medium, and a computer program product corresponding to the method.
In a first aspect, the present application provides a task scheduling method. The method may be performed by a task scheduling system. The task scheduling system may be a software system that may be part of the AI software framework or as a separate computing library for the developer to download and integrate into the application developed by the developer. The software system described above may be deployed in a chip or cluster of computing devices that execute program code of the software system to perform the task scheduling method of the present application. The task scheduling system may also be a hardware system, for example, a chip or a computing device cluster with task scheduling functions, where the hardware system executes the task scheduling method of the present application.
Specifically, the task scheduling system may obtain a computation graph, where the computation graph includes a plurality of operators, each operator in the plurality of operators represents a task, and then the task scheduling system may sort the plurality of operators according to a dependency relationship of the plurality of operators in the computation graph, to obtain an ordered operator sequence. For operators with dependent items in the ordered operator sequence, the dependent items precede the operators. For example, the ordered operator sequence includes a first operator and a second operator dependent on the first operator, the first operator being arranged before the second operator. The task scheduling system may then group the plurality of operators according to the ordered operator sequence to obtain a plurality of operator groups, and the task scheduling system may sequentially schedule the plurality of operator groups to a device (a device for executing an operator, e.g., an accelerator card, etc.) according to priorities of the plurality of operator groups.
In this method, ordering the operators ensures that subsequent operators can safely use the computation results of earlier operators, guaranteeing the correctness of the whole computation process. The ordered operators are then grouped, and tasks are issued at operator-group granularity, which achieves effective scheduling under limited concurrency of the computational graph, avoids the reduction of memory multiplexing probability and cache hit rate caused by excessive concurrency, and improves the computing efficiency and performance of the AI model.
In some possible implementations, the plurality of operator groups includes a first operator group and a second operator group, the first operator group having a higher priority than the second operator group. Accordingly, the task scheduling system may schedule the operators in the first operator group to the device first, and schedule the operators in the second operator group to the device when the operator scheduling in the first operator group is completed.
According to the method, the operator groups with low priority are scheduled after the operator groups with high priority are scheduled, so that effective scheduling under limited concurrency of a computational graph is realized, memory multiplexing probability reduction and cache hit rate reduction caused by excessive concurrency are avoided, and the calculation efficiency and performance of an AI model are improved.
In some possible implementations, the task scheduling system may schedule the plurality of operator groups to the device sequentially, according to their priorities, via a priority queue. Specifically, the task scheduling system may add operators in a ready state in the plurality of operator groups to the priority queue, where the operators in the ready state include operators that have no dependencies and operators whose dependencies have all finished executing. The task scheduling system may take a target operator out of the priority queue and schedule it to the device, where the target operator has a higher priority than the remaining operators in the priority queue; for example, the target operator may be the operator with the highest priority in the priority queue. When execution of the target operator is completed, the task scheduling system checks whether the operators relied on by each subsequent operator of the target operator have all finished executing; if so, the subsequent operator is added to the priority queue, and a new operator is then taken from the priority queue for scheduling.
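The mechanism above can be pictured with a short sketch. The following Python fragment is a minimal, hedged illustration (the names `priority`, `deps`, and `issue` are assumptions for illustration, not an API of the application): ready operators enter a priority queue, the highest-priority one is taken out and scheduled, and successors whose dependencies have all completed are enqueued in turn.

```python
import heapq

def schedule(ops, priority, deps, issue):
    """Minimal sketch of priority-queue scheduling (illustrative names only).
    `deps[op]` lists the operators op depends on; `issue(op)` is an assumed
    callback that dispatches op to the device and returns when it completes."""
    remaining = {op: len(deps.get(op, [])) for op in ops}  # unfinished deps
    succs = {op: [] for op in ops}
    for op in ops:
        for d in deps.get(op, []):
            succs[d].append(op)
    # Operators with no dependencies are ready at the start.
    heap = [(-priority[op], op) for op in ops if remaining[op] == 0]
    heapq.heapify(heap)  # max-priority behavior via negated keys
    while heap:
        _, op = heapq.heappop(heap)  # highest-priority ready operator
        issue(op)
        for s in succs[op]:  # check which successors have just become ready
            remaining[s] -= 1
            if remaining[s] == 0:
                heapq.heappush(heap, (-priority[s], s))
```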
In this method, ready operators are added to the priority queue, and task scheduling is then carried out at operator-group granularity under the control of the priority queue, realizing limited concurrency of the computational graph and improving the computing efficiency and performance of the AI model.
In some possible implementations, the task scheduling system may obtain the group identifier of the operator group to which the target operator belongs, and schedule the target operator to the device when that group identifier equals the group identifier of the currently scheduled operator group. By checking the group identifier of the operator group to which each operator in the priority queue belongs, all operators of one operator group are issued before the operators of the next operator group, preventing excessive concurrency from reducing the memory multiplexing probability and the cache hit rate.
In some possible implementations, the task scheduling system may also update the number of operators still to be scheduled in the currently scheduled operator group, and switch the currently scheduled operator group to the next operator group when that number reaches 0. In this way, the concurrency of the operator scheduling process is controlled: the next operator group is scheduled only after the currently scheduled operator group has been fully scheduled, which avoids the reduction of memory multiplexing probability and cache hit rate caused by excessive concurrency and improves the computing efficiency and performance of the AI model.
In some possible implementations, the task scheduling system may sort the plurality of operators according to their dependency relationships in the computation graph through a critical path strategy or an inverse depth-first search strategy to obtain an ordered operator sequence. The core of the critical path strategy is to find the longest path length from each operator to the output node and use it to assign a priority (Priority) to each operator. The priority refers to the execution priority level of each operator in the computational graph; operators with higher priority obtain computing resources first. When the demand or target is performance first, the task scheduling system can sort the operators through the critical path strategy to obtain an ordered operator sequence; in this way, nodes with higher priority are traversed earlier while the execution order of operators with dependency relationships is preserved, guaranteeing the performance of the AI model. When the demand or target is memory first, the task scheduling system can sort the operators through the inverse depth-first search strategy to obtain an ordered operator sequence. The core of the inverse depth-first search strategy is to first invert the computation graph so that all edges are reversed, then perform a depth-first search on the inverted computation graph, and invert the obtained traversal order to obtain the topological order of the original graph. In this way, the memory multiplexing rate is preserved as far as possible while the execution order of operators with dependency relationships is guaranteed.
In some possible implementations, the task scheduling system may group the plurality of operators according to the ordered operator sequence through a uniform grouping strategy to obtain a plurality of operator groups. Specifically, the task scheduling system may set a capacity for the operator groups, which may be a constant N representing the number of operators in one operator group, and group the operators in the ordered operator sequence according to this capacity. Uniform grouping is relatively simple and easy to implement, and can reduce the complexity of task scheduling.
In some possible implementations, the task scheduling system may also obtain at least one of the computational complexity, memory consumption, or load characteristics of the plurality of operators, and then group the operators according to the ordered operator sequence in combination with these characteristics to obtain a plurality of operator groups. Memory consumption includes the memory size occupied by the input/output (IO) of an operator; load characteristics describe the load type of an operator, e.g., computation-intensive or IO-intensive. In some examples, the task scheduling system may obtain the memory consumption of the operators, and when the memory occupied by the inputs and outputs (e.g., input tensors and output tensors) of M consecutive operators exceeds the capacity C of the memory, the task scheduling system may place the next operator in a new operator group. In other examples, the task scheduling system may obtain the load characteristics of the operators and take them into account when grouping, so that computation-intensive operators and IO-intensive operators are balanced within the same operator group and the execution efficiency of the operators in the group is not degraded.
According to the method, the operator grouping is carried out by combining the calculation complexity, the memory consumption or the load characteristics of a plurality of operators, so that the accuracy of the operator grouping can be improved, and the performance and the efficiency of the AI model are improved.
In some possible implementations, the task scheduling system may further obtain a scheduling result and a performance value corresponding to the scheduling result, and then update the scheduling parameter according to the scheduling result and the performance value. Wherein the scheduling parameters include at least one of packet parameters or ordering parameters. Therefore, the subsequent task scheduling is optimized through the historical scheduling data, so that the task scheduling system can be better adapted to different tasks and running environments, and the performance of the subsequent task scheduling is improved.
In a second aspect, the present application provides a task scheduling system. The system comprises:
an acquisition module, configured to obtain a computation graph, where the computation graph includes a plurality of operators, and each operator in the plurality of operators represents a task;
a sorting module, configured to sort the plurality of operators according to the dependency relationship of the plurality of operators in the computation graph to obtain an ordered operator sequence, where the ordered operator sequence includes a first operator and a second operator depending on the first operator, and the first operator is arranged before the second operator;
a grouping module, configured to group the plurality of operators according to the ordered operator sequence to obtain a plurality of operator groups; and
a scheduling module, configured to sequentially schedule the plurality of operator groups to a device according to the priorities of the plurality of operator groups.
In some possible implementations, the plurality of operator groups includes a first operator group and a second operator group, the first operator group having a priority higher than a priority of the second operator group, and the scheduling module is specifically configured to:
Scheduling operators in the first operator group to a device;
and when the operator scheduling in the first operator group is completed, scheduling the operators in the second operator group to the equipment.
In some possible implementations, the scheduling module is specifically configured to:
adding operators in a ready state in the plurality of operator groups to a priority queue, where the operators in the ready state include operators that have no dependencies or operators whose dependencies have all finished executing;
Taking out a target operator from the priority queue, and scheduling the target operator to equipment, wherein the priority of the target operator is higher than that of the rest operators in the priority queue;
And when the execution of the target operator is completed, checking whether the execution of operators relied on by subsequent operators of the target operator is completed, and if so, adding the subsequent operators to the priority queue to continue to take out new operators from the priority queue for scheduling.
In some possible implementations, the scheduling module is specifically configured to:
acquiring a group identifier of an operator group to which a target operator belongs;
and when the group identifier of the operator group to which the target operator belongs is equal to the group identifier of the currently scheduled operator group, scheduling the target operator to the equipment.
In some possible implementations, the scheduling module is further configured to:
Updating the number of operators to be scheduled in the current scheduled operator group;
and when the number of operators to be scheduled is equal to 0, switching the currently scheduled operator group into the next operator group.
In some possible implementations, the sorting module is specifically configured to:
and sequencing the operators according to the dependency relationship of the operators in the computational graph through a critical path strategy or an inverse depth-first search strategy to obtain an ordered operator sequence.
In some possible implementations, the grouping module is specifically configured to:
And grouping the operators according to the ordered operator sequence through a uniform grouping strategy to obtain a plurality of operator groups.
In some possible implementations, the grouping module is specifically configured to:
acquiring at least one of computational complexity, memory consumption or load characteristics of the operators;
And grouping the operators according to the ordered operator sequence and combining at least one of computational complexity, memory consumption or load characteristics of the operators to obtain a plurality of operator groups.
In some possible implementations, the scheduling module is further configured to:
Acquiring a scheduling result and a performance value corresponding to the scheduling result;
And updating a scheduling parameter according to the scheduling result and the performance value, wherein the scheduling parameter comprises at least one of a grouping parameter or a sorting parameter.
In a third aspect, the present application provides a cluster of computing devices. The cluster of computing devices includes at least one computing device including at least one processor and at least one memory. The at least one processor and the at least one memory are in communication with each other. The at least one processor is configured to execute instructions stored in the at least one memory to cause a computing device or cluster of computing devices to perform the task scheduling method as described in the first aspect or any implementation manner of the first aspect.
In a fourth aspect, the present application provides a chip. The chip includes a processor, which may be a host-side central processor, and a memory. The processor executes the computer readable instructions to cause the chip to perform the task scheduling method as described in the first aspect or any implementation of the first aspect.
In a fifth aspect, the present application provides a computer readable storage medium having stored therein instructions for instructing a computing device or a cluster of computing devices to execute the task scheduling method according to any implementation manner of the first aspect or the first aspect.
In a sixth aspect, the present application provides a computer program product comprising instructions which, when run on a computing device or cluster of computing devices, cause the computing device or cluster of computing devices to perform the task scheduling method of the first aspect or any implementation of the first aspect.
Further combinations of the present application may be made to provide further implementations based on the implementations provided in the above aspects.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the embodiments are briefly described below.
FIG. 1 is a hardware configuration diagram of a server according to an embodiment of the present application;
FIG. 2 is a flowchart of a task scheduling method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a computational graph according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a communication deadlock according to an embodiment of the present application;
FIG. 5 is a flowchart of issuing computing tasks according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a task scheduling system according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a computing device according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a computing device cluster according to an embodiment of the present application;
FIG. 9 is a schematic diagram illustrating another computing device cluster according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a computing device cluster according to an embodiment of the present application.
Detailed Description
The terms "first", "second" in embodiments of the application are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature.
Some technical terms related to the embodiments of the present application will be described first.
A computational graph (Computational Graph) is a data structure that describes an AI model (e.g., a deep learning model). The computational graph represents the computation process of the AI model as a directed graph: nodes in the directed graph represent operators (OPs), and edges represent data dependencies (also referred to simply as dependencies) between the operators, indicating the flow paths of data. Operators perform computation operations or communication operations. For example, a computation operator may perform addition, subtraction, multiplication, or division, and a communication operator may perform collective communication operations. Each operator may represent a task. Accordingly, execution of the computational graph includes the scheduling of the operators in the computational graph (i.e., computational-graph task scheduling) and their execution.
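As a concrete illustration of this data structure, the following sketch (Python; all names are illustrative assumptions, not the API of any particular AI framework) represents operators as nodes and data dependencies as edges:

```python
from dataclasses import dataclass, field

@dataclass
class Operator:
    """One node of a computational graph; each operator represents a task."""
    op_id: str
    op_type: str  # e.g. "compute" (add, matmul, ...) or "communication"
    inputs: list = field(default_factory=list)   # upstream operator ids
    outputs: list = field(default_factory=list)  # downstream operator ids

# A tiny two-operator graph: the edge A -> B means B consumes a tensor of A.
graph = {
    "A": Operator("A", "compute", inputs=[], outputs=["B"]),
    "B": Operator("B", "compute", inputs=["A"], outputs=[]),
}
```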
Task scheduling of the computational graph, or simply task scheduling, refers to scheduling the operators in the computational graph (or the tasks they represent) onto hardware devices for execution according to a scheduling policy (Scheduling Strategy). A hardware device is specifically a hardware resource (Hardware Resources), including but not limited to a central processing unit (CPU), a graphics processing unit (GPU), a neural network processing unit (NPU), or a tensor processing unit (TPU). A scheduling policy refers to the rules that govern how the tasks represented by operators are allocated to hardware resources for execution.
Optimizing task scheduling can effectively improve the operating efficiency of the AI model, reduce the waste of computing resources, and lower computing costs. This is particularly important for AI models where resources are limited and computation demands are enormous. Moreover, task scheduling optimization ensures that tasks are completed as fast as possible while respecting task dependencies (i.e., the dependency relationships among tasks), thereby shortening the training or inference time of the AI model and improving overall efficiency. In addition, a carefully designed and optimized scheduling policy can improve the stability and reliability of the AI model, avoiding system performance degradation or computation errors caused by task scheduling problems.
Currently, a task scheduling method widely used in the industry is the graph-based scheduling execution mechanism. A typical implementation of the graph-based scheduling execution mechanism is the graph kernel (Graph Kernel). Graph Kernel is a programming model implemented on a computing platform that allows a developer to build and execute the entire computational graph directly at the hardware level, rather than just a single operator. In this model, compiled operators (called Kernels) are no longer executed independently but are organized in a unified graph (Graph). In the graph, the dependencies between Kernels are represented by edges (Edges), meaning that execution of one Kernel may depend on the completion of one or more other Kernels.
By organizing Kernels in one graph, Graph Kernel can achieve a higher level of parallelism. Specifically, if all dependencies of a Kernel (i.e., all Kernels that it depends on) have been satisfied, that Kernel may begin execution immediately without waiting for other unrelated Kernels. At any time, as long as hardware resources are available, a Kernel may begin execution, thus enabling higher hardware utilization.
However, in the graph-based scheduling execution mechanism, operator concurrency is high and the execution order of independent operators is determined dynamically, which tends to reduce the opportunity for memory multiplexing and thus increases memory consumption. Taking the training process of an AI model as an example, memory consumption can be roughly divided into two types: (1) static memory occupation, used to store weight parameters, gradient information, optimizer state, and the like; this occupation is generally proportional to the scale of the AI model, and the tensors (Tensors) occupying this memory are persistent, their life cycle spanning the whole training process until the AI model converges; (2) dynamic memory occupation, i.e., memory occupied by new Tensors dynamically generated by the AI model during the forward and backward passes; the life cycle of such a Tensor lasts from its generation until it is no longer needed. Dynamic memory occupation may have the opportunity to be multiplexed. In the graph-based scheduling execution mechanism, if multiple independent operators execute simultaneously, each must keep its own data (such as new Tensors dynamically generated during the forward or backward pass) in memory and cannot reuse the memory space of operators that have already completed, which not only increases overall memory consumption but may also cause memory fragmentation and reduce memory utilization.
Graph-based scheduling mechanisms typically adopt a greedy, "hungry" mode, i.e., keeping the hardware devices as saturated as possible. In this mode, operators that become Ready early are executed preferentially, so a large number of other operators may be interposed between two dependent operators A and B. Because the capacity of the cache (Cache) is limited, the output result of operator A originally resident in the Cache may be evicted; when execution reaches B, the output of A must be reloaded from memory into the Cache, which greatly reduces the Cache hit rate and affects overall execution efficiency.
Both increased memory consumption and a reduced Cache hit rate degrade the computing efficiency and performance of the AI model, making it difficult to meet the high demands of AI models in these respects.
In view of this, the present application provides a task scheduling method. The method may be performed by a task scheduling system. The task scheduling system may be a software system that may be part of the AI software framework or as a separate computing library for the developer to download and integrate into the application developed by the developer. The software system described above may be deployed in a chip (e.g., a CPU) or cluster of computing devices that execute the program code of the software system to perform the task scheduling method of the present application. The task scheduling system may also be a hardware system, for example, a chip or a computing device cluster with task scheduling functions, where the hardware system executes the task scheduling method of the present application.
Specifically, the task scheduling system may obtain a computation graph, where the computation graph includes a plurality of operators, each operator in the plurality of operators represents a task, and then the task scheduling system may sort the plurality of operators according to a dependency relationship of the plurality of operators in the computation graph, to obtain an ordered operator sequence. For operators with dependent items in the ordered operator sequence, the dependent items precede the operators. For example, the ordered operator sequence includes a first operator and a second operator dependent on the first operator, the first operator being arranged before the second operator. The task scheduling system may then group the plurality of operators according to the ordered operator sequence to obtain a plurality of operator groups, and the task scheduling system may sequentially schedule the plurality of operator groups to a device (the device for executing operators) according to priorities of the plurality of operator groups.
In this method, ordering the operators ensures that subsequent operators can safely use the computation results of earlier operators, guaranteeing the correctness of the whole computation process. The ordered operators are then grouped, and tasks are issued at operator-group granularity, which achieves effective scheduling under limited concurrency of the computational graph, avoids the reduction of memory multiplexing probability and cache hit rate caused by excessive concurrency, and improves the computing efficiency and performance of the AI model.
The task scheduling method of the present application is a policy-controlled task scheduling method and can be widely applied to the optimization of various AI models. The AI model may be one that must process tasks in real time or near real time, including but not limited to applications such as autonomous driving and video processing; the policy-controlled task scheduling method can effectively schedule and manage tasks so as to meet strict time constraints. In addition, the method can be applied in the training stage or the inference stage of an AI model to accelerate the training or inference process and make efficient use of computing resources. It should be noted that the task scheduling method can be extended to other application scenarios requiring task scheduling, including but not limited to big data processing, scientific computing, and graphics rendering, which is not limited by the present application.
In some possible implementations, the task scheduling method can be applied to multi-core computing devices: on a GPU, a TPU, or other hardware with multiple computing cores, the policy-controlled task scheduling method can effectively allocate and manage the computing resources of each core, improving overall operating efficiency. In other possible implementations, the task scheduling method may also be applied to massively parallel computing architectures, such as one formed by multiple computing devices. In a massively parallel computing environment, such as a distributed system or a cluster computing environment, the task scheduling method can maximize parallelism on the premise of guaranteeing algorithmic correctness, thereby shortening the overall execution time of tasks.
For ease of description, the following is illustrated with a multi-core computing device as a server.
Referring to the hardware architecture of a server shown in FIG. 1, the server 10 may be a server with a single-machine multi-card architecture, and multiple servers 10 as shown in FIG. 1 may also form a server cluster (also referred to as a computing device cluster) with a multi-machine multi-card architecture. In "multi-machine multi-card", "machine" refers to a host (host), and "card" refers to a device (device) such as an accelerator card (also referred to as an accelerator).
The server 10 includes a host 12 and a device 14, where the host 12 and the device 14 are connected. The host 12 includes a processor, which may be a CPU, and memory, which may be dual in-line memory modules (DIMMs) serving as host memory. The DIMMs may specifically be of the double data rate (DDR) type; for example, the memory may be DDR4 DIMMs. In the example of FIG. 1, the host 12 includes 4 CPUs and 4 DDR4 DIMM groups; each CPU is connected to 1 DDR4 DIMM group, and each DDR4 DIMM group includes 8 DDR4 DIMMs. The multiple CPUs of the host 12 can be interconnected to form a hydra mesh.
Optionally, the host 12 also includes interfaces, such as one or more of a Serial Advanced Technology Attachment (SATA) interface, a Non-Volatile Memory Express (NVMe) interface, and a Gigabit Ethernet (GE) interface. The host 12 may also include storage, which may be SATA storage or NVMe storage, such as a SATA mechanical hard disk or an NVMe solid state drive (SSD).
The device 14 comprises an accelerator card. In the example of fig. 1, the acceleration card included in the device 14 may be an NPU. Fig. 1 illustrates an apparatus 14 including 8 NPUs. In other possible implementations of embodiments of the present application, device 14 may also include more accelerator cards, such as more types of accelerator cards, or a greater number of accelerator cards.
FIG. 1 is merely an example of a computing device that performs the task scheduling method of the present application. In other possible implementations of the embodiments of the present application, the computing device may be a server with a single-machine single-card architecture, which is not limited in this embodiment.
In order to make the technical scheme of the application clearer and easier to understand, the task scheduling method of the application is described in detail below by combining the embodiment.
Referring to a flowchart of a task scheduling method shown in fig. 2, the method may be executed by a server, and specifically includes the following steps:
S202, the server acquires a calculation map.
The computational graph includes a plurality of operators, each of which represents a task. The tasks differ according to operator type; for example, the operators may be computation operators or communication operators, and accordingly the tasks may be computation tasks or communication tasks (such as collective communication tasks).
For ease of understanding, the present application also provides an example of a computational graph. As shown in FIG. 3, the computational graph is essentially a directed acyclic graph (DAG). In the figure, the rectangles correspond to operators; different colors or gray scales of the rectangles represent different operator types, and operators represented by rectangles of different colors or gray scales can be executed on different hardware units. The first line in a rectangle gives the identifier (ID) of the i-th operator, and the content in brackets in the second line gives the time t_i required for the operator to complete execution. The ovals in the figure represent the output of each operator, e.g., the name of the Tensor output by the operator and its size. An arrow indicates that a particular Tensor is an input or output of a particular operator. For example, the arrow from operator A to node TA1 indicates that Tensor TA1 is an output of operator A, while the arrow from node TA1 to operator B indicates that Tensor TA1 is one of the inputs of operator B.
S204, the server sorts the operators according to the dependency relationship of the operators in the calculation graph to obtain an ordered operator sequence.
The execution order of the operators in the computational graph needs to satisfy the topological sorting (Topological Sorting) rule. Topological sorting is a concept from graph theory for ordering the nodes of a directed acyclic graph (DAG) such that for every directed edge (u, v), node u is ordered before node v. On this basis, the server can sort the operators according to the dependency relationship of the operators in the computational graph to obtain an ordered operator sequence, in which any operator's dependencies are arranged before the operator itself. For example, the ordered operator sequence includes a first operator and a second operator, where the second operator depends on the first operator, that is, the first operator is a dependency of the second operator, and the first operator is arranged before the second operator.
In specific implementation, the server may select different sorting strategies or combinations of sorting strategies according to requirements, and sort the plurality of operators to obtain an ordered operator sequence. Different sorting strategies are exemplified below.
One sorting strategy is the critical path (Critical Path) strategy. When the demand or target is performance first, the server can sort the operators according to the dependency relationship of the operators in the computation graph using the critical path strategy to obtain an ordered operator sequence. The core of the critical path strategy is to find the longest path length from each operator to the output node and use it to assign a priority (Priority) to each operator. The priority refers to the execution priority level of each operator in the computational graph; operators with higher priority obtain computing resources first. For example, the server may determine the priority of an operator based on the dependencies of the operators and the longest path length from the operator to the output node. Nodes with higher priority (e.g., a longer path to the output node) are ordered earlier; in other words, nodes with higher priority are traversed earlier while the execution order of operators with dependency relationships is guaranteed.
In a specific implementation, the server may initialize the longest path length of all operators to 0, and then, using dynamic programming, traverse from each operator toward the output operators, updating the longest path length of each operator. For any operator, its longest path length equals the largest longest path length among all of its successor operators plus 1. The server can then determine the priority of each operator from the dependency relationships between operators and the longest path lengths, and order the operators by priority to obtain an ordered operator sequence. On the premise of satisfying the topological sorting rule, operators with longer longest path lengths are ordered earlier and traversed first.
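A minimal sketch of this computation (Python; `succs` is an assumed mapping from each operator to its successors) applies exactly the rule above: the longest path length of an operator is the maximum over its successors plus 1, and output operators get 0.

```python
def critical_path_priorities(succs):
    """Sketch of the critical path strategy: for each operator, the priority
    is its longest path length to an output node (larger = ordered earlier)."""
    memo = {}
    def longest(op):
        if op not in memo:
            nxt = succs.get(op, [])
            # Output operators (no successors) have longest path length 0.
            memo[op] = 0 if not nxt else 1 + max(longest(s) for s in nxt)
        return memo[op]
    return {op: longest(op) for op in succs}
```

With the graph of FIG. 3 encoded as `succs`, operators farther from the outputs receive larger values and are therefore scheduled earlier.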
Another sorting strategy is the inverse depth-first search (DFS) strategy. When the demand or target is memory first, the server can sort the operators according to the dependency relationship of the operators in the computation graph using the inverse DFS strategy to obtain an ordered operator sequence. The core of the inverse DFS strategy is to first invert the computation graph so that all edges are reversed, then perform a depth-first search on the inverted computation graph, and invert the obtained traversal order to obtain the topological order of the original graph.
In a specific implementation, the server may invert the calculation graph G1 to obtain the calculation graph G2. For example, if there is an edge in the computation graph G1 from operator a to operator B, then in G2 the direction of this edge will point from operator B to operator a. The server then performs a depth-first search of the computation graph G2. The server may start searching from the output node, and the order obtained at the end of the traversal is the reverse topological order of the computation graph G1. The server inverts the inverse topological ordering of the calculation graph G1 to obtain the topological ordering of the calculation graph G1.
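The following Python sketch (assumed `succs` adjacency, illustrative names) implements this recipe: recording operators in DFS finish order on the inverted graph G2, starting from the original output nodes, yields a topological order of the original graph G1.

```python
def inverse_dfs_topo_order(succs):
    """Sketch of the inverse DFS strategy on a DAG given as op -> successors."""
    # Step 1: invert the graph, so an edge u -> v in G1 becomes v -> u in G2.
    g2 = {op: [] for op in succs}
    for u, vs in succs.items():
        for v in vs:
            g2[v].append(u)
    # Output nodes of G1 are the operators with no successors.
    outputs = [op for op, vs in succs.items() if not vs]
    order, seen = [], set()
    def dfs(op):
        if op in seen:
            return
        seen.add(op)
        for pred in g2[op]:
            dfs(pred)
        order.append(op)  # finish order: all G1-predecessors appear first
    # Step 2: depth-first search on G2 starting from the output nodes.
    for out in outputs:
        dfs(out)
    return order  # a topological order of the original graph G1
```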
S206, the server groups the plurality of operators according to the ordered operator sequence to obtain a plurality of operator groups.
Operator grouping (Operator Grouping) partitions operators with the same or similar priorities into the same operator group for efficient task scheduling. Because the operators in each operator group have the same or similar priority, this grouping facilitates efficient parallel execution of operators during subsequent scheduling while ensuring correctness of execution.
In particular, the server may set the priority according to the topological-sort index. In some examples, the server may directly use the topological-sort index as the priority of an operator; in other words, operators ordered earlier have higher priority. In other examples, the server may compare the topological-sort index against at least one threshold to obtain the priority of the corresponding operator: when the indexes of different operators fall into the same value interval (an interval determined by the at least one threshold), those operators have the same priority; when the indexes fall into different value intervals, the operator with the smaller index has higher priority than the operator with the larger index.
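As a small illustration of the threshold variant (the interval scheme below is an assumption about one possible realization), indexes falling in the same interval share one priority, and smaller indexes map to higher priority:

```python
def priority_from_index(index, thresholds):
    """Sketch: map a topological-sort index to a priority level by comparing
    it against ascending thresholds; a smaller returned level is assumed here
    to mean a higher scheduling priority."""
    for level, bound in enumerate(sorted(thresholds)):
        if index < bound:
            return level
    return len(thresholds)

# Example: thresholds [4, 8] put indexes 0-3, 4-7, and 8+ in three levels.
```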
Similar to ordering, the server supports multiple grouping policies or combinations of grouping policies for grouping multiple operators. Different grouping strategies are respectively exemplified below.
One grouping strategy may be uniform grouping. For example, the server may set the capacity of each operator group, which may be a constant N representing the number of operators in the operator group, and group the operators in the ordered operator sequence according to this capacity to obtain a plurality of operator groups. For example, when the computation graph includes L×N operators, the operators in the ordered operator sequence may be uniformly divided into L operator groups, each containing N consecutive operators of the ordered operator sequence.
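A sketch of uniform grouping follows (Python; the ascending group numbering here is illustrative — the run-time embodiment described later numbers groups in descending priority order):

```python
def uniform_grouping(ordered_ops, n):
    """Cut the ordered operator sequence into consecutive groups of capacity N;
    the last group may be smaller if the operator count is not a multiple of N.
    Also returns the group ID of each operator, recorded for the run time."""
    groups = [ordered_ops[i:i + n] for i in range(0, len(ordered_ops), n)]
    group_id = {op: k for k, group in enumerate(groups) for op in group}
    return groups, group_id
```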
Another grouping strategy may be grouping based on operator characteristics. Specifically, different operators may differ in computational complexity, memory consumption, or load characteristics, where memory consumption includes the memory size occupied by an operator's input/output (IO), and load characteristics describe the load type of an operator, e.g., computation-intensive or IO-intensive. To ensure execution efficiency and performance, the server may group the plurality of operators according to the ordered operator sequence in combination with at least one of the computational complexity, memory consumption, or load characteristics of the operators to obtain a plurality of operator groups.
For example, the server may obtain the memory consumption of the operators, and when the memory occupied by the inputs and outputs (e.g., input tensors and output tensors) of M consecutive operators exceeds the capacity C, the server may place the next operator in a new operator group.
For another example, the server may obtain load features of a plurality of operators, and when grouping the plurality of operators according to the ordered operator sequence, combine the load features of the plurality of operators to perform grouping, so that the computation-intensive operators and the IO-intensive operators of the same operator group are balanced, and the execution efficiency of the operators in the group is prevented from being affected.
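One reasonable reading of the memory-based rule is sketched below (Python; `io_bytes` is an assumed helper giving the IO memory footprint of an operator, and `capacity_c` stands for the capacity C): a new group is opened once accumulating the next operator would push the group's IO footprint past C.

```python
def memory_aware_grouping(ordered_ops, io_bytes, capacity_c):
    """Hedged sketch of grouping by memory consumption along the ordered
    operator sequence: group boundaries are placed so that the input/output
    memory accumulated within one group stays within the capacity C."""
    groups, current, used = [], [], 0
    for op in ordered_ops:
        need = io_bytes(op)
        if current and used + need > capacity_c:
            groups.append(current)  # close the current group
            current, used = [], 0
        current.append(op)
        used += need
    if current:
        groups.append(current)
    return groups
```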
It should be noted that, the server may also support more scheduling policies during task scheduling, including a sorting policy and a grouping policy. For example, a user may customize the scheduling policy in combination with the computational complexity of the operator, memory consumption (including IO requirements), and load characteristics. On the basis of the computational graph task scheduling method based on policy control, the method can be used for combining hardware optimization means with scheduling policies, for example, the method supports optimizing the scheduling policies according to the characteristics (such as cache size, memory bandwidth and the like) of specific hardware.
Further, the server may record a group Identifier (ID) of an operator group in which each operator is located. In this way, in the subsequent execution process, the server can quickly find the corresponding operator according to the group id of the operator group where the operator is located, thereby improving the task scheduling efficiency. Similarly, the server can also count the number of operators in the operator groups, so that in the subsequent execution process, the execution condition of each operator group can be effectively managed and controlled.
The steps S202 to S206 are steps that the method needs to execute in the compiled state (Compile-time). Where compile state refers to the stage of program or model compilation, operations performed in the compile state include, but are not limited to, topological ordering, prioritizing, and the like. The compiling-state processing step is beneficial to efficiently and accurately scheduling operators in a calculation map in a running state (Run-time), so that the running performance of the whole AI model is improved.
S208, the server sequentially dispatches the plurality of operator groups to the equipment according to the priorities of the plurality of operator groups.
The host of the server may schedule the plurality of operator groups to the device sequentially according to priorities of the operator groups with the operator groups as granularity, so that the device executes tasks corresponding to the operators in the operator groups according to the priorities. Wherein the priorities of the operator groups can be determined according to the priorities of the operators in the operator groups. For example, operators in an operator group have the same priority, which may be characterized by the priority of the operators.
Specifically, the plurality of operator groups may include a first operator group and a second operator group, where the priority of the first operator group is higher than that of the second operator group, and accordingly, the server may schedule tasks corresponding to operators in the first operator group to the device first, and when the operator scheduling in the first operator group is completed, the server may schedule tasks corresponding to operators in the second operator group to the device.
In some possible implementations, the server may add operators in the ready state in the plurality of operator groups to a priority queue, based on which tasks are issued at operator-group granularity. The operators in the ready state include operators that have no dependencies and operators whose dependencies have all finished executing; in the initial phase, the ready operators are typically the operators with no dependencies. The server may take a target operator out of the priority queue and schedule (issue) it to the device, where the target operator has a higher priority than the remaining operators in the priority queue; in some examples, the target operator may be the highest-priority operator in the queue. When execution of the target operator is completed, the server may check whether the operators on which each subsequent operator of the target operator depends have all completed, in other words, whether completion of the target operator makes a subsequent operator ready. If so, the server may add the subsequent operator to the priority queue and continue to take a new operator (e.g., a new target operator) from the priority queue for task scheduling.
When taking the target operator out of the priority queue and issuing it to the device, the server may obtain the group identifier of the operator group to which the target operator belongs. When this group identifier equals the group identifier of the currently scheduled operator group, the target operator belongs to the currently scheduled operator group, and the server issues it to the device. When the group identifier of the operator group to which the target operator belongs does not equal that of the currently scheduled operator group, the server may leave the target operator unprocessed for now and schedule it after the operators in the currently scheduled operator group have all been scheduled.
Further, the server may check the number of operators to be scheduled in the currently scheduled operator group. When the number of operators to be scheduled is equal to 0, which indicates that operators in the currently scheduled operator group have all been scheduled, the server may switch the currently scheduled operator group to a next operator group, which may be the operator group with the highest priority arranged after the currently scheduled operator group, for example, the first operator group arranged after the currently scheduled operator group. When the operator group is switched, the number of operators to be scheduled can be initialized to the number of operators in the switched operator group. Further, when the number of operators to be scheduled is not equal to 0, it indicates that operators in the currently scheduled operator group are not scheduled, and operators with higher priority are not enqueued in the currently scheduled operator group.
In some possible implementations, the server may also obtain the scheduling result and a performance value corresponding to the scheduling result (e.g., inference time, memory consumption, or computing resource utilization). The performance value corresponding to the scheduling result may be used as a feedback signal, and the server may update the scheduling parameters accordingly. The scheduling parameters include at least one of grouping parameters (e.g., the memory capacity C or the capacity of an operator group) or sorting parameters. By continuously learning and optimizing the scheduling policy from the scheduling results and performance values fed back in real time, the task scheduling system can better adapt to different tasks and operating environments.
Step S208 is executed at run time: based on the topological order, priorities, grouping information, group IDs, and other data generated at compile time, the server schedules and executes the operators in the computation graph in order of priority.
Based on the description, the task scheduling method provided by the application can realize limited concurrent task scheduling by grouping the sequenced operators and issuing the tasks with the operators as granularity, thereby improving the memory multiplexing opportunity, solving the problem of low Cache hit rate and improving the end-to-end performance and the memory utilization of the AI model.
Moreover, graph-based scheduling mechanisms may cause communication deadlock. Specifically, referring to FIG. 4, devices with different ranks, such as accelerator card 0 (denoted Rank 0) and accelerator card 1 (denoted Rank 1), each run multiple threads, including thread 0 and thread 1. Under a graph-based scheduling mechanism, the concurrency progress on different devices is inconsistent, which may cause the communication operators to be issued in different orders and ultimately lead to deadlock. For example, if the communication operator executed by thread 0 on accelerator card 0 and the communication operator executed by thread 1 on accelerator card 1 are each waiting for data sent by the other while not sending their own data, a deadlock may occur, bringing the entire computation to a halt. One way to avoid communication deadlock is to configure at most one communication operator per computational graph, but this reduces the scale of the computational graph and the opportunity for concurrent execution of different operator types. The task scheduling method of the present application instead groups the communication operators in the computational graph so that each operator group contains at most one communication operator; since the order of the operator groups is fixed, the issuing order of the communication operators is consistent across devices, which solves the communication deadlock problem while preserving execution efficiency.
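A hedged sketch of the "at most one communication operator per group" constraint (Python; `is_comm` is an assumed predicate): groups are post-processed by splitting, without reordering, whenever a second communication operator would fall into the same group, so every device issues the communication operators in the same group order.

```python
def split_by_comm(groups, is_comm):
    """Split operator groups so that each group contains at most one
    communication operator, keeping the original operator order intact."""
    result = []
    for group in groups:
        current, has_comm = [], False
        for op in group:
            if is_comm(op) and has_comm:
                result.append(current)         # start a new group before the
                current, has_comm = [], False  # second communication operator
            current.append(op)
            has_comm = has_comm or is_comm(op)
        if current:
            result.append(current)
    return result
```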
Next, the execution process of the running state will be described in detail with reference to another embodiment.
Referring to the flowchart of another task scheduling method shown in fig. 5, let the number of operator groups be m and the number of operators in the k-th group be N_k (k = 1, 2, ..., m); the group ID of operator i is g_i; the index g_c denotes the group ID of the currently scheduled operator group; and R_c denotes the number of operators to be scheduled in the currently scheduled operator group g_c, i.e., the number of operators in group g_c that remain unissued. The server may then perform the method steps of the following stages:
Stage one, initializing.
Specifically, the server may maintain a pointer to the currently scheduled operator group. At initialization, the pointer points to the group with the highest priority (in this embodiment, the group ID of the highest-priority operator group is assumed to be m), after which the server completes the following initialization steps:
a) Initializing an empty priority queue, denoted Q;
b) g_c = m;
c) R_c = N_gc, i.e., initializing R_c to the number of operators in the currently scheduled operator group.
Stage two, enqueuing ready operators.
Specifically, the server may place operators in the ready state (also called ready operators) into the priority queue Q according to the priorities calculated in the compiled state. An operator is in the ready state if it has no dependencies, or if all of its dependencies have finished executing. At the initial stage, the ready operators are those that do not depend on the output of any other operator; a computation graph has at least one operator in the ready state at the initial stage.
Stage three, operator scheduling.
Specifically, the server may select the operator i with the highest current priority from the priority queue Q for issuing, according to the priorities of the operators. The operators stored in the priority queue are all in the ready state, and an operator begins executing on the Device once issued. The issuing of operators falls into two cases:
a) If the group ID g_i of the operator group containing operator i is not equal to g_c, return directly without issuing operator i for now;
b) If the group ID g_i of the operator group containing operator i is equal to g_c, fetch operator i from the priority queue, issue it for execution, and then update R_c ← R_c - 1.
Stage four, operator execution completion.
Specifically, the completion of each operator may trigger a dispatch interrupt. In response, the server checks whether any successor operators of the completed operator have thereby become ready. For each successor whose dependencies have all finished executing, the server marks its status as ready and adds it to the priority queue Q.
Stage five, operator group switching.
When all operators of one operator group have been issued, the server may begin issuing the operator group with the next-highest priority. Specifically, the server points the pointer to the next-priority operator group, with the following switching logic:
a) If R_c = 0, the number of unissued operators in the currently scheduled operator group g_c is 0, and a group switch is required. The server updates g_c by g_c ← g_c - 1. If g_c = 0, all operators have been issued to the Device for execution; otherwise the server updates R_c ← N_gc and jumps to stage three.
b) If R_c ≠ 0, the currently scheduled operator group g_c still contains higher-priority operators that are not yet in the priority queue, and no group switch is performed. The server repeats stages two through five until g_c = 0, indicating that all operator groups have been issued.
In this method, when all operators have finished executing, the entire computation graph has completed execution, and the server may return and end the procedure.
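The five stages above can be summarized in the following Python sketch. It models operator execution as completing synchronously upon issue, and the dict-based operator representation (priority, group, deps, succs fields) is an assumption of this example rather than the claimed implementation.

```python
import heapq

def run_state_schedule(ops, m):
    """Sketch of stages one to five. `ops` maps an operator id to
    {'priority': int, 'group': int, 'deps': set, 'succs': list};
    groups are numbered 1..m, group m having the highest priority."""
    # Stage one: initialization.
    q = []                                    # priority queue Q
    gc = m                                    # currently scheduled group ID
    group_size = {}
    for op in ops.values():
        group_size[op['group']] = group_size.get(op['group'], 0) + 1
    rc = group_size.get(gc, 0)                # unissued operators in group gc
    pending = {i: set(op['deps']) for i, op in ops.items()}

    # Stage two: enqueue the initially ready operators.
    for i, op in ops.items():
        if not pending[i]:
            heapq.heappush(q, (-op['priority'], i))

    issued = []
    while gc >= 1:
        # Stage three: issue the highest-priority ready operator, but only
        # if it belongs to the currently scheduled group gc.
        while q and ops[q[0][1]]['group'] == gc:
            _, i = heapq.heappop(q)
            issued.append(i)                  # issue to the Device
            rc -= 1
            # Stage four: completion makes successors ready (synchronous
            # completion is assumed here for simplicity).
            for s in ops[i]['succs']:
                pending[s].discard(i)
                if not pending[s]:
                    heapq.heappush(q, (-ops[s]['priority'], s))
        if rc == 0:
            # Stage five: group gc fully issued; switch to the next group.
            gc -= 1
            rc = group_size.get(gc, 0)
        else:
            # In a real asynchronous system, stages two to five would
            # resume on the next dispatch interrupt.
            break
    return issued
```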
In the above running-state algorithm flow, the execution status of each operator group is monitored so that the next group is issued only after all operators in the higher-priority group have been issued. This achieves effective scheduling of the computation graph with limited concurrent issuing, improves the execution performance of the whole AI model, and can meet the requirements of different AI models. Meanwhile, the method avoids the drawbacks of stream-based scheduling mechanisms, namely poor generality, frequent operator launch operations, high synchronization overhead, and loss of parallelism due to inter-operator dependencies, as well as the drawbacks of graph-based scheduling mechanisms, namely increased memory consumption, reduced cache hit rate, and communication deadlocks.
Based on the task scheduling method, the application also provides a task scheduling system. As shown in fig. 6, the system 600 includes:
an obtaining module 602, configured to obtain a computation graph, where the computation graph includes a plurality of operators, and each operator in the plurality of operators represents a task;
A sorting module 604, configured to sort the plurality of operators according to the dependency relationships of the plurality of operators in the computation graph, so as to obtain an ordered operator sequence, where the ordered operator sequence includes a first operator and a second operator that depends on the first operator, and the first operator is arranged before the second operator;
A grouping module 606, configured to group the plurality of operators according to the ordered operator sequence, to obtain a plurality of operator groups;
A scheduling module 608, configured to schedule the plurality of operator groups to the device sequentially according to priorities of the plurality of operator groups.
Illustratively, the acquisition module 602, the sorting module 604, the grouping module 606, and the scheduling module 608 described above may be implemented by hardware or by software.
When implemented in software, the acquisition module 602, the sorting module 604, the grouping module 606, and the scheduling module 608 may be an application running on a computer device, such as a computing engine. The application may also be provided to users through virtualization, by means of a virtualization service. The virtualization service may include a virtual machine (VM) service, a bare metal server (BMS) service, and a container service. The VM service virtualizes a VM resource pool on a plurality of physical hosts through virtualization technology, to provide users with VMs on demand. The BMS service virtualizes a BMS resource pool on a plurality of physical hosts, to provide users with BMSs on demand. The container service virtualizes a container resource pool on a plurality of physical hosts, to provide users with containers on demand. A VM is a simulated virtual computer, i.e., a computer in the logical sense. A BMS is an elastically scalable high-performance computing service whose computing performance is indistinguishable from that of a traditional physical machine, with the characteristic of secure physical isolation. A container is a kernel virtualization technology that provides lightweight virtualization to isolate user space, processes, and resources. It should be understood that the VM service, BMS service, and container service above are merely specific examples; in practical applications, the virtualization service may be other lightweight or heavyweight virtualization services, which are not specifically limited herein.
When implemented in hardware, the acquisition module 602, the sorting module 604, the grouping module 606, and the scheduling module 608 may include at least one computing device. Alternatively, these modules may be devices implemented using an application-specific integrated circuit (ASIC) or a programmable logic device (PLD), among others. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
In some possible implementations, the plurality of operator groups includes a first operator group and a second operator group, the first operator group having a higher priority than the second operator group. The scheduling module 608 is specifically configured to schedule the operators in the first operator group to the device, and schedule the operators in the second operator group to the device when the operator scheduling in the first operator group is completed.
In some possible implementations, the scheduling module 608 is specifically configured to:
adding operators in the ready state among the plurality of operator groups to a priority queue, where an operator in the ready state is an operator that has no dependencies or whose dependencies have all finished executing;
fetching a target operator from the priority queue and scheduling the target operator to the device, where the priority of the target operator is higher than that of the remaining operators in the priority queue;
and when execution of the target operator is completed, checking whether all operators that the successor operators of the target operator depend on have finished executing; if so, adding those successor operators to the priority queue, and continuing to fetch new operators from the priority queue for scheduling.
In some possible implementations, the scheduling module 608 is specifically configured to:
acquiring a group identifier of an operator group to which a target operator belongs;
and when the group identifier of the operator group to which the target operator belongs is equal to the group identifier of the currently scheduled operator group, scheduling the target operator to the equipment.
In some possible implementations, the scheduling module 608 is further configured to:
Updating the number of operators to be scheduled in the current scheduled operator group;
and when the number of operators to be scheduled is equal to 0, switching the currently scheduled operator group into the next operator group.
In some possible implementations, the sorting module 604 is specifically configured to:
sorting the plurality of operators according to their dependency relationships in the computation graph through a critical path strategy or an inverse depth-first search strategy, to obtain the ordered operator sequence.
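As one hedged illustration of the critical path strategy, the sketch below performs a topological sort that breaks ties by the longest remaining path; the cost and succs fields are assumed inputs, and an inverse depth-first search could equally supply the tie-breaking rule.

```python
import heapq

def critical_path_order(ops):
    """Topological sort that prefers, among ready operators, the one on
    the longest remaining (critical) path, so dependency-heavy chains
    are started early. `ops` maps id -> {'cost': float, 'succs': [ids]}."""
    memo = {}
    def cp(i):  # critical-path length from operator i to a sink
        if i not in memo:
            succs = ops[i]['succs']
            memo[i] = ops[i]['cost'] + (max(cp(s) for s in succs)
                                        if succs else 0.0)
        return memo[i]

    indeg = {i: 0 for i in ops}
    for op in ops.values():
        for s in op['succs']:
            indeg[s] += 1
    heap = [(-cp(i), i) for i in ops if indeg[i] == 0]
    heapq.heapify(heap)
    order = []
    while heap:
        _, cur = heapq.heappop(heap)
        order.append(cur)                 # dependencies always precede cur
        for s in ops[cur]['succs']:
            indeg[s] -= 1
            if indeg[s] == 0:
                heapq.heappush(heap, (-cp(s), s))
    return order
```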
In some possible implementations, the grouping module 606 is specifically configured to:
And grouping the operators according to the ordered operator sequence through a uniform grouping strategy to obtain a plurality of operator groups.
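A uniform grouping strategy can be as simple as the following sketch, where capacity is an assumed tunable grouping parameter:

```python
def uniform_group(ordered_ops, capacity):
    """Cut the ordered operator sequence into consecutive groups of at
    most `capacity` operators each (the last group may be smaller)."""
    return [ordered_ops[i:i + capacity]
            for i in range(0, len(ordered_ops), capacity)]
```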
In some possible implementations, the grouping module 606 is specifically configured to:
acquiring at least one of computational complexity, memory consumption or load characteristics of the operators;
And grouping the operators according to the ordered operator sequence and combining at least one of computational complexity, memory consumption or load characteristics of the operators to obtain a plurality of operator groups.
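For example, a grouping that also accounts for memory consumption might greedily close a group once a per-group memory budget would be exceeded; the mem_mb field and the budget are illustrative assumptions, not part of the claimed method.

```python
def cost_aware_group(ordered_ops, memory_budget_mb):
    """Greedily close a group once its accumulated memory consumption
    would exceed the budget. Each op is a dict with a 'mem_mb' field."""
    groups, current, used = [], [], 0.0
    for op in ordered_ops:
        if current and used + op['mem_mb'] > memory_budget_mb:
            groups.append(current)
            current, used = [], 0.0
        current.append(op)
        used += op['mem_mb']
    if current:
        groups.append(current)
    return groups
```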
In some possible implementations, the scheduling module 608 is further configured to:
Acquiring a scheduling result and a performance value corresponding to the scheduling result;
And updating a scheduling parameter according to the scheduling result and the performance value, wherein the scheduling parameter comprises at least one of a grouping parameter or a sorting parameter.
The present application also provides a computing device 700. As shown in fig. 7, computing device 700 includes a bus 702, a processor 704, an accelerator 705, a memory 706, and a communication interface 708. The processor 704, the accelerator 705, the memory 706, and the communication interface 708 communicate via the bus 702. Computing device 700 may be a server or a terminal device. It should be understood that the present application does not limit the number of processors or memories in computing device 700.
Bus 702 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, among others. Buses may be divided into address buses, data buses, control buses, and so on. For ease of illustration, only one line is shown in fig. 7, but this does not mean there is only one bus or only one type of bus. Bus 702 may include a path for transferring information between the various components of computing device 700 (e.g., memory 706, processor 704, communication interface 708).
The processor 704 may include a central processing unit (CPU) and is typically located on the host side. The accelerator 705 may include any one or more of a graphics processing unit (GPU), a neural network processing unit (NPU), or a tensor processing unit (TPU); the accelerator 705 is typically located on the device side.
The memory 706 may include volatile memory, such as random access memory (RAM). The memory 706 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid state drive (SSD). The memory 706 stores executable program code, and the processor 704 executes the executable program code to implement the task scheduling method described above. Specifically, the memory 706 stores the instructions by which the task scheduling system 600 performs the task scheduling method.
Communication interface 708 enables communication between computing device 700 and other devices or communication networks using a transceiver module such as, but not limited to, a network interface card, transceiver, or the like.
The embodiment of the application also provides a computing device cluster. The cluster of computing devices includes at least one computing device. The computing device may be a server, such as a central server, an edge server, or a local server in a local data center. In some embodiments, the computing device may also be a terminal device such as a desktop, notebook, or smart phone.
As shown in fig. 8, the cluster of computing devices includes at least one computing device 700. The memory 706 in one or more computing devices 700 in the cluster of computing devices may have stored therein instructions of the same task scheduling system 600 for performing the task scheduling method.
In some possible implementations, one or more computing devices 700 in the cluster of computing devices may also be used to execute some instructions of the task scheduling system for performing the task scheduling method. In other words, a combination of one or more computing devices 700 may collectively execute instructions of the task scheduling system 600 for performing a task scheduling method.
It should be noted that the memory 706 in different computing devices 700 in a cluster of computing devices may store different instructions for performing part of the functions of the task scheduling system.
Fig. 9 shows one possible implementation. As shown in fig. 9, two computing devices 700A and 700B are connected through a communication interface 708. Instructions for performing the functions of the acquisition module 602, the ordering module 604, and the grouping module 606 are stored in memory in the computing device 700A. Instructions for performing the functions of scheduling module 608 are stored in memory in computing device 700B. In other words, the memory 706 of the computing devices 700A and 700B collectively store instructions for the task scheduling system 600 for performing the task scheduling method.
The division between the computing devices shown in fig. 9 reflects the fact that the task scheduling method provided by the present application performs sorting and grouping in the compiled state and then performs task scheduling according to the grouping result in the running state. It is therefore contemplated that the functions implemented by the scheduling module 608 be performed by computing device 700B.
It should be appreciated that the functionality of computing device 700A shown in fig. 9 may also be performed by multiple computing devices 700. Likewise, the functionality of computing device 700B may also be performed by multiple computing devices 700.
In some possible implementations, one or more computing devices in a cluster of computing devices may be connected through a network. Wherein the network may be a wide area network or a local area network, etc. Fig. 10 shows one possible implementation. As shown in fig. 10, two computing devices 700C and 700D are connected by a network. Specifically, the connection to the network is made through a communication interface in each computing device. In this type of possible implementation, instructions to perform the functions of the acquisition module 602, the ordering module 604, and the grouping module 606 are stored in the memory 706 in the computing device 700C. Meanwhile, the memory 706 in the computing device 700D has stored therein instructions for performing the functions of the scheduling module 608.
The division between the computing devices shown in fig. 10 likewise reflects that the task scheduling method provided by the present application performs sorting and grouping in the compiled state and performs task scheduling according to the grouping result in the running state; therefore, the functions implemented by the scheduling module 608 are considered to be performed by computing device 700D.
It should be appreciated that the functionality of computing device 700C shown in fig. 10 may also be performed by multiple computing devices 700. Likewise, the functionality of computing device 700D may also be performed by multiple computing devices 700.
The embodiment of the application also provides a chip. The chip includes a processor and a memory; the processor may be a host-side CPU. In particular implementations, the memory stores computer-readable instructions, and the processor executes the computer-readable instructions to cause the chip to perform the task scheduling method of the foregoing embodiments.
The embodiment of the application also provides a computer-readable storage medium. The computer-readable storage medium may be any available medium that a computing device can store, or a data storage device, such as a data center, containing one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state drive), etc. The computer-readable storage medium includes instructions that instruct a computing device to perform the task scheduling method described above, as applied to the task scheduling system 600.
Embodiments of the present application also provide a computer program product containing instructions. The computer program product may be software or a program product containing instructions that can run on a computing device or be stored in any usable medium. When the computer program product runs on at least one computing device, the at least one computing device performs the task scheduling method described above.
It should be noted that the above embodiments are merely intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the above embodiments, those skilled in the art should understand that the technical solutions described in the above embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.
