This application claims priority to Chinese Patent Application No. 202310792403.3, filed with the China National Intellectual Property Administration on June 29, 2023, and entitled "Method for Executing a Computational Graph", which is incorporated herein by reference in its entirety.
Disclosure of Invention
The application provides an operator processing method that constructs an operator execution queue based on the topology information of a computational graph, so that the execution graph is built in the form of an operator execution queue. The method can be flexibly applied to any scenario requiring batch operator scheduling and meets service requirements. The application also provides an operator processing apparatus, a computer-readable storage medium, and a computer program product corresponding to the operator processing method.
In a first aspect, the present application provides an operator processing method. The method may be performed by an operator processing apparatus. The operator processing apparatus may be a software apparatus, for example a stand-alone software package or a cloud service that provides an interface (e.g. a graph composition interface) through which a user can build a computational graph from operators in a user-defined manner and have its topology represented as a queue. The software apparatus may be deployed in a computing device cluster (e.g. the computing device cluster that executes the operators, or a third-party computing device cluster), which executes the program code of the software apparatus to perform the operator processing method of the application. The operator processing apparatus may also be a hardware apparatus, for example a computing device or a computing device cluster with an operator processing function. When the operator processing apparatus runs, it performs the operator processing method of the application.
Specifically, the operator processing device acquires topology information of a computation graph, the topology information records a plurality of operators included in the computation graph, graph input data, graph output data, graph intermediate data and dependency relations of the operators, and then constructs an operator execution queue according to the topology information of the computation graph. Each element in the operator execution queue corresponds to one operator in the computation graph, and input data and output data of the operator, wherein the input data of the operator is at least one of graph input data or graph intermediate data, the output data of the operator is at least one of graph intermediate data or graph output data, and the execution sequence of a plurality of operators in the operator execution queue is determined according to the dependency relationship of the operators.
The method allows users to build the graph from operators in a user-defined manner and constructs an operator execution queue based on the topology information of the computational graph, so that the execution graph is built in the form of an operator execution queue. The method can therefore be flexibly applied to any scenario requiring batch operator scheduling, including lightweight batch operator scheduling scenarios such as acceleration library construction, and meets service requirements.
In some possible implementations, the operator processing apparatus may number the graph input data, the graph output data, and the graph intermediate data. Each element in the operator execution queue then records the number of one operator in the computational graph together with the numbers of the input data and the output data of that operator. This binds the tensor numbers to the operator; the tensor numbers serve as node information of the operator execution queue and can be used as tensor indices during graph execution when the operator is subsequently executed.
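As an illustration only, the following C++ sketch shows one possible form of such a queue element; the type and field names (QueueElement, opId, inputIds, outputIds) are assumptions made for this example and do not refer to any existing interface.

```cpp
#include <vector>

// One element of the operator execution queue: the operator's number plus the
// numbers of the tensors it consumes and produces. The tensor numbers later
// serve as indices into the graph's tensor table during graph execution.
struct QueueElement {
    int opId;                    // number of the operator in the computational graph
    std::vector<int> inputIds;   // numbers of graph input / graph intermediate tensors
    std::vector<int> outputIds;  // numbers of graph intermediate / graph output tensors
};
```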
In some possible implementations, the operator processing apparatus may determine the size of the graph workspace (which may also be referred to as a context in some cases) according to the dimension information of the graph input data and the correspondence between the operator corresponding to each element in the operator execution queue and the input data and output data of that operator. The operator processing apparatus may then request memory as the graph workspace according to the size of the graph workspace. The operator processing apparatus may request device memory of exactly the corresponding size as the graph workspace, or request device memory slightly larger than the corresponding size, so as to avoid the graph workspace being insufficient.
With this method, before the operator sequence is executed, the size of the graph workspace is derived from the input dimensions and from the correspondence between operators and their input and output data in the operator execution queue, and memory of that size is requested once as the graph workspace. Memory does not need to be requested repeatedly, which reduces the number of memory allocation requests and improves overall operation efficiency.
In some possible implementations, the lifecycle of a piece of graph intermediate data (e.g., a graph intermediate tensor) ends as soon as all operators that take it as input have finished executing. In typical application scenarios, graph intermediate data is large, and memory utilization can be improved by reusing device-side memory as much as possible without resorting to memory migration. Based on this, the operator processing apparatus may acquire the dimension information of the graph intermediate data and the dependency information of the graph intermediate data, where the dependency information indicates the operators that depend on the graph intermediate data, and determine the offset address of the graph intermediate data in the graph workspace and the size of the graph intermediate data according to the dimension information and the dependency information of the graph intermediate data.
With this method, the offset address of the graph intermediate data in the graph workspace and the size of the graph intermediate data are determined by combining the dimension information and the dependency information of the graph intermediate data, so that memory can be reused across different pieces of graph intermediate data, improving memory utilization.
In some possible implementations, the graph intermediate data includes first graph intermediate data and second graph intermediate data. When the operators relying on the first graph intermediate data are executed before the operator that outputs the second graph intermediate data, the offset address of the first graph intermediate data is less than or equal to the offset address of the second graph intermediate data.
Because the lifecycles of the first graph intermediate data and the second graph intermediate data do not overlap, the second graph intermediate data can reuse the memory space of the first graph intermediate data, improving memory utilization.
In some possible implementations, when the dimension information of the second graph intermediate data is less than or equal to the dimension information of the first graph intermediate data, the second graph intermediate data may directly reuse the memory space of the first graph intermediate data. When the dimension information of the second graph intermediate data is greater than the dimension information of the first graph intermediate data, the offset address of third graph intermediate data placed after the first graph intermediate data may be adjusted, for example increased, so as to reserve additional memory space for the second graph intermediate data that is larger than the first graph intermediate data. Reuse of the memory space of the first graph intermediate data is thus still achieved, improving memory utilization.
In some possible implementations, the slicing parameters (tiling parameters, TilingData) of an operator generally need to be loaded from the host side into a designated device memory space before the operator is executed, and are no longer used after the operator finishes executing. To reduce the number of loading operations, one-time full loading is supported. Specifically, the operator processing apparatus may determine the offset addresses of the slicing parameters of the plurality of operators in the graph workspace and the sizes of the slicing parameters of the plurality of operators, and load the slicing parameters of the plurality of operators into the graph workspace according to those offset addresses and sizes. The cost of loading the slicing parameters can thus be reduced, and resource utilization (such as communication resource utilization) can be improved.
In some possible implementations, the graph workspace is further provided with temporary memory (scratch memory). The operator processing apparatus may determine the offset address of the temporary memory within the graph workspace and its size according to the amount of temporary data generated by the execution of the plurality of operators. For example, the operator processing apparatus may take the maximum amount of temporary data generated by the execution of the plurality of operators as the size of the temporary memory. Reuse of the temporary memory can thus be achieved.
In some possible implementations, the lifecycle of the scratch memory (or temporary data) of an operator lasts from the start of execution of that operator to its end, and on that basis the scratch memories of different operators can be reused. Specifically, the temporary memories of multiple operators may be set to have the same offset address. In this way, the temporary memories of different operators share the same memory space, improving memory utilization.
In some possible implementations, the operator processing apparatus may execute different computational graphs asynchronously. Specifically, while the device side (such as an NPU on the device side) is still computing, the computing resources of the next computational graph can be prepared in advance, which reduces waiting time on the device side and increases operator throughput.
In some possible implementations, the computation graph includes a first computation graph and a second computation graph, the first computation graph includes the same operators as the second computation graph, the dependency relationship of the operators in the first computation graph is the same as the dependency relationship of the operators in the second computation graph, and the dimension information of the data in the first computation graph is the same as the dimension information of the data in the second computation graph. Accordingly, the operator processing apparatus may adjust the address of the graph input data to the address of the graph input data of the second computation graph when executing the first computation graph.
In this way, after the execution of the first computation graph ends, the operator processing device can directly reuse the graph workspace of the first computation graph to execute the second computation graph. With this method, different computation graphs are executed asynchronously, which increases the operator throughput on the device side and improves the overall operation speed.
In some possible implementations, at least one operator of the plurality of operators is a compiled operator, such as a binary operator. A binary operator is a computing unit in binary-coded form that can be executed on the device side without compilation. Constructing the computational graph with at least one compiled operator reduces compilation time and achieves lightweight compilation; when the computational graph uses only compiled operators, compilation time can even be eliminated, making operator scheduling lightweight.
In a second aspect, the present application provides an operator processing apparatus, comprising:
the acquisition module is used for acquiring topology information of a calculation graph, wherein the topology information records a plurality of operators included in the calculation graph, graph input data, graph output data, graph intermediate data and dependency relations of the operators;
The construction module is used for constructing an operator execution queue according to the topology information of the computation graph, each element in the operator execution queue corresponds to one operator in the computation graph and the input data and output data of the operator, the input data of the operator is at least one of the graph input data or the graph intermediate data, the output data of the operator is at least one of the graph intermediate data or the graph output data, and the execution sequence of the plurality of operators in the operator execution queue is determined according to the dependency relationship of the plurality of operators.
In some possible implementations, the construction module is further configured to:
numbering the graph input data, the graph output data and the graph intermediate data;
Each element in the operator execution queue records the number of one operator in the computation graph, the number of the input data of the operator and the number of the output data of the operator.
In some possible implementations, the apparatus further includes:
The memory planning module is used for determining the size of the graph working space according to the dimension information of the graph input data and the corresponding relation between the operator corresponding to each element in the operator execution queue and the input data and the output data of the operator, and for applying for memory as the graph working space according to the size of the graph working space.
In some possible implementations, the memory planning module is further configured to:
Acquiring dimension information of the graph intermediate data and dependency information of the graph intermediate data, wherein the dependency information is used for indicating an operator depending on the graph intermediate data, and determining an offset address of the graph intermediate data in a graph working space and the size of the graph intermediate data according to the dimension information of the graph intermediate data and the dependency information of the graph intermediate data.
In some possible implementations, the graph intermediate data includes first graph intermediate data and second graph intermediate data, and when the operators that depend on the first graph intermediate data are executed before the operator that outputs the second graph intermediate data, the offset address of the first graph intermediate data is less than or equal to the offset address of the second graph intermediate data.
In some possible implementations, the memory planning module is further configured to:
Determining offset addresses of the slicing parameters of the operators in the graph working space and sizes of the slicing parameters of the operators;
And loading the slicing parameters of the operators into the graph working space according to the offset addresses of the slicing parameters of the operators in the graph working space and the sizes of the slicing parameters of the operators.
In some possible implementations, the graph workspace is further provided with temporary memory;
The memory planning module is further configured to:
And determining the offset address and the size of the temporary memory within the graph working space according to the amount of temporary data generated by the execution of the plurality of operators.
In some possible implementations, the temporary memories of the plurality of operators have the same offset address.
In some possible implementations, the computation graph includes a first computation graph and a second computation graph, the first computation graph includes the same operators as the second computation graph, the dependency relationship of the operators in the first computation graph is the same as the dependency relationship of the operators in the second computation graph, and the dimension information of the data in the first computation graph is the same as the dimension information of the data in the second computation graph;
The apparatus further comprises:
and the adjusting module is used for adjusting the address of the graph input data to be the address of the graph input data of the second calculation graph when the first calculation graph is executed.
In a third aspect, the present application provides an operator processing apparatus. The operator processing means comprises at least one processor and at least one memory. The at least one processor and the at least one memory are in communication with each other. The at least one processor is configured to execute instructions stored in the at least one memory to cause the operator processing apparatus to perform the operator processing method according to the first aspect or any implementation of the first aspect.
In a fourth aspect, the present application provides a computer readable storage medium having stored therein instructions for instructing a computing device or a cluster of computing devices to execute the operator processing method according to any implementation manner of the first aspect or the first aspect.
In a fifth aspect, the present application provides a computer program product comprising instructions which, when run on a computing device or cluster of computing devices, cause the computing device or cluster of computing devices to perform the operator processing method of any implementation of the first aspect or the first aspect described above.
Further combinations of the present application may be made to provide further implementations based on the implementations provided in the above aspects.
Detailed Description
The terms "first", "second" in embodiments of the application are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature.
Some technical terms related to the embodiments of the present application will be described first.
An operator (OP) refers to an operation performed on data. Depending on the type of operation, operators can be classified into computational operators (e.g., the convolution operator Conv2D and the max-pooling operator MaxPool) and communication operators (e.g., the collective operator AllGather and the broadcast operator Broadcast). An operator (e.g., a heterogeneous operator) may include host logic and device logic, where the host includes a CPU and host memory, and the device includes an external processing unit (e.g., an NPU or GPU) and device memory. The host logic may include logic to generate an execution graph (which may also be referred to as a run graph), which may be a sequence of kernels that can be executed directly on the device, and the device logic may include logic to run the execution graph.
Operator scheduling refers to the process of controlling operator execution, and includes actions such as data preparation, task issuing, and obtaining computation results. Operator scheduling involves interaction between the host side and the device side. As shown in fig. 1, assuming the input data of the operator is already in device memory, the main scheduling time consists of the host side deriving the output dimension information (which may also be called the output Shape) from the input dimension (Shape) information, requesting device-side memory, computing the operator's slicing parameters (TilingData), carrying the TilingData to the device, issuing the computation task to the device, and the device side performing the computation task.
Batch operator scheduling is required in many business scenarios; for example, when training a neural network, or when performing inference based on a trained neural network, multiple operators typically need to be scheduled in batches. An operator scheduling method based on artificial intelligence (AI) compilation supports batch operator scheduling. Specifically, a computational graph comprising a plurality of operators is compiled into an intermediate representation (IR) by a tensor virtual machine (Tensor Virtual Machine, TVM) or accelerated linear algebra (Accelerated Linear Algebra, XLA), and an execution graph can then be generated from the IR, thereby implementing batch operator invocation.
However, the above method requires a complex compilation optimization process, in which graph fusion is complex and cumbersome, and requires offline generation of an execution graph. It is a heavyweight operator scheduling technique and is not suitable for batch operator invocation scenarios that require light weight, for example when implementing an acceleration library application programming interface (API).
In view of this, the present application provides an operator processing method. The method may be performed by an operator processing apparatus. The operator processing apparatus may be a software apparatus, for example a stand-alone software package or a cloud service that provides an interface (e.g. a graph composition interface) through which a user can build a computational graph from operators in a user-defined manner and have its topology represented as a queue. The software apparatus may be deployed in a computing device cluster (e.g. the computing device cluster that executes the operators, or a third-party computing device cluster), which executes the program code of the software apparatus to perform the operator processing method of the application. The operator processing apparatus may also be a hardware apparatus, for example a computing device or a computing device cluster with an operator processing function. When the operator processing apparatus runs, it performs the operator processing method of the application.
Specifically, the operator processing device acquires topology information of a computation graph, the topology information records a plurality of operators included in the computation graph, graph input data, graph output data, graph intermediate data and dependency relations of the operators, and then constructs an operator execution queue according to the topology information of the computation graph. Each element in the operator execution queue corresponds to one operator in the computation graph, and input data and output data of the operator, wherein the input data of the operator is at least one of graph input data or graph intermediate data, the output data of the operator is at least one of graph intermediate data or graph output data, and the execution sequence of a plurality of operators in the operator execution queue is determined according to the dependency relationship of the operators.
The method allows users to build the graph from operators in a user-defined manner and constructs an operator execution queue based on the topology information of the computational graph, so that the execution graph is built in the form of an operator execution queue. The method can therefore be flexibly applied to any scenario requiring batch operator scheduling, including lightweight batch operator scheduling scenarios such as acceleration library construction, and meets service requirements.
The operator processing method can be applied to scenarios in which batch operator invocation is required to complete a computation task. Typical scenarios include the construction of various acceleration library APIs or of large-model inference acceleration libraries. Acceleration library API construction includes deep neural network (Deep Neural Network, DNN) acceleration library API construction, and a large-model inference acceleration library may cover various large language models (large language model, LLM), particularly conversational language models.
The operator processing apparatus for performing the operator processing method may be implemented by a computing device or a computing device cluster. For example, a computing device or computing device cluster may perform the operator processing method of the application when building a large-model inference acceleration library. The computing device may be a server or a terminal. The server may be a cloud server in a cloud environment or an owned physical server, and the terminal may include, but is not limited to, a desktop computer, a notebook computer, or a smartphone. The computing device may adopt a single-machine single-card architecture or a single-machine multi-card architecture, and the computing device cluster may adopt a multi-machine multi-card architecture. In "x-machine x-card", the "machine" denotes a host, and the "card" denotes a device such as an accelerator card. The accelerator card may include, but is not limited to, a GPU or an NPU. The following description uses a server as an example of the computing device.
Referring to the hardware architecture of a server shown in fig. 2, the server 20 includes a host 22 and at least one device 24. Wherein the host 22 and the at least one device 24 are connected.
Host 22 includes a processor, which may be a central processing unit (central processing unit, CPU), and memory, which may be a dual in-line memory module (Dual In-line Memory Module, DIMM). The DIMM may specifically be of the double data rate (double data rate, DDR) type; for example, the memory may be a DDR4 DIMM. In the example of FIG. 2, host 22 includes 4 CPUs and 4 DDR4 DIMM groups; each CPU is connected to 1 DDR4 DIMM group, and each DDR4 DIMM group includes 8 DDR4 DIMMs. The multiple CPUs of host 22 may be interconnected to form a hydra mesh.
Optionally, the host 22 further includes an interface, such as one or more of a serial advanced technology attachment (Serial Advanced Technology Attachment, SATA) interface, a non-volatile memory express (Non-Volatile Memory Express, NVMe) interface, and a gigabit Ethernet (GE) interface. The host 22 may also include storage. The storage may include SATA storage or NVMe storage, such as a SATA mechanical hard disk or an NVMe solid state drive (solid state drive, SSD).
The device 24 includes an accelerator card, such as an NPU or a GPU; FIG. 2 illustrates a device side that includes 8 NPUs. In other possible implementations of the embodiments of the present application, the device side may also include more accelerator cards, for example a greater number of accelerator cards, or more types of accelerator card.
When the device side includes multiple accelerator cards, the multiple accelerator cards may execute multiple operators concurrently. When the plurality of operators includes a communication operator, the multiple accelerator cards can exchange data directly through the communication operator while executing the plurality of operators concurrently. When the device side includes one accelerator card, that accelerator card can execute the plurality of operators on its own.
In order to make the technical scheme of the application clearer and easier to understand, the operator processing method of the application is described below with reference to the accompanying drawings.
Referring to a flow chart of an operator processing method shown in fig. 3, the method may be performed by an operator processing apparatus. The operator processing apparatus may be a software apparatus, for example, a software apparatus deployed on a host side, or a software apparatus deployed on a computing device cluster, for constructing a plurality of operators requiring batch scheduling into an operator execution queue for device side execution. When the operator processing device is a software device deployed in a computing device cluster, a computing device executing an operator may call an interface of the software device to obtain an operator execution queue, and execute a plurality of operators on a device side based on the operator execution queue. The operator handling means may also be hardware means, e.g. a host on the computing device side, or a cluster of computing devices. The method comprises the following steps:
S302, the operator processing device acquires topology information of the computation graph.
A computation graph is a topology graph that describes a computation process. The computational graph generally includes the operators involved in the computation process, the input data of the operators, and the output data of the operators. For the whole computational graph, the topology information of the computational graph records the plurality of operators included in the computational graph, the graph input data, the graph output data, the graph intermediate data, and the dependency relationships of the plurality of operators. The graph input data may be input data of the plurality of operators included in the computation graph, and the graph output data may be output data of the plurality of operators included in the computation graph. The graph intermediate data may be output data of some (or one) of the plurality of operators and at the same time input data of other (or another) operators included in the computational graph. The input data and output data of an operator, or the graph input data, graph output data, and graph intermediate data, may be tensors (tensor). A tensor may be an array of arbitrary integer dimensionality, together with various attributes describing this array, including but not limited to dimension (shape) information, data type, device type, data format, and memory address.
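For illustration only, the C++ sketch below shows one possible in-memory representation of such topology information; all type and field names are assumptions made for this example and are not an actual library interface.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Dimension (shape) information and other attributes describing a tensor.
struct TensorDesc {
    std::vector<int64_t> shape;   // dimension (shape) information
    std::string dataType;         // e.g. "float16"
    std::string format;           // data format, e.g. "ND"
    void* deviceAddr = nullptr;   // memory address on the device side
};

// One operator node recorded in the topology information.
struct OpNode {
    int opId;                        // operator number
    std::vector<int> inputTensors;   // numbers of the tensors the operator consumes
    std::vector<int> outputTensors;  // numbers of the tensors the operator produces
};

// Topology information of the computational graph: all tensors indexed by
// number, the numbers of graph inputs/outputs/intermediates, and the operator
// nodes; dependencies between operators are implied by shared tensor numbers.
struct GraphTopology {
    std::vector<TensorDesc> tensors;
    std::vector<int> graphInputIds;
    std::vector<int> graphOutputIds;
    std::vector<int> graphIntermediateIds;
    std::vector<OpNode> ops;
};
```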
The computational graph may be built in a user-defined manner. Specifically, the operator processing apparatus may provide a composition interface, which may be a graphical user interface (graphical user interface, GUI), a command user interface (command user interface, CUI), or an API, through which a user can customize a computational graph. For ease of description, the composition interface is illustrated as a GUI.
Referring to the schematic diagram of a composition interface shown in fig. 4, the composition interface provides multiple operators for constructing a computational graph. A user can add the multiple operators to be scheduled in batches to the composition area, for example by dragging; the user can further view the input data and output data of the multiple operators, determine the dependency relationships of the multiple operators based on that input data and output data, and connect the multiple operators based on the dependency relationships, thereby building the computational graph in a user-defined manner. Accordingly, the operator processing apparatus can acquire the topology information of the computational graph.
S304, the operator processing device constructs an operator execution queue according to the topology information of the computation graph.
An operator execution queue is a queue for executing operators. Each element in the operator execution queue corresponds to an operator in the computational graph and input data and output data of the operator. The input data of the operators are at least one of graph input data or graph intermediate data, the output data of the operators are at least one of graph intermediate data or graph output data, and the execution sequence of a plurality of operators in an operator execution queue is determined according to the dependency relationship of the plurality of operators.
Specifically, the operator processing apparatus may sort the plurality of operators according to their dependencies, and then sequentially add the plurality of operators to the operator execution queue. Wherein each operator may correspond to an element in an operator execution queue. The elements in the operator execution queue also record the input data and output data of the operator.
In some possible implementations, the operator execution queue may be a first-in first-out (First Input First Output, FIFO) queue. Accordingly, the operator processing device adds operators that are executed earlier to the FIFO queue first, which ensures that the execution order of the operators satisfies the dependency relationships among the operators.
The operator processing device also supports numbering the graph input data, the graph output data, and the graph intermediate data. Accordingly, each element in the operator execution queue records the number of one operator in the computational graph and the numbers of the input data and output data of that operator, thereby recording the operator in the computational graph together with its input data and output data.
Further, the operators in the operator execution queue may include at least one compiled operator. For example, the operators in the operator execution queue may be binary operators. A binary operator is a computing unit in binary-coded form that can be executed on the device side without compilation. When the operators in the operator execution queue are binary operators, the operator sequence may also be referred to as a kernel sequence. This avoids the problem of an overly heavy AI compilation flow, and operator scheduling can be achieved with lightweight compilation or even without compilation.
For ease of understanding, the following description is provided in connection with an example.
As shown in fig. 5, the computational graph includes binary operators OP1 through OP5. The computation graph also records a graph input tensor (InTensor), a graph output tensor (OutTensor), and a graph intermediate tensor (InterTensor), wherein the graph input tensor is the input tensor of OP1, OP2 and OP5, the graph output tensor is the output tensor of OP5, the graph intermediate tensor comprises the output tensors of OP1 and OP2 (the output tensor of OP1 is the input tensor of OP3 at the same time, the output tensor of OP2 is the input tensor of OP3 and OP4 at the same time), and the graph intermediate tensor further comprises the output tensor of OP3 (the output tensor of OP3 is the input tensor of OP4 at the same time) and the output tensor of OP4 (the output tensor of OP4 is the input tensor of OP5 at the same time).
The graph input tensors, the graph output tensor, and the graph intermediate tensors can each be numbered. For example, the graph input tensors may be numbered 1 to 3, the graph output tensor numbered 4, and the graph intermediate tensors numbered 5 to 8. The operator processing device determines, based on the dependency relationships of the operators in the topology information, that OP1 and OP2 are executed before OP3, OP3 before OP4, and OP4 before OP5, so OP1 to OP5 can be added to the operator execution queue in that order, and the tensor numbers of each operator's input and output tensors are bound to the operator as FIFO queue node information. The tensor numbers of an operator's input and output tensors can be used as tensor indices during graph execution.
Note that, in the above example, OP2 may also be added to the operator execution queue before OP1; in other words, OP2 may also be executed before OP1. In addition, the operator execution queue may be another type of queue, which is not limited in this embodiment. When operators are executed concurrently on multiple devices, the operators in the operator execution queue may include a communication operator, which is used to exchange data between different devices (e.g., different NPUs).
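To make the ordering step concrete, the following sketch builds a FIFO operator execution queue for the example of fig. 5 using a standard Kahn-style topological sort over the tensor dependencies. The assignment of graph input tensors 1, 2 and 3 to OP1, OP2 and OP5, as well as all names, are assumptions for illustration only.

```cpp
#include <iostream>
#include <map>
#include <queue>
#include <vector>

struct QueueElement {
    int opId;
    std::vector<int> inputIds;
    std::vector<int> outputIds;
};

int main() {
    // Operators of fig. 5 with their tensor numbers: graph inputs 1-3, graph
    // output 4, graph intermediates 5-8 (the mapping of the three graph
    // inputs to OP1, OP2 and OP5 is assumed for this example).
    std::vector<QueueElement> ops = {
        {1, {1}, {5}},      // OP1: graph input 1 -> intermediate 5
        {2, {2}, {6}},      // OP2: graph input 2 -> intermediate 6
        {3, {5, 6}, {7}},   // OP3: intermediates 5, 6 -> intermediate 7
        {4, {6, 7}, {8}},   // OP4: intermediates 6, 7 -> intermediate 8
        {5, {3, 8}, {4}},   // OP5: graph input 3 + intermediate 8 -> graph output 4
    };

    // Kahn-style topological sort: an operator becomes ready once every tensor
    // it consumes is either a graph input or has already been produced.
    std::map<int, int> producerOf;                  // tensor number -> producing op index
    for (size_t i = 0; i < ops.size(); ++i)
        for (int t : ops[i].outputIds) producerOf[t] = static_cast<int>(i);

    std::vector<int> pending(ops.size(), 0);        // unresolved input tensors per op
    std::map<int, std::vector<int>> consumers;      // tensor number -> consuming op indices
    for (size_t i = 0; i < ops.size(); ++i)
        for (int t : ops[i].inputIds)
            if (producerOf.count(t)) {              // tensor is a graph intermediate
                ++pending[i];
                consumers[t].push_back(static_cast<int>(i));
            }

    std::queue<int> ready;                          // ops whose inputs are all available
    for (size_t i = 0; i < ops.size(); ++i)
        if (pending[i] == 0) ready.push(static_cast<int>(i));

    std::vector<QueueElement> executionQueue;       // the resulting FIFO operator execution queue
    while (!ready.empty()) {
        int i = ready.front();
        ready.pop();
        executionQueue.push_back(ops[i]);
        for (int t : ops[i].outputIds)
            for (int j : consumers[t])
                if (--pending[j] == 0) ready.push(j);
    }

    for (const auto& e : executionQueue) std::cout << "OP" << e.opId << " ";
    std::cout << "\n";   // prints: OP1 OP2 OP3 OP4 OP5 (OP2 could equally have been queued before OP1)
}
```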
Based on the above description, the operator processing method of the application supports user-defined graph composition and can construct an operator execution queue from the topology information of the constructed computational graph, thereby building the execution graph in the form of an operator execution queue. It can be flexibly applied to any scenario requiring batch operator scheduling and meets service requirements.
Furthermore, the application also supports planning, in advance, the memory layout of the computing resources internal to the computational graph according to the input data and the operator execution queue, and reusing memory according to lifecycles within the computational graph, which improves memory utilization and greatly reduces overall memory consumption. In addition, the method supports requesting a large memory block in a single application, which reduces the number of memory requests and improves overall operation efficiency. The resource planning process is described in detail below.
Executing an operator typically requires input data (e.g., input tensors, InTensors), output data (e.g., output tensors, OutTensors), slicing parameters (TilingData), and scratch memory. The slicing parameters are parameters for slicing the input data and may include the total parameter length (totalLength), the number of tiles (for example, the number of times data is processed in a single-core loop, denoted tileNum), and computation scalars (for example, a LeakyRelu scalar). The scratch memory is used to temporarily store temporary data generated by operator execution. A user of a graph usually cares only about the graph input tensors, the graph output tensors, and the graph workspace (GraphWorkspace), so the graph intermediate tensors, the TilingData, and the scratch memory need to be placed uniformly in the graph workspace, which is passed in from outside the graph.
The graph workspace refers to the workspace used for graph execution; the workspace may also be referred to as a context or session. Execution of a computational graph takes place in the context of a Session, which provides an environment for operator execution and tensor evaluation. A Session may hold resources such as Variables or Queues. When the Session is no longer needed, a close() call (or the context manager, automatically) can close it and release its resources. If there are multiple computational graphs to be executed, different Sessions may be created to load each computational graph, and each computational graph may be loaded in multiple Sessions for computation; the Sessions are independent of each other.
Based on the above, the operator processing apparatus may determine the size of the graph workspace according to the dimension information of the graph input data and the correspondence between the operator corresponding to each element in the operator execution queue and the input data and output data of that operator, and request memory as the graph workspace according to that size.
The operator processing apparatus can perform chained derivation from the shape information of the graph input data (such as the graph input tensors), combined with the correspondence between each operator and its input and output data, to derive the shape information of the graph intermediate tensors and thus the memory size needed to store them. In addition, the operator processing apparatus can derive the amount of temporary data generated during execution and thereby estimate the size of the scratch memory. The operator processing apparatus can then obtain the size of the whole graph workspace from the memory size for storing the graph intermediate tensors, the size of the scratch memory, and the memory size for storing the slicing parameters. It should be noted that the graph workspace may also be divided into an operator workspace (OpWorkspace) and a graph intermediate data area. The operator workspace includes a slicing parameter area, used to store the slicing parameters, and the scratch memory, used to store temporary data generated during operator execution.
As illustrated in fig. 6, the operator processing apparatus may chain-derive the shape information of the graph intermediate tensors from the shape information of the graph input tensors (for example, tensor IDs 1, 2, and 3), so as to obtain the size of the memory required to store the graph intermediate tensors. In addition, the operator processing apparatus can also derive the memory size required by each operator's slicing parameters and the size of the scratch memory for temporarily storing the temporary data generated during each operator's execution. The operator processing apparatus can estimate the size of the graph workspace from the memory required to store the graph intermediate tensors, the memory required to store the slicing parameters, and the size of the scratch memory, and request a memory block of the corresponding size on the device side as the graph workspace. It should be noted that the operator processing apparatus may request a memory block slightly larger than this size as the graph workspace.
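As a rough illustration of this chained derivation, the sketch below estimates a graph workspace size under strong simplifying assumptions (each operator's outputs take the shape of its first input, float32 data, 512-byte alignment); the function and parameter names are hypothetical, and real shape inference is per-operator.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <map>
#include <numeric>
#include <set>
#include <vector>

struct QueueElement { int opId; std::vector<int> inputIds; std::vector<int> outputIds; };  // as sketched earlier

// Simplifying assumption: every operator's outputs take the shape of its first
// input (as for element-wise operators); real shape inference is per-operator.
static int64_t Bytes(const std::vector<int64_t>& shape) {
    int64_t n = std::accumulate(shape.begin(), shape.end(), int64_t{1}, std::multiplies<int64_t>());
    return n * 4;                                                     // float32 assumed
}

static int64_t Align(int64_t x) { return (x + 511) / 512 * 512; }     // assumed 512-byte alignment

// Derive the graph workspace size as the sum of the intermediate-tensor area,
// the TilingData area and the (reused) scratch memory, as described above.
int64_t DeriveWorkspaceSize(const std::vector<QueueElement>& queue,
                            std::map<int, std::vector<int64_t>> shapes,   // pre-filled for graph inputs
                            const std::set<int>& graphOutputIds,          // graph outputs live outside
                            const std::map<int, int64_t>& tilingBytes,    // per-operator TilingData size
                            const std::map<int, int64_t>& scratchBytes) { // per-operator scratch size
    int64_t interBytes = 0, tilingTotal = 0, scratchMax = 0;
    for (const auto& e : queue) {
        const auto inShape = shapes.at(e.inputIds.front());           // chained shape derivation
        for (int out : e.outputIds) {
            shapes[out] = inShape;                                    // propagate shape to this output
            if (!graphOutputIds.count(out))                           // only intermediates need workspace
                interBytes += Align(Bytes(inShape));
        }
        tilingTotal += Align(tilingBytes.at(e.opId));                 // one-time full load of TilingData
        scratchMax = std::max(scratchMax, scratchBytes.at(e.opId));   // scratch memory is reused: take max
    }
    return interBytes + tilingTotal + Align(scratchMax);
}
```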
The following describes the memory planning for the slicing parameters, the scratch memory, and the graph intermediate data.
The slicing parameters (TilingData) of an operator usually need to be loaded from the host side into a designated device memory space before the operator is executed and are no longer used after the operator finishes executing. To reduce the number of loading operations, the application supports one-time full loading; accordingly, the lifecycle of the slicing parameters can be treated as equal to the lifecycle of the graph. For the scratch memory of an operator, the lifecycle may be the time from the start to the end of execution of the corresponding operator. On that basis, the scratch memories of multiple operators can be reused.
In a specific implementation, the operator processing apparatus may determine the offset addresses (denoted offset) of the slicing parameters of the plurality of operators in the graph workspace and the sizes (size) of those slicing parameters, and load the slicing parameters of the plurality of operators into the graph workspace according to those offset addresses and sizes. For the scratch memory, the operator processing apparatus can determine the offset address of the scratch memory within the graph workspace and its size according to the amount of temporary data generated by the execution of the plurality of operators; the size of the scratch memory may be the maximum amount of temporary data. When an operator is executed, the offset address and size of the scratch memory within the graph workspace are passed in, so that the temporary data generated by the operator can be stored at the corresponding location or area. It should be noted that the scratch memories of multiple operators may have the same offset address, so that memory reuse is achieved and memory overhead is reduced.
This is illustrated in fig. 7. The TilingData of an operator occupies little device-side memory space; to reduce the number of TilingData loads, the TilingData of all operators in the operator execution queue can be loaded all at once. The TilingData of all operators can be placed at the start address of the GraphWorkspace; in other words, the offset of the TilingData may be equal to that of the GraphWorkspace. The size of the TilingData region may be equal to the total size of the TilingData of all operators in the operator execution queue. The scratch memory of each operator is not used after that operator finishes executing, so a subsequently executed operator can reuse the scratch memory of the previous operator. Therefore, the offset of the scratch memory of all operators within the GraphWorkspace is the same, but their sizes differ, and only the largest size needs to be allocated.
The operator processing apparatus may scan the operator execution queue and write the offset and size of the TilingData and of the scratch memory of each operator into the running information (denoted runInfo) of that operator, for use when the operator is subsequently executed.
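The following sketch reflects one possible planner that matches this description: it packs the TilingData of all operators at the start of the graph workspace, gives every operator's scratch memory the same offset right after the TilingData region, and writes the results into per-operator runInfo. Structure and function names are assumptions for illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

struct QueueElement { int opId; std::vector<int> inputIds; std::vector<int> outputIds; };  // as sketched earlier

// Per-operator running information (runInfo) filled in by the memory planner.
struct RunInfo {
    int64_t tilingOffset = 0, tilingSize = 0;    // where this operator's TilingData lives
    int64_t scratchOffset = 0, scratchSize = 0;  // where its scratch memory lives
};

static int64_t Align(int64_t x) { return (x + 511) / 512 * 512; }     // assumed 512-byte alignment

// Scan the operator execution queue and assign offsets inside the graph
// workspace: TilingData blocks are packed back to back starting at offset 0,
// and the scratch memories of all operators share one region placed right
// after them, sized by the largest requirement.
std::map<int, RunInfo> PlanOperatorWorkspace(const std::vector<QueueElement>& queue,
                                             const std::map<int, int64_t>& tilingBytes,
                                             const std::map<int, int64_t>& scratchBytes,
                                             int64_t& opWorkspaceSize) {
    std::map<int, RunInfo> runInfo;
    int64_t cursor = 0, scratchMax = 0;
    for (const auto& e : queue) {                        // TilingData: one-time full load, distinct offsets
        RunInfo& ri = runInfo[e.opId];
        ri.tilingOffset = cursor;
        ri.tilingSize   = tilingBytes.at(e.opId);
        cursor += Align(ri.tilingSize);
        scratchMax = std::max(scratchMax, scratchBytes.at(e.opId));
    }
    for (const auto& e : queue) {                        // scratch memory: same offset for every operator
        RunInfo& ri = runInfo[e.opId];
        ri.scratchOffset = cursor;                       // reused across operators
        ri.scratchSize   = scratchBytes.at(e.opId);      // sizes differ; the region takes the maximum
    }
    opWorkspaceSize = cursor + Align(scratchMax);        // TilingData area + shared scratch region
    return runInfo;
}
```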
For a piece of graph intermediate data (e.g., a graph intermediate tensor), its lifecycle ends as soon as all operators that take it as input have finished executing. In typical application scenarios, graph intermediate data is large, and memory utilization can be improved by reusing device-side memory as much as possible without resorting to memory migration.
Specifically, the operator processing device may acquire dimension information of the graph intermediate data and dependency information of the graph intermediate data, where the dependency information is used to indicate an operator that depends on the graph intermediate data, and then the operator processing device may determine an offset address of the graph intermediate data in the graph workspace and a size of the graph intermediate data according to the dimension information of the graph intermediate data and the dependency information of the graph intermediate data.
The operator processing apparatus can determine the lifecycle of each piece of graph intermediate data according to its dependency information, and can then determine the arrangement of the graph intermediate data in memory based on the dimension information and lifecycle of each piece of graph intermediate data, thereby obtaining the offset address of the graph intermediate data in the graph workspace and its size. For example, the graph intermediate data includes first graph intermediate data and second graph intermediate data; when the operators relying on the first graph intermediate data are executed before the operator that outputs the second graph intermediate data, the offset address of the first graph intermediate data is less than or equal to that of the second graph intermediate data, i.e., the second graph intermediate data may reuse the memory space of the first graph intermediate data.
This is illustrated in fig. 8. OP1 is executed and outputs a graph intermediate tensor (i.e., tensor 5); tensor 5 is relied on by the subsequent OP3, so its memory space is not released. OP2 is executed and outputs a graph intermediate tensor (i.e., tensor 6), which is relied on by the subsequent OP3, so the memory space of tensor 6 is not released either. At this point, neither the memory space of tensor 5 nor that of tensor 6 is released. OP3 depends on tensor 5 and tensor 6 for execution and outputs a graph intermediate tensor (i.e., tensor 7) after execution. At this point, tensor 5 is no longer relied on by any subsequent operator, and its memory space can be released. OP4 depends on tensor 6 and tensor 7 for execution and outputs a graph intermediate tensor (i.e., tensor 8) after execution. The memory space used by tensor 8 is smaller than that used by tensor 5, so tensor 8 can reuse the memory space of tensor 5. Based on this, the operator processing apparatus may configure the offset of tensor 8 to be equal to the offset of tensor 5. In other possible implementations of the embodiments of the present application, the operator processing apparatus may also configure the offset of tensor 8 to be larger (e.g., slightly larger) than the offset of tensor 5, that is, the offset of tensor 5 is smaller than the offset of tensor 8, so as to achieve memory reuse between tensor 5 and tensor 8. When OP5 finishes executing, the memory space of all graph intermediate tensors can be released.
When the memory used by tensor 8 is larger than the memory used by tensor 5 (for example, slightly larger), the memory space of tensor 5 cannot be reused and the area can only be extended to the right. To avoid this situation, the operator processing apparatus can merge adjacent free blocks (i.e., memory blocks whose tensors have been released) and, for each allocation, select the smallest block that satisfies the requirement, for example using a BestFit algorithm, so as to keep as many large blocks as possible and improve the memory reuse rate. In addition, the operator processing apparatus may also plan ahead based on global information, for example shifting the start addresses of tensor 6 and tensor 7 to the right in advance, so that after tensor 5 is later released, tensor 8 can be placed there by merging adjacent fragments.
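The sketch below shows a simple planner in the spirit of this description: an intermediate tensor's block is released once its last consumer has executed, adjacent free blocks are merged, and each new tensor takes the smallest free block that fits (a best-fit choice), growing the area to the right only when nothing fits. It illustrates the idea only and is not the actual planning algorithm.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// A free block inside the intermediate-tensor area of the graph workspace.
struct Block { int64_t offset, size; };

// Best-fit placement with merging of adjacent free blocks, applied while
// walking the operator execution queue: an intermediate tensor is freed as
// soon as its last consumer has executed, so later tensors can reuse its space.
class IntermediatePlanner {
public:
    // Returns the offset assigned to a tensor of `size` bytes.
    int64_t Alloc(int64_t size) {
        auto best = freeList_.end();
        for (auto it = freeList_.begin(); it != freeList_.end(); ++it)
            if (it->size >= size && (best == freeList_.end() || it->size < best->size))
                best = it;                                  // smallest free block that fits
        if (best != freeList_.end()) {
            int64_t off = best->offset;
            best->offset += size;
            best->size -= size;
            if (best->size == 0) freeList_.erase(best);
            return off;
        }
        int64_t off = highWater_;                           // otherwise grow the area to the right
        highWater_ += size;
        return off;
    }
    // Releases a tensor's block and merges it with adjacent free blocks.
    void Free(int64_t offset, int64_t size) {
        freeList_.push_back({offset, size});
        std::sort(freeList_.begin(), freeList_.end(),
                  [](const Block& a, const Block& b) { return a.offset < b.offset; });
        std::vector<Block> merged;
        for (const Block& b : freeList_) {
            if (!merged.empty() && merged.back().offset + merged.back().size == b.offset)
                merged.back().size += b.size;               // adjacent free blocks become one
            else
                merged.push_back(b);
        }
        freeList_ = std::move(merged);
    }
    int64_t AreaSize() const { return highWater_; }          // size of the intermediate-tensor area

private:
    std::vector<Block> freeList_;
    int64_t highWater_ = 0;
};
```

Walking the queue of fig. 5 with such a planner, tensor 5 is freed after OP3 finishes, and tensor 8, being no larger than tensor 5, is placed back at tensor 5's offset, matching the reuse shown in fig. 8.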
After the memory planning described above is completed, the operator execution queue can be executed (this process is also referred to as graph execution). As shown in fig. 9, executing the operators in the operator execution queue first requires preparing the running information runInfo of all operators; the main step is to update the data pointers of the graph intermediate tensors and the workspace pointers of the operators. Because the operator processing apparatus has already stored the offset of each resource block during memory planning, it only needs to add the offset to the base address of the externally supplied workspace memory when updating the pointers (data pointers, workspace pointers).
Further, referring to fig. 10, during the update the operator processing apparatus may determine, according to the region a tensor belongs to in the graph's tensor table, whether each operator's input and output tensors are graph intermediate tensors. Since the data pointers of the graph input and graph output tensors are real addresses and do not need to be updated, the operator processing apparatus performs the update operation only when it determines that an operator's input or output tensor is a graph intermediate tensor.
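A compact sketch of this pointer-update step follows: for each tensor referenced by an operator, its device address is rewritten only when the tensor is a graph intermediate tensor, by adding the planned offset to the externally supplied workspace base address. The names are illustrative assumptions.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <vector>

struct QueueElement { int opId; std::vector<int> inputIds; std::vector<int> outputIds; };  // as sketched earlier

// Planned placement of every graph intermediate tensor inside the workspace.
struct Placement { int64_t offset; int64_t size; };

// Before launching the operators, refresh the data pointers of graph
// intermediate tensors: the real (external) addresses of graph inputs and
// outputs are left untouched, intermediates get workspaceBase + planned offset.
void UpdateTensorPointers(const std::vector<QueueElement>& queue,
                          const std::set<int>& intermediateIds,
                          const std::map<int, Placement>& placements,
                          uint8_t* workspaceBase,
                          std::map<int, void*>& tensorAddr) {   // tensor number -> device address
    for (const auto& e : queue) {
        auto update = [&](int t) {
            if (intermediateIds.count(t))                        // only graph intermediates are updated
                tensorAddr[t] = workspaceBase + placements.at(t).offset;
        };
        for (int t : e.inputIds)  update(t);
        for (int t : e.outputIds) update(t);
    }
}
```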
In some possible implementations, the operator processing apparatus may support parallel execution of computational graphs. The operator processing apparatus may perform graph computation asynchronously and prepare the computing resources of the next computational graph in advance while the device side (such as an NPU on the device side) is still computing, which reduces waiting time on the device side and increases operator throughput. Referring to fig. 11, for different graphs, or for the same graph with different shapes, the operator processing apparatus may, while the host side is carrying the TilingData of computational graph 1 (graph 1) to the device side, asynchronously perform the resource planning of computational graph 2 (graph 2), such as the offsets and sizes of the computing resource blocks.
Further, the application also supports reusing the graph workspace in same-graph, same-shape scenarios. Specifically, the computation graphs include a first computation graph (e.g., graph 1) and a second computation graph (e.g., graph 2); the first computation graph includes the same operators as the second computation graph, the dependency relationships of the operators in the two graphs are the same, and the dimension information of the data in the two graphs is the same. When executing the first computation graph, the operator processing device may adjust the address of the graph input data to the address of the graph input data of the second computation graph. In this way, after execution of the first computation graph ends, the operator processing device can directly reuse the graph workspace of the first computation graph to execute the second computation graph. With this method, different computation graphs are executed asynchronously, which increases the operator throughput on the device side and improves the overall operation speed.
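To illustrate the same-graph, same-shape reuse described above, the sketch below only swaps the graph input (and output) addresses before re-issuing the already prepared operator execution queue, leaving the graph workspace and all planned offsets untouched; the interface shown is hypothetical.

```cpp
#include <map>

// Hypothetical prepared graph: an operator execution queue plus a planned
// graph workspace whose internal offsets are independent of the input data.
struct PreparedGraph {
    std::map<int, void*> tensorAddr;   // tensor number -> current device address
    // ... operator execution queue, workspace base address, runInfo, etc.
};

// Re-run a prepared graph on the inputs of another graph with the same
// operators, dependencies and shapes: only the graph input/output addresses
// are adjusted; the graph workspace and all offsets are reused as-is.
void RunWithNewInputs(PreparedGraph& g,
                      const std::map<int, void*>& newInputAddrs,     // graph input number -> address
                      const std::map<int, void*>& newOutputAddrs) {  // graph output number -> address
    for (const auto& [id, addr] : newInputAddrs)  g.tensorAddr[id] = addr;
    for (const auto& [id, addr] : newOutputAddrs) g.tensorAddr[id] = addr;
    // LaunchExecutionQueue(g);  // placeholder: issue the unchanged operator queue to the device
}
```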
Based on the operator processing method, the application provides an operator processing device. The operator handling means will be described below in terms of functional modularity.
Referring to a schematic structure of an operator processing apparatus shown in fig. 12, the apparatus 1200 includes:
an obtaining module 1202, configured to obtain topology information of a computation graph, where the topology information records a plurality of operators included in the computation graph, graph input data, graph output data, graph intermediate data, and dependency relationships of the plurality of operators;
The constructing module 1204 is configured to construct an operator execution queue according to topology information of the computation graph, where each element in the operator execution queue corresponds to one operator in the computation graph and input data and output data of the operator, the input data of the operator is at least one of the graph input data or the graph intermediate data, the output data of the operator is at least one of the graph intermediate data or the graph output data, and an execution order of a plurality of operators in the operator execution queue is determined according to a dependency relationship of the plurality of operators.
The acquiring module 1202 is configured to implement the S302 related content in the foregoing method embodiment, and the constructing module 1204 is configured to implement the S304 related content in the foregoing method embodiment.
The acquisition module 1202 and the construction module 1204 may be implemented by software or hardware.
When implemented in software, the acquisition module 1202 and the construction module 1204 may be applications running on a computing device. For example, the construction module 1204 may be a computing engine running on a computing device. The application may be provided to users in the form of a virtualized service. The virtualization service may include a virtual machine (VM) service, a bare metal server (BMS) service, and a container service. The VM service is a service that virtualizes a VM resource pool on a plurality of physical hosts (such as computing devices) through virtualization technology to provide users with VMs on demand. The BMS service is a service that virtualizes a BMS resource pool on a plurality of physical hosts to provide users with BMSs on demand. The container service is a service that virtualizes a container resource pool on a plurality of physical hosts to provide users with containers on demand. A VM is a simulated virtual computer, i.e., a computer in the logical sense. A BMS is an elastically scalable high-performance computing service whose computing performance is indistinguishable from that of a traditional physical machine and which offers secure physical isolation. A container is a kernel virtualization technology that provides lightweight virtualization to isolate user space, processes, and resources. It should be understood that the VM service, the BMS service, and the container service are merely specific examples of the virtualization service; in practical applications, the virtualization service may be another lightweight or heavyweight virtualization service, which is not specifically limited herein.
When implemented in hardware, the acquisition module 1202 and the construction module 1204 may include at least one processor, such as a CPU. Alternatively, the acquisition module 1202 and the construction module 1204 may be devices implemented using an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or the like. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
In some possible implementations, the construction module 1204 is further configured to:
number the graph input data, the graph output data, and the graph intermediate data;
Each element in the operator execution queue records one operator in the computation graph, the number of the input data of the operator and the number of the output data of the operator.
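To illustrate the binding between tensor numbers and operators, a possible (assumed, not prescribed) element layout is shown below in C++: the graph input data, graph output data and graph intermediate data are numbered consecutively, and each queue element stores the operator together with the numbers of its input data and output data, so that the numbers can later serve as tensor indices during execution of the graph.

```cpp
#include <cstddef>
#include <vector>

// One element of the operator execution queue; field names are assumptions.
struct QueueElement {
    int opId;                        // which operator in the computation graph
    std::vector<int> inputNumbers;   // numbers of graph input / intermediate data
    std::vector<int> outputNumbers;  // numbers of graph intermediate / output data
};

// Assign consecutive numbers to the graph input data, graph output data and
// graph intermediate data; the numbers act as tensor indices at execution time.
struct TensorNumbering {
    std::vector<int> inputs, outputs, intermediates;
    void Number(std::size_t nInputs, std::size_t nOutputs, std::size_t nIntermediates) {
        int next = 0;
        for (std::size_t i = 0; i < nInputs; ++i)        inputs.push_back(next++);
        for (std::size_t i = 0; i < nOutputs; ++i)       outputs.push_back(next++);
        for (std::size_t i = 0; i < nIntermediates; ++i) intermediates.push_back(next++);
    }
};
```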
In some possible implementations, the apparatus 1200 further includes:
the memory planning module 1206 is configured to determine a size of the graph working space according to the dimension information of the graph input data and the correspondence between the operator corresponding to each element in the operator execution queue and the input data and output data of the operator, and to apply for memory as the graph working space according to the size of the graph working space.
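A possible way to derive the size of the graph working space is sketched below in C++ (illustrative only; the per-tensor byte sizes are assumed to have already been derived from the dimension information propagated through the operators). Summing the sizes of all graph intermediate data gives a conservative upper bound; the embodiment may plan more tightly using the dependency information described later.

```cpp
#include <cstddef>
#include <cstdlib>
#include <numeric>
#include <vector>

// intermediateBytes[i]: byte size of the i-th graph intermediate tensor, derived
// from the dimension information of the graph input data and the operators.
std::size_t PlanWorkspaceSize(const std::vector<std::size_t>& intermediateBytes) {
    return std::accumulate(intermediateBytes.begin(), intermediateBytes.end(),
                           static_cast<std::size_t>(0));
}

// Apply for memory as the graph working space; std::malloc stands in for the
// device-side allocator an accelerator runtime would actually use.
void* AllocateWorkspace(std::size_t bytes) {
    return std::malloc(bytes);
}
```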
Similar to the construction module 1204, the memory planning module 1206 may be implemented in software, or in hardware.
When implemented in software, the memory planning module 1206 may be an application running on a computing device. For example, the memory planning module 1206 may be a computing engine running on a computing device. The application may be provided for use by the user in the manner of a virtualized service. For example, the application may be provided for use by a user through a VM service, a BMS service, or a container service.
When implemented in hardware, the memory planning module 1206 may include at least one processor, such as a CPU or the like. Alternatively, the memory planning module 1206 may be a device implemented with an application specific integrated circuit ASIC, or a programmable logic device PLD (e.g., CPLD, FPGA), or the like.
In some possible implementations, the memory planning module 1206 is further configured to:
acquire dimension information of the graph intermediate data and dependency information of the graph intermediate data, where the dependency information indicates the operators that depend on the graph intermediate data; and determine an offset address of the graph intermediate data in the graph working space and the size of the graph intermediate data according to the dimension information of the graph intermediate data and the dependency information of the graph intermediate data.
In some possible implementations, the graph intermediate data includes first graph intermediate data and second graph intermediate data, and when the operators that depend on the first graph intermediate data are executed before the operator that outputs the second graph intermediate data, an offset address of the first graph intermediate data is less than or equal to an offset address of the second graph intermediate data.
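One way to realize memory reuse in this spirit is a first-fit planner that walks the graph intermediate data in production order and releases a region once every operator depending on its tensor has executed, so data whose consumers finish early can be overlapped by data produced later. The C++ sketch below is an illustrative assumption; the embodiment's exact ordering rule for offset addresses may be enforced differently.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// For each graph intermediate tensor: its byte size, the queue position of the
// operator that produces it, and the last queue position of an operator that
// depends on it (taken from the dependency information).
struct MidTensor {
    std::size_t bytes;
    int producedAt;
    int lastUsedAt;
};

struct Placement { std::size_t offset, bytes; int lastUsedAt; };

// First-fit planning in production order: regions whose consumers have all
// executed are released before the next tensor is placed.
std::vector<std::size_t> PlanOffsets(const std::vector<MidTensor>& tensors) {
    std::vector<std::size_t> offsets(tensors.size(), 0);
    std::vector<Placement> live;
    for (std::size_t i = 0; i < tensors.size(); ++i) {
        // Release placements that are no longer depended on.
        live.erase(std::remove_if(live.begin(), live.end(),
                                  [&](const Placement& p) {
                                      return p.lastUsedAt < tensors[i].producedAt;
                                  }),
                   live.end());
        // Place the tensor at the lowest offset that overlaps no live region.
        std::sort(live.begin(), live.end(),
                  [](const Placement& a, const Placement& b) { return a.offset < b.offset; });
        std::size_t candidate = 0;
        for (const Placement& p : live) {
            if (candidate + tensors[i].bytes <= p.offset) break;
            candidate = std::max(candidate, p.offset + p.bytes);
        }
        offsets[i] = candidate;
        live.push_back({candidate, tensors[i].bytes, tensors[i].lastUsedAt});
    }
    return offsets;
}
```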
In some possible implementations, the memory planning module 1206 is further configured to:
determine offset addresses of the slicing parameters of the operators in the graph working space and sizes of the slicing parameters of the operators; and
load the slicing parameters of the operators into the graph working space according to the offset addresses of the slicing parameters of the operators in the graph working space and the sizes of the slicing parameters of the operators.
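As a sketch (the parameter layout and the copy routine are assumptions made for illustration), loading the slicing parameters amounts to copying each operator's parameter block to its planned offset inside the graph working space, so that the kernels can read the parameters from a known location at execution time.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Slicing parameters of one operator and their planned placement in the workspace.
struct SlicingParam {
    const void* hostData;   // packed parameter block prepared on the host
    std::size_t bytes;      // size of the slicing parameters of this operator
    std::size_t offset;     // offset address inside the graph working space
};

// Copy every operator's slicing parameters to its offset in the graph working
// space; std::memcpy stands in for the host-to-device copy an accelerator uses.
void LoadSlicingParams(void* workspace, const std::vector<SlicingParam>& params) {
    for (const SlicingParam& p : params) {
        std::memcpy(static_cast<char*>(workspace) + p.offset, p.hostData, p.bytes);
    }
}
```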
In some possible implementations, the graph working space is further provided with a temporary memory;
The memory planning module 1206 is further configured to:
determine an offset address of the temporary memory in the graph working space and a size of the temporary memory according to the amount of temporary data generated by execution of the operators.
In some possible implementations, the temporary memories of the plurality of operators have the same offset address.
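Because the temporary data of an operator is only needed while that operator itself runs, and the operators of one graph execute one after another from the queue, all operators can share a single temporary region at one offset address; its size only has to cover the largest per-operator requirement. The following C++ sketch (names assumed) illustrates this; placing the region right after the intermediate-data region is likewise an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// tempBytes[i]: amount of temporary data generated while operator i executes.
// One shared region suffices because the temporary data of different operators
// never needs to be live at the same time.
std::size_t SharedTempRegionBytes(const std::vector<std::size_t>& tempBytes) {
    return tempBytes.empty() ? 0
                             : *std::max_element(tempBytes.begin(), tempBytes.end());
}

// Every operator sees the same offset address for its temporary memory; here the
// region is assumed to sit right after the planned intermediate-data region.
std::size_t SharedTempRegionOffset(std::size_t intermediateRegionBytes) {
    return intermediateRegionBytes;
}
```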
In some possible implementations, the computation graph includes a first computation graph and a second computation graph, the first computation graph includes the same operators as the second computation graph, the dependency relationship of the operators in the first computation graph is the same as the dependency relationship of the operators in the second computation graph, and the dimension information of the data in the first computation graph is the same as the dimension information of the data in the second computation graph;
the apparatus 1200 further comprises:
the adjustment module 1208 is configured to adjust an address of the graph input data to an address of the graph input data of the second computation graph when the first computation graph is executed.
Similar to the acquisition module 1202 and the construction module 1204 described above, the adjustment module 1208 may be implemented in software or in hardware.
When implemented in software, the adjustment module 1208 may be an application running on the computing device. The application may be provided for use by the user in the manner of a virtualized service. For example, the application may be provided for use by a user through a VM service, a BMS service, or a container service.
When implemented in hardware, the adjustment module 1208 may include at least one processor, such as a CPU or the like. Alternatively, the adjustment module 1208 may be a device implemented using an ASIC, a PLD (e.g., CPLD, FPGA), or the like.
FIG. 12 describes the operator processing apparatus from the perspective of functional modularization; the operator processing apparatus is described below from the perspective of hardware entities.
The operator processing apparatus may comprise at least one processor and at least one memory, the at least one memory having stored therein computer readable instructions, the at least one processor executing the computer readable instructions to cause the operator processing apparatus to perform the operator processing method of the previous embodiment.
The operator processing apparatus may be a computing device, such as the server shown in fig. 2. Further, the operator processing apparatus may also be a computing device cluster formed by a plurality of computing devices, for example, a computing device cluster formed by a plurality of servers as shown in fig. 2. The memories of the plurality of computing devices in the computing device cluster may store the same instructions of the operator processing apparatus 1200 for performing the operator processing method.
In some possible implementations, multiple computing devices in the cluster of computing devices may also be used to execute part of the instructions of the operator processing apparatus 1200 for performing the operator processing method. In other words, a combination of multiple computing devices may collectively execute instructions of operator processing apparatus 1200 for performing an operator processing method.
It should be noted that, memories in different computing devices in the computing device cluster may store different instructions for performing part of the functions of the operator processing apparatus 1200.
The embodiment of the application also provides a computer readable storage medium. The computer readable storage medium may be any available medium that a computing device can store, or a data storage device, such as a data center, containing one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk). The computer readable storage medium includes instructions that instruct a computing device to perform the operator processing method described above as applied to the operator processing apparatus 1200.
Embodiments of the present application also provide a computer program product comprising instructions. The computer program product may be software or a program product containing instructions capable of running on a computing device or stored in any useful medium. The computer program product, when run on at least one computing device, causes the at least one computing device to perform the operator processing method described above.
It should be noted that the above-mentioned embodiments are merely intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the above-mentioned embodiments, those skilled in the art should understand that the technical solutions described in the above-mentioned embodiments may still be modified, or some technical features thereof may be equivalently replaced, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.