AI system, memory access control method, and related devices

Technical Field

The present invention relates to the field of electronic devices, and in particular, to an AI system, a memory access control method, and related devices.
Background

In recent years, artificial intelligence (Artificial Intelligence, AI) has been one of the hottest technologies in the field of information and communication technology (Information and Communication Technology, ICT) computing, and the world's top software and hardware manufacturers keep introducing increasingly powerful products, driving rapid progress in AI software and hardware technology. On the one hand, in terms of hardware computing power, the computing power of a single chip for AI computing has grown from about 10 tera operations per second (Tera Operations Per Second, TOPS), i.e., ten trillion operations per second, to approximately 1000 TOPS, and through various communication interconnection technologies, powerful single systems on chip (System on Chip, SoC) can be combined into AI computing cluster servers with thousands or even hundreds of thousands of SoCs as cores, meeting the high-speed and high-precision requirements of training or inference computing tasks of various artificial intelligence networks.
On the other hand, with the increasingly deep and wide application and popularization of AI and internet technologies in various industries, and with the large amount of data generated in the production and life of those industries, the scale and complexity of the problems solved by artificial intelligence, such as automatic driving, natural language processing, and machine learning, have become enormous, driving the scale and complexity of AI computing network models to grow geometrically. For example, the pre-trained language model (Generative Pre-trained Transformer, GPT) proposed by OpenAI developed in scale as shown in Table 1:
TABLE 1
| Model | Publication time | Number of parameters | Pre-training data volume |
| GPT | June 2018 | 117 million | About 5 GB |
| GPT-2 | February 2019 | 1.5 billion | About 40 GB |
| GPT-3 | May 2020 | 175 billion | About 45 TB |
Generally, AI computing can be divided into two main categories by application process: training and inference. To accomplish large-scale training of AI network models within an acceptable time, a high-performance AI computing cluster must be used. In the prior art, completing such computing tasks requires a great amount of computing power on the AI computing cluster servers, and a great amount of input, output, and intermediate data is generated during computation, such as various sample data (text, voice, image, and video), the weights/parameters of a neural network, gradient data, and the feature maps (feature maps) obtained during model training. These data are often stored in high-speed memory on the SoC. For example, there are a large number of concurrent computing hardware units on an AI computing SoC, and the AI chip needs to frequently access memory data on the SoC during computation, such as temporarily storing data in a memory or reading data from a memory; memory bandwidth is therefore often a key bottleneck affecting AI computing performance. For another example, in the training and inference process, because large-scale model computation needs to be completed, the nodes of an AI cluster often cooperate to perform training tasks. During computation, concurrent communication flows (such as model feature maps and parameter weights) and AI computing flows exist simultaneously within each service node (server) of the cluster and between servers, together with the data flows of various dedicated hardware accelerators (such as the Da Vinci vision preprocessor (Davinci Vision Pre-Processor, DVPP), the audio signal processor (Audio Signal Processing, ASP), and the image signal processor (Image Signal Processing, ISP)); if these data flows are not controlled when accessing memory, performance degrades severely.
Therefore, how to efficiently and reasonably utilize the precious memory bandwidth on the SoC to improve AI computing performance is one of the problems to be solved.
Disclosure of Invention
The embodiments of the application provide an AI system, a memory access control method, and related devices, so as to improve the computing performance of the AI system.
In a first aspect, an embodiment of the present application provides an artificial intelligence (AI) system, comprising an AI system on chip (SoC), where the AI SoC includes M subsystems and N memory controllers, and the M subsystems and the N memory controllers are interconnected through an SoC bus; the M subsystems include a target subsystem, the target subsystem is any subsystem in the M subsystems, the target subsystem includes S processing nodes, and M, N, and S are integers greater than or equal to 1; a target processing node in the S processing nodes is configured to: receive a computing task to be executed, where the computing task carries a quality of service identifier (QoS ID), the target processing node is any one of the S processing nodes, and the QoS ID is used to indicate the category to which the computing task belongs; generate a memory access request of the computing task, where the memory access request carries the QoS ID; and send the memory access request to a target memory controller in the N memory controllers; the target memory controller is configured to: receive the memory access request, and determine a first quality of service (QoS) priority corresponding to the QoS ID carried in the memory access request; and perform memory access QoS control on the memory access request based on the first QoS priority.
In the AI computing field, the embodiments of the application introduce an on-chip memory access quality of service (Quality of Service, QoS) control technique. Each computing task to be allocated to the AI SoC in the AI system is marked with a QoS ID, and computing tasks of different categories carry different QoS IDs (for example, tasks may be classified by the service flow they belong to, or by their different memory access delay requirements). The QoS priority corresponding to the memory access request of each computing task can thus be determined from the QoS ID carried by the task, and QoS control is finally performed on each memory access request based on the determined QoS priority. This achieves memory access QoS control at the granularity of the computing task, provides different memory access service guarantees for computing tasks of different categories (such as different service flows), and finally yields better AI computing performance on the basis of the AI system's existing computing power and memory bandwidth resources. In contrast, the prior art performs access control only at the level of a processing node in the SoC (i.e., all memory access requests from the same processing node are controlled with a uniform access quality of service), so it cannot meet the actual memory access requirements of the various computing tasks in the AI system (such as computing tasks belonging to different service flows), which ultimately results in poor computing performance of the AI system. Specifically, in the embodiments of the present application, when a processing node in each subsystem of the AI SoC (for convenience of description, a processing node may be referred to as a Master in this application, which is not repeated below) is allocated a computing task, a QoS ID is carried in that computing task to represent the category to which it belongs, and the priority of the memory access QoS corresponding to the computing task can finally be determined from that category. This is based on the observation that, in the AI computing field, computing tasks of different categories (such as computing tasks under different service flows) have different demands on memory access quality of service, and memory access competition exists between some categories of computing tasks but not between others. Therefore, setting a matched QoS priority for the memory access requests of a computing task according to its category better satisfies the memory access requirements of the different categories (it can be understood that computing tasks of different categories correspond to different QoS IDs, but different QoS IDs may correspond to either the same or different QoS priorities). Further, while executing the received computing tasks, each processing node generates a memory access request for each computing task according to the memory address and data the task needs to access, and carries the task's QoS ID onward in the memory access request; that is, the QoS ID flows along with the computing task into the corresponding memory access request. When a memory controller subsequently receives the memory access request, it can perform memory access control of the corresponding priority according to the QoS ID carried in that request; for example, the higher the QoS priority corresponding to the QoS ID, the better the memory access service quality the memory controller may provide for requests carrying that QoS ID. Different memory access QoS control is thus performed for computing tasks with different access priority demands, avoiding the severe system performance degradation caused in the prior art by indiscriminate treatment and random preemption of the key memory bandwidth resources. In summary, the embodiments of the application control memory access service quality at the granularity of the computing task, solve the problem of insufficient memory bandwidth caused by concurrent competition among various computing tasks (such as different types of service flows) for memory bandwidth in AI training and inference tasks, and can preferentially guarantee the computing tasks with higher delay requirements in AI training and inference, so that memory bandwidth resources are utilized more fully and efficiently, the memory access load of the whole AI system is finally balanced, and the comprehensive execution performance and efficiency of the whole AI system are improved.
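To make this QoS ID flow concrete, the following C sketch models it in software: a memory access request inherits the QoS ID of the computing task that generated it, and a controller-side lookup table maps the QoS ID to a QoS priority. All names (compute_task_t, mem_request_t, qos_prio_table) and the table values are illustrative assumptions, not the hardware interfaces of the embodiment.

```c
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t task_id;
    uint8_t  qos_id;        /* category of the computing task */
} compute_task_t;

typedef struct {
    uint64_t addr;          /* memory address to access */
    uint32_t len;           /* access length in bytes */
    uint8_t  qos_id;        /* inherited unchanged from the task */
} mem_request_t;

/* Hypothetical controller-side table: distinct QoS IDs may map to the
 * same or to different priorities (higher value = better service). */
static const uint8_t qos_prio_table[8] = {3, 3, 2, 2, 1, 1, 0, 0};

static mem_request_t make_request(const compute_task_t *t,
                                  uint64_t addr, uint32_t len) {
    /* The QoS ID "flows along with the computing task" into the request. */
    mem_request_t r = { .addr = addr, .len = len, .qos_id = t->qos_id };
    return r;
}

int main(void) {
    compute_task_t task = { .task_id = 42, .qos_id = 2 };
    mem_request_t req = make_request(&task, 0x80001000u, 64);
    printf("qos_id=%u -> first QoS priority=%u\n",
           (unsigned)req.qos_id,
           (unsigned)qos_prio_table[req.qos_id & 7]);
    return 0;
}
```

Note that in this hypothetical table two different QoS IDs (e.g., 0 and 1) map to the same priority, matching the point above that different QoS IDs may correspond to the same QoS priority.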
In a possible implementation manner, the computing task further carries a second QoS priority corresponding to the QoS ID, where the second QoS priority is an initial QoS priority corresponding to the QoS ID in the computing task.
In the embodiment of the present application, in addition to the QoS ID of the computing task, the computing task allocated to each processing node (e.g., Master) in a subsystem of the AI system may also carry an initial QoS priority (i.e., a second QoS priority) corresponding to the QoS ID in the computing task. That is, the QoS ID and its corresponding initial QoS priority may be configured for the computing task when the task is first allocated, so that subsequent adjustment and control of the QoS priority, and the corresponding memory access QoS control, can be performed on the basis of the QoS ID and the initial QoS priority. Optionally, in the embodiment of the present application, the QoS ID carried in the memory access request of the computing task may remain unchanged during the transfer from the Master to the target memory controller, while the corresponding QoS priority may be adjusted and optimized at each scheduling stage according to the different requirements of the memory access request.
In a possible implementation manner, the target subsystem further comprises a sub-scheduler; the target processing node is specifically configured to: send the memory access request to the sub-scheduler, where the sub-scheduler is configured to schedule the memory access request to the target memory controller in the N memory controllers; the sub-scheduler is configured to: receive the memory access requests respectively sent by the S processing nodes in the target subsystem; and schedule the memory access requests respectively sent by the S processing nodes to the SoC bus according to the second QoS priorities corresponding to the QoS IDs carried in those requests, where the second QoS priority is the initial QoS priority corresponding to a QoS ID and is used to indicate the priority with which the corresponding memory access request is scheduled to the SoC bus.
In the embodiment of the application, each subsystem of the AI SoC in the AI system further comprises a sub-scheduler, which can be used to schedule the memory access requests of the computing tasks being executed on all processing nodes (such as Masters) in that subsystem. The memory access requests generated by the Masters in each subsystem are first scheduled by the subsystem's sub-scheduler, then sent to the SoC bus for arbitration, address resolution, and routing, and finally issued to the corresponding memory controllers for memory access. Because the memory access requests of the computing tasks executed on each Master carry the QoS IDs of the corresponding computing tasks, the sub-scheduler in each subsystem can, while scheduling, preferentially send to the SoC bus the memory access requests whose QoS IDs correspond to higher QoS priorities and delay those whose QoS IDs correspond to lower QoS priorities. The QoS priorities corresponding to the memory access requests are thus already taken into account when the requests are issued to the SoC bus, so that memory access control services matched with the QoS IDs of the requests are provided for each computing task from the very source of the whole AI system.
In one possible implementation manner, the sub-scheduler is specifically configured to: respectively establishing task queues for the S processing nodes, wherein each task queue comprises a memory access request sent by the corresponding processing node; wherein the target processing node corresponds to a target task queue; when a target access request is currently inserted into the target task queue, respectively lifting the second QoS priorities corresponding to the QoS IDs carried in all the access requests in the target task queue to third QoS priorities, wherein the target access request is an access request with the second QoS priorities corresponding to the QoS IDs exceeding a preset priority; and according to the second QoS priority or the third QoS priority corresponding to the QoS IDs carried in the memory access requests in the task queues of the S processing nodes, the memory access requests in the task queues of the S processing nodes are sequentially sent to the SoC bus.
In the embodiment of the application, when the sub-scheduler of each subsystem specifically schedules memory access requests, a task queue is created for each processing node (such as a Master): all memory access requests generated by one Master are placed in one task queue, and the requests are sent to the SoC bus in turn according to the QoS priorities corresponding to the QoS IDs they carry. When a memory access request with a higher QoS priority appears in a task queue, all requests in that queue could otherwise be blocked behind a lower-priority request at the front of the queue (head-of-line blocking). To avoid this, in the embodiment of the present application the sub-scheduler promotes the QoS priority of all memory access requests in that task queue (i.e., from the second QoS priority to a third QoS priority), so that no request in the queue (in particular the higher-priority request just mentioned) is blocked by a lower-priority request at the front (or head) of the queue, thereby optimizing the efficiency and effect of memory access QoS control as a whole.
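The queue-promotion behavior described above can be pictured with the following minimal C sketch of one per-Master FIFO queue. It assumes a fixed queue depth and a preset threshold (PROMOTE_THRESHOLD), and takes the promoted (third) priority to be the incoming request's priority; all of these are illustrative choices, not the sub-scheduler's actual microarchitecture.

```c
#include <stdint.h>

#define QUEUE_DEPTH        16
#define PROMOTE_THRESHOLD   2   /* the "preset priority"; value assumed */

typedef struct {
    uint8_t qos_id;
    uint8_t second_prio;  /* initial (second) QoS priority */
    uint8_t eff_prio;     /* effective priority: second or promoted third */
} queued_req_t;

typedef struct {
    queued_req_t slots[QUEUE_DEPTH];
    int head;             /* index of the oldest request */
    int count;
} task_queue_t;           /* one FIFO queue per processing node (Master) */

/* Enqueue a request; if its priority exceeds the preset threshold,
 * promote every request already in the queue so that the queue head
 * cannot block it (head-of-line blocking avoidance). */
static int tq_enqueue(task_queue_t *q, queued_req_t r) {
    if (q->count == QUEUE_DEPTH)
        return -1;                       /* queue full */
    r.eff_prio = r.second_prio;
    q->slots[(q->head + q->count) % QUEUE_DEPTH] = r;
    q->count++;
    if (r.second_prio > PROMOTE_THRESHOLD) {
        for (int i = 0; i < q->count; i++) {
            queued_req_t *cur = &q->slots[(q->head + i) % QUEUE_DEPTH];
            if (cur->eff_prio < r.second_prio)
                cur->eff_prio = r.second_prio;   /* third QoS priority */
        }
    }
    return 0;
}

int main(void) {
    task_queue_t q = {0};
    tq_enqueue(&q, (queued_req_t){ .qos_id = 1, .second_prio = 1 });
    tq_enqueue(&q, (queued_req_t){ .qos_id = 5, .second_prio = 3 });
    /* After the second enqueue, both entries have eff_prio == 3. */
    return 0;
}
```

Each entry keeps its original second QoS priority in second_prio, so the SoC bus can later restore it, as described in the implementation below.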
In one possible implementation, the SoC bus is configured to: receiving one or more access requests in the target task queue sent by the sub-scheduler, wherein the one or more access requests comprise the access request; and restoring the third QoS priority corresponding to the QoS ID carried in one or more access requests in the target task queue to the corresponding second QoS priority.
In the embodiment of the application, after the memory access requests in each task queue have had their QoS priorities adjusted by the sub-scheduler of the subsystem and have been scheduled according to the adjusted priorities, they arrive at the SoC bus. At that point, the risk of low-QoS-priority requests blocking a task queue has already been eliminated by the in-subsystem priority adjustment, so once the requests have been scheduled from the task queues onto the SoC bus, they can be restored to their previous QoS priorities, i.e., the third QoS priority is restored to the corresponding second QoS priority. This allows subsequent memory access QoS control to be performed according to the QoS priorities initially allocated by the AI system to the QoS ID of each computing task.
In one possible implementation, the SoC bus is further configured to: and based on the second QoS priority after the recovery of the one or more memory requests in the target task queue, respectively scheduling the one or more memory requests in the target task queue to corresponding memory controllers in the N memory controllers.
In the embodiment of the application, after the SoC bus restores the QoS priority of the QoS ID of the access request scheduled from each subsystem to the initial second QoS priority, the scheduling of the access request can be performed according to the restored second QoS priority, that is, each access request is scheduled to the corresponding memory controller according to the restored second QoS priority, so that the memory controller performs subsequent access QoS control and memory access.
In one possible implementation, the AI SoC further includes an advanced memory access agent MATA: the SoC bus is specifically configured to: and sending one or more memory access requests in the target task queue to the MATA, and respectively scheduling the one or more memory access requests to corresponding memory controllers in the N memory controllers through the MATA. Optionally, in another possible implementation manner, the AI SoC further includes an advanced memory access agent MATA: the SoC bus is specifically configured to: and sending the memory access requests respectively sent by the S processing nodes to the MATA, and respectively scheduling the memory access requests respectively sent by the S processing nodes to corresponding memory controllers in the N memory controllers through the MATA, wherein the memory access requests respectively sent by the S processing nodes comprise the memory access requests.
In the embodiment of the present application, the AI SoC may further include a memory access agent MATA for performing memory access control. When the SoC bus schedules the memory access requests of each subsystem to the corresponding memory controllers, it may do so through the MATA; that is, the MATA can comprehensively control and manage the memory controllers, and can further regulate the received memory access requests, for example, by further optimizing the second QoS priority corresponding to the QoS ID in each memory access request.
In one possible implementation, the MATA is configured to: receive the access request, and determine the second QoS priority corresponding to the QoS ID carried in the access request; and determine the first QoS priority corresponding to the QoS ID based on the second QoS priority corresponding to the QoS ID, in combination with historical memory bandwidth statistics corresponding to the QoS ID and the access policy control parameters corresponding to the QoS ID, where the access policy control parameters include one or more of the highest bandwidth, the lowest bandwidth, and the access priority with which access requests are allowed to pass.
In the embodiment of the present application, after the MATA receives each memory access request scheduled by the SoC bus, it may further optimize and adjust the initial priority (i.e., the second QoS priority) carried in the request. Specifically, before a memory access request is scheduled to a memory controller, the MATA generates the QoS priority finally corresponding to the QoS ID (i.e., the first QoS priority) from the initial QoS priority (i.e., the second QoS priority) corresponding to the QoS ID carried in the request, combined with the historical memory bandwidth statistics corresponding to that QoS ID and the access policy control parameters corresponding to that QoS ID currently recorded and stored by the MATA, so that the target memory controller can finally perform memory access control on the request according to this final QoS priority. That is, when performing memory access control, the MATA considers not only the QoS priority initially configured by the AI system for each QoS ID, but also the historical bandwidth statistics corresponding to each QoS ID (such as the memory bandwidth currently obtained by the class of computing tasks carrying the same QoS ID) and the access policy control parameters configured for each QoS ID (such as the highest bandwidth, the lowest bandwidth, and the access priority with which access requests are allowed to pass), and thus comprehensively decides what memory access QoS control service to provide for the current request, finally obtaining a QoS priority matched with the request. This enables more accurate memory access QoS control and further optimizes and improves the computing performance of the AI system. For example, if the memory access requests corresponding to a certain QoS ID already occupy a large amount of memory bandwidth, the QoS priority of that QoS ID may be reduced for balance, so as to balance the memory bandwidth occupied by each QoS ID; if the memory access requests corresponding to a QoS ID currently occupy little memory bandwidth, the QoS priority corresponding to that QoS ID may be raised to make up for its memory bandwidth occupation.
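One way to picture the MATA adjustment is the following C sketch, which derives the first QoS priority from the second QoS priority, the measured historical bandwidth for the QoS ID, and the per-ID access policy control parameters. The promote/demote-by-one rule and all parameter values are assumptions for illustration, not the embodiment's actual policy.

```c
#include <stdint.h>

typedef struct {
    uint32_t max_bw_mbps;   /* highest bandwidth allowed to pass */
    uint32_t min_bw_mbps;   /* lowest (guaranteed) bandwidth */
    uint8_t  access_prio;   /* configured access priority */
} qos_policy_t;

/* first_prio = second_prio, lowered when the class is over its bandwidth
 * ceiling and raised when it is under its floor; otherwise unchanged. */
static uint8_t mata_first_priority(uint8_t second_prio,
                                   uint32_t hist_bw_mbps,
                                   const qos_policy_t *p) {
    if (hist_bw_mbps > p->max_bw_mbps && second_prio > 0)
        return (uint8_t)(second_prio - 1);  /* over its ceiling: demote */
    if (hist_bw_mbps < p->min_bw_mbps)
        return (uint8_t)(second_prio + 1);  /* below its floor: promote */
    return second_prio;                     /* within budget: unchanged */
}

int main(void) {
    qos_policy_t policy = { .max_bw_mbps = 4000, .min_bw_mbps = 500,
                            .access_prio = 2 };
    uint8_t first = mata_first_priority(2 /* second prio */, 300, &policy);
    (void)first;   /* 3: promoted because the class is under its floor */
    return 0;
}
```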
In one possible implementation, the MATA is further configured to: presetting access policy control parameters corresponding to each QoS ID, and counting and recording historical memory bandwidths corresponding to each QoS ID; and updating and optimizing the access policy control parameters corresponding to each QoS ID according to the access performance real-time monitoring information of the AI system.
In the embodiment of the application, the MATA configures corresponding access policy control parameters for each QoS ID, and also counts and records the historical memory bandwidth corresponding to each QoS ID, so as to determine, from these two pieces of information, whether to raise or lower the QoS priority relative to the initial priority corresponding to a certain QoS ID, and finally determine the final QoS priority of the memory access request corresponding to that QoS ID, so that the memory controller can perform specific memory access QoS control according to the final QoS priority. For example, the MATA may set the highest bandwidth, the lowest bandwidth, the access priority, and the like with which access requests carrying a certain QoS ID are allowed to pass. Furthermore, the MATA can update and optimize the access policy control parameters corresponding to each QoS ID according to the real-time memory access performance monitoring information of the AI system, for example, by adjusting them with an optimization algorithm or an adaptive machine learning algorithm.
In one possible implementation, the MATA is further configured to: and carrying the first QoS priority in the access request, and scheduling the access request to the target memory controller based on the first QoS priority. Optionally, the MATA may send the QoS ID to the memory controller in addition to the first QoS priority carried in the memory request. Optionally, in another possible implementation manner, the AI SoC further includes a MATA; the MATA is used for: and carrying the determined first QoS priority in the access request, and scheduling the access request to the target memory controller based on the first QoS priority.
In the embodiment of the present application, after the MATA determines the final priority of the memory access request, the final priority (i.e., the first QoS priority) may be carried in the memory access request and sent to the corresponding memory controller, so that the corresponding memory controller may perform the memory access QoS control according to the first QoS priority. In addition, the MATA may schedule the memory request to the target memory controller based on the first QoS priority. Optionally, if the MATA carries the first QoS priority in the access request, and further carries the QoS ID in the access request and sends the access request to the memory controller, the memory controller may jointly make a decision for controlling the access QoS according to the first QoS priority and the QoS ID, for example, the memory controller may calculate, according to the QoS ID, a historical memory bandwidth occupied by the access request corresponding to the QoS ID on itself, and further optimize the access QoS control according to this.
In one possible implementation manner, the target memory controller is specifically configured to: and performing access QoS control on the access request based on the first QoS priority corresponding to the QoS ID and combining the access service condition of the target memory controller, wherein the access service condition comprises a memory access time sequence requirement or a memory bandwidth bus utilization rate.
In the embodiment of the application, after the memory access request is scheduled to each memory controller by the SoC bus through the MATA, the memory controller can perform the memory access QoS control on the memory access request according to the final QoS priority (i.e., the first QoS priority) carried in the memory access request and in combination with the current service condition of the memory controller. That is, when the memory controller performs access QoS control, not only the QoS priority finally generated by the MATA for each QoS ID is considered, but also the current service condition of each memory controller (for example, the access timing requirement on the memory controller, the memory bandwidth bus utilization rate, etc.) is further considered at the same time, so as to perform more accurate access QoS control, thereby further optimizing and improving the computing performance of the AI system. Optionally, when the access request received by the memory controller further carries a QoS ID, the memory controller may further calculate, according to the QoS ID, a historical memory bandwidth occupied by the access request corresponding to the QoS ID on its own, and further optimize the access QoS control according to the historical memory bandwidth.
In one possible implementation, the access service condition includes a memory access timing requirement, or a memory bandwidth bus utilization.
In the embodiment of the application, the memory controller ultimately performs the final memory access control on each memory access request according to its current memory access service condition, which makes the memory access QoS control of each request more accurate and reasonable. Control is thus not performed solely according to the QoS priority of the computing task; instead, the current actual conditions of each memory controller, such as the memory access timing requirements and the memory bandwidth bus utilization, can additionally be taken into account to comprehensively decide what memory access QoS control service to provide for the current request.
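As a rough illustration of how a memory controller might combine the first QoS priority with its current access service condition, the following C fragment switches between strict priority order under a saturated bus and oldest-first order under a lightly loaded bus; the 80% cut-over and the age-based rule are assumptions, not the embodiment's actual arbitration.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint8_t  first_prio;   /* final QoS priority from the MATA */
    uint32_t age;          /* cycles spent waiting, for fairness */
} pending_req_t;

/* One possible policy: when the memory bus is near saturation, serve
 * strictly by first QoS priority; otherwise serve the oldest request,
 * which keeps utilization high under relaxed timing pressure. */
static size_t mc_pick_next(const pending_req_t *reqs, size_t n,
                           uint32_t bus_util_pct) {
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (bus_util_pct > 80u
                ? reqs[i].first_prio > reqs[best].first_prio
                : reqs[i].age > reqs[best].age)
            best = i;
    }
    return best;
}

int main(void) {
    pending_req_t reqs[3] = { { .first_prio = 1, .age = 50 },
                              { .first_prio = 3, .age = 5 },
                              { .first_prio = 2, .age = 80 } };
    /* Saturated bus: index 1 (highest priority) wins; idle bus: index 2. */
    return (mc_pick_next(reqs, 3, 95) == 1 &&
            mc_pick_next(reqs, 3, 40) == 2) ? 0 : 1;
}
```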
In one possible implementation, the target memory controller is further configured to: when the number of access requests received by the target memory controller is greater than a preset threshold, broadcast a back pressure indication to the M subsystems, where the back pressure indication is used to instruct one or more of the M subsystems to delay, reduce, or stop sending access requests.
In the embodiment of the present application, when the number of memory access requests received by a certain memory controller exceeds a preset threshold, the related subsystems may be instructed to reduce, delay, or even stop the access requests currently being sent. After receiving the indication, a related subsystem may adjust its sending of access requests according to its own situation, for example, suspend sending access requests to the SoC bus, or stop sending access requests to the SoC bus.
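A minimal sketch of such a back pressure decision is shown below; the threshold value and the three-level delay/reduce/stop escalation are assumed for illustration, consistent with the behavior described above.

```c
#include <stdint.h>

#define BACKPRESSURE_THRESHOLD 256u   /* preset threshold; value assumed */

typedef enum { BP_NONE, BP_DELAY, BP_REDUCE, BP_STOP } backpressure_t;

/* Decide the back pressure level from the controller's pending request
 * count; the escalation steps are an illustrative policy. The resulting
 * indication would be broadcast over the SoC bus to all M subsystems. */
static backpressure_t mc_backpressure(uint32_t pending_reqs) {
    if (pending_reqs <= BACKPRESSURE_THRESHOLD)      return BP_NONE;
    if (pending_reqs <= 2 * BACKPRESSURE_THRESHOLD)  return BP_DELAY;
    if (pending_reqs <= 4 * BACKPRESSURE_THRESHOLD)  return BP_REDUCE;
    return BP_STOP;
}

int main(void) {
    return mc_backpressure(600) == BP_REDUCE ? 0 : 1;
}
```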
In one possible implementation, the AI system further includes a host; the host is configured to: receive a task to be executed, and split the task to be executed into one or more computing tasks to be executed; identify the service flow type of each of the one or more split computing tasks to be executed according to a preset service flow label table, where the preset service flow label table includes a mapping relationship between predefined service flow types of computing tasks and QoS IDs; and carry the corresponding QoS ID in each of the one or more computing tasks to be executed according to the identification result.
In the embodiment of the application, the AI system comprises, in addition to the plurality of subsystems that can execute the computing tasks and the plurality of memory controllers, a Host that uniformly receives the various computing tasks issued by users. The Host can identify and mark the types of service flows in the AI network model, that is, give the computing tasks under different service flows different memory access QoS labels, namely QoS IDs, so that the whole AI system that follows can perform reasonable and matched memory access QoS control on the computing tasks carrying those QoS IDs, finally balancing the memory access load of the whole AI system and improving its comprehensive execution performance and efficiency.
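The host-side labeling can be pictured with the following C sketch of a preset service flow label table; the flow types and the flow-to-QoS-ID mapping are illustrative assumptions rather than the predefined table of the embodiment.

```c
#include <stdint.h>

/* Illustrative service flow types for an AI training job. */
typedef enum {
    FLOW_SAMPLE_LOAD,    /* input samples: text/voice/image/video */
    FLOW_FEATURE_MAP,    /* intermediate feature maps */
    FLOW_WEIGHT_PARAM,   /* weights/parameters */
    FLOW_GRADIENT_COMM,  /* gradient/parameter communication flows */
    FLOW_COUNT
} flow_type_t;

/* Preset service flow label table: flow type -> QoS ID. */
static const uint8_t flow_label_table[FLOW_COUNT] = {
    [FLOW_SAMPLE_LOAD]   = 0,
    [FLOW_FEATURE_MAP]   = 1,
    [FLOW_WEIGHT_PARAM]  = 2,
    [FLOW_GRADIENT_COMM] = 3,
};

typedef struct {
    uint32_t    task_id;
    flow_type_t flow;    /* identified from the split computing task */
    uint8_t     qos_id;
} compute_task_t;

/* Host side: after splitting a job into computing tasks, tag each task
 * with the QoS ID of its service flow type. */
static void host_tag_task(compute_task_t *t) {
    t->qos_id = flow_label_table[t->flow];
}

int main(void) {
    compute_task_t t = { .task_id = 7, .flow = FLOW_FEATURE_MAP };
    host_tag_task(&t);
    return t.qos_id == 1 ? 0 : 1;
}
```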
In one possible implementation, the system further comprises a system scheduler; the host computer is further configured to: and sending one or more computing tasks carrying the corresponding QoS IDs to the system scheduler.
In the embodiment of the application, after the host in the AI system identifies the service flow and carries the QoS ID, the calculation tasks carrying the QoS ID can be sent to the system scheduler on the AI SoC for subsequent allocation. That is, after the host computer splits, identifies and labels the tasks to be executed, the processed computing tasks are issued to the system scheduler, so that the system scheduler can conveniently schedule and distribute the computing tasks which are already labeled (i.e. carry the matched QoS IDs).
In one possible implementation, the host or the target processing node is further configured to: and configuring a corresponding second QoS priority for the QoS ID in the computing task in advance, wherein the second QoS priority is an initial priority corresponding to the QoS ID.
In the embodiment of the present application, the host side or the target processing node also configures an initial QoS priority (i.e., a second QoS priority) for each computing task, that is, configures a matched QoS priority for each QoS ID, so that the related modules in the subsequent AI SoC can adjust the subsequent QoS priority, or determine the final QoS priority, on the basis of this initial QoS priority.
In one possible implementation, the host is further configured to: and updating and optimizing the second QoS priority corresponding to each QoS ID according to the access performance real-time monitoring information of the AI system.
In the embodiment of the application, the host side in the AI system can also update and optimize the initial QoS priority corresponding to each QoS ID in the system according to the real-time monitoring information of the access performance. QoS auto-optimization is performed adaptively, for example, by an optimization algorithm and an adaptive machine learning algorithm.
In one possible implementation, the system scheduler is configured to: receiving one or more computing tasks to be executed, which are sent by the host; each computing task to be executed also carries a task descriptor for describing the type of the computing task; selecting a matched subsystem from the M subsystems for each computing task to be executed according to the task descriptors carried in each computing task to be executed, and selecting a matched processing node from one or more processing nodes in the matched subsystem; and dispatching each computing task to be executed to the matched processing node in the matched subsystem.
In the embodiment of the application, after the Host in the AI system identifies the service flows and attaches the QoS IDs, the computing tasks carrying the QoS IDs are sent to the system scheduler on the AI SoC, and the system scheduler can reasonably distribute all the computing tasks sent by the Host. The specific distribution principle may be to distribute according to the task descriptor carried in each computing task, so that a suitable subsystem and processing node are allocated for each computing task according to the task type described by its descriptor, thereby better completing the execution or acceleration of each computing task; for example, an AI matrix computation task is assigned to a suitable AI subsystem and to an idle processing node on that AI subsystem, as sketched below.
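A simplified software model of this descriptor-driven dispatch is given below; the task kinds, the per-subsystem accelerator kind, and the first-idle-node selection rule are assumptions for illustration.

```c
#include <stdint.h>
#include <stddef.h>

typedef enum { TASK_MATRIX, TASK_VECTOR, TASK_DVPP, TASK_CPU } task_kind_t;

typedef struct {
    task_kind_t kind;      /* task descriptor: what kind of compute */
    uint8_t     qos_id;
} task_desc_t;

typedef struct {
    task_kind_t accel_kind;  /* what this subsystem accelerates */
    uint8_t     busy[8];     /* per-node busy flags; S <= 8 assumed */
    size_t      num_nodes;
} subsystem_t;

/* Pick the first subsystem whose accelerator kind matches the descriptor,
 * then the first idle processing node in it; returns -1 if none match. */
static int schedule_task(const task_desc_t *t,
                         subsystem_t *subs, size_t m,
                         size_t *out_sub, size_t *out_node) {
    for (size_t s = 0; s < m; s++) {
        if (subs[s].accel_kind != t->kind) continue;
        for (size_t n = 0; n < subs[s].num_nodes; n++) {
            if (!subs[s].busy[n]) {
                subs[s].busy[n] = 1;
                *out_sub = s; *out_node = n;
                return 0;
            }
        }
    }
    return -1;
}

int main(void) {
    subsystem_t subs[2] = {
        { .accel_kind = TASK_MATRIX, .num_nodes = 4 },
        { .accel_kind = TASK_DVPP,   .num_nodes = 2 },
    };
    task_desc_t t = { .kind = TASK_MATRIX, .qos_id = 2 };
    size_t s, n;
    return schedule_task(&t, subs, 2, &s, &n);   /* 0: idle node found */
}
```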
In one possible implementation, when the AI system is applied to a virtualization scenario, the AI system includes a plurality of virtual machines, where each of the plurality of virtual machines corresponds to one or more processes, and one process includes one or more computing tasks; the one or more processes run on one or more processing nodes of at least one of the M subsystems; the system scheduler is further configured to: assign a VM ID to each virtual machine; and share the VM ID of the corresponding virtual machine in the page table of the one or more processes corresponding to each virtual machine.
In the embodiment of the application, when the AI system is applied to a virtualization scenario, a VM ID is allocated per virtual machine, and all processes under one virtual machine are set to correspond to the same VM ID, so as to isolate different virtual machines from each other, thereby ensuring security isolation and non-interference between the users corresponding to different virtual machines.
In one possible implementation, when the AI system is applied to a virtualization scenario, the target subsystem further includes a system memory management unit SMMU; the target processing node is further configured to: send the memory access request of the computing task to the SMMU, and update, through the SMMU, the QoS ID carried in the memory access request of the computing task; the SMMU is configured to: receive the memory access request of the computing task sent by the target processing node; determine the target process to which the computing task belongs according to the virtual address and the substream identifier SSID in the memory access request; and determine, according to the page table of the target process, the VM ID of the target virtual machine corresponding to the target process, and replace the QoS ID carried in the memory access request of the computing task with the VM ID of the target virtual machine.
In the embodiment of the application, when the AI system is in a virtualization scenario, the original QoS ID allocation and circulation flow is replaced: the QoS ID is uniformly replaced with the ID of the virtual machine to which the process belongs. That is, each processing node replaces, through its SMMU, the QoS ID carried in the received memory access request with the VM ID of the virtual machine corresponding to the process to which the computing task of that request belongs. The purpose is to treat memory bandwidth isolation as primary in this scenario, so as to meet the basic requirements that the data and computing resources of different virtual machine users are isolated and do not affect each other; furthermore, this solves the memory bandwidth isolation and bandwidth commitment problems among users of different virtual machines.
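The SMMU-side rewrite can be sketched as follows. For brevity the lookup is keyed by SSID alone, and a flat table stands in for the walk through the process page table that actually stores the VM ID; all names are illustrative.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint32_t ssid;     /* substream ID identifying the process */
    uint8_t  vm_id;    /* VM ID recorded in that process's page table */
} smmu_entry_t;

typedef struct {
    uint64_t vaddr;
    uint32_t ssid;
    uint8_t  qos_id;   /* rewritten below in the virtualization case */
} mem_request_t;

/* Replace the request's QoS ID with the VM ID of the virtual machine that
 * owns the issuing process, so bandwidth isolation follows VM boundaries. */
static int smmu_rewrite_qos_id(mem_request_t *req,
                               const smmu_entry_t *tbl, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].ssid == req->ssid) {
            req->qos_id = tbl[i].vm_id;
            return 0;
        }
    }
    return -1;   /* unknown process */
}

int main(void) {
    smmu_entry_t tbl[] = { { .ssid = 10, .vm_id = 1 },
                           { .ssid = 11, .vm_id = 2 } };
    mem_request_t req = { .vaddr = 0x1000, .ssid = 11, .qos_id = 3 };
    smmu_rewrite_qos_id(&req, tbl, 2);
    return req.qos_id == 2 ? 0 : 1;   /* QoS ID now equals the VM ID */
}
```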
In one possible implementation, the AI SoC further includes an L2 Cache; the L2 cache is configured to: and receiving access requests of all the computing tasks, and accessing the corresponding storage areas in the L2 Cache according to QoS IDs carried in the access requests of all the computing tasks, wherein the access requests carrying different QoS IDs correspond to different storage areas in the L2 Cache.
In the embodiment of the application, the storage area of the cache that each memory access request can access can be controlled through the QoS ID carried in that request; that is, the corresponding storage areas in the cache are securely isolated through the QoS IDs in the memory access requests. Because the processes under each virtual machine have a corresponding virtual machine ID, i.e., the VM ID, the VM ID of the process can be carried as the QoS ID in the corresponding memory access request, so that the Cache can be partitioned based on the QoS ID to achieve security isolation in the virtualization scenario.
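The patent text does not specify how the per-QoS-ID storage areas are realized; cache way partitioning is one common mechanism, sketched below under that assumption with illustrative way masks.

```c
#include <stdint.h>

#define L2_NUM_WAYS 16

/* Illustrative way partitioning: each QoS ID (the VM ID in the
 * virtualization case) may allocate only into its own subset of cache
 * ways, so requests with different QoS IDs touch disjoint storage
 * areas. The mask values are assumptions. */
static const uint16_t l2_way_mask_for_qos[4] = {
    0x000F,   /* QoS ID 0: ways 0-3   */
    0x00F0,   /* QoS ID 1: ways 4-7   */
    0x0F00,   /* QoS ID 2: ways 8-11  */
    0xF000,   /* QoS ID 3: ways 12-15 */
};

/* On allocation, a victim way may only be chosen inside the mask. */
static int l2_way_allowed(uint8_t qos_id, unsigned way) {
    return (l2_way_mask_for_qos[qos_id & 3] >> way) & 1u;
}

int main(void) {
    /* QoS ID 2 may allocate into way 9 but not into way 0. */
    return (l2_way_allowed(2, 9) && !l2_way_allowed(2, 0)) ? 0 : 1;
}
```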
In a second aspect, an embodiment of the present application provides a memory access control method, applied to an artificial intelligence (AI) system, where the AI system includes an AI system on chip (SoC), the AI SoC includes M subsystems and N memory controllers, and the M subsystems and the N memory controllers are interconnected through an SoC bus; the M subsystems include a target subsystem, the target subsystem is any subsystem in the M subsystems, the target subsystem includes S processing nodes, and M, N, and S are integers greater than or equal to 1; the method includes: receiving, by a target processing node in the S processing nodes, a computing task to be executed, where the computing task carries a quality of service identifier (QoS ID), the target processing node is any one of the S processing nodes, and the QoS ID is used to indicate the category to which the computing task belongs; generating a memory access request of the computing task, where the memory access request carries the QoS ID in the computing task; sending the memory access request to a target memory controller in the N memory controllers; receiving, by the target memory controller, the memory access request, and determining a first quality of service (QoS) priority corresponding to the QoS ID; and performing memory access QoS control on the memory access request based on the first QoS priority.
In one possible implementation, the computing task further carries a second QoS priority corresponding to the QoS ID, where the second QoS priority is an initial (basic) QoS priority corresponding to the QoS ID in the computing task.
In a possible implementation manner, the target subsystem further comprises a sub-scheduler; the sending, by the target processing node, the access request to the target memory controller of the N memory controllers includes: the access request is sent to the sub-scheduler through a target processing node, and is scheduled to a target memory controller in the N memory controllers through the sub-scheduler; the method further comprises the steps of: receiving access requests sent by the S processing nodes in the target subsystem respectively through the sub-scheduler; scheduling the access requests respectively sent by the S processing nodes to the SoC bus according to second QoS priorities corresponding to QoS IDs carried in the access requests respectively sent by the S processing nodes, wherein the second QoS priorities are initial QoS priorities corresponding to the QoS IDs; the second QoS priority is used to indicate a priority of the corresponding access request being scheduled to the SoC bus.
In a possible implementation manner, the scheduling, by the sub-scheduler, the access requests sent by the S processing nodes to the SoC bus according to the second QoS priorities corresponding to the QoS IDs carried in the access requests sent by the S processing nodes respectively includes: respectively establishing task queues for the S processing nodes through the sub-schedulers, wherein each task queue comprises access requests sent by the corresponding processing node; wherein the target processing node corresponds to a target task queue; when a target access request is currently inserted into the target task queue, respectively lifting the second QoS priorities corresponding to the QoS IDs carried in all the access requests in the target task queue to third QoS priorities, wherein the target access request is an access request with the second QoS priorities corresponding to the QoS IDs exceeding a preset priority; and according to the second QoS priority or the third QoS priority corresponding to the QoS IDs carried in the memory access requests in the task queues of the S processing nodes, the memory access requests in the task queues of the S processing nodes are sequentially sent to the SoC bus.
In one possible implementation, the method further includes: receiving one or more access requests in the target task queue sent by the sub-scheduler through the SoC bus, wherein the one or more access requests comprise the access request; and restoring the third QoS priority corresponding to the QoS ID carried in one or more access requests in the target task queue to the corresponding second QoS priority.
In one possible implementation, the method further includes: and scheduling one or more memory requests in the target task queue to corresponding memory controllers in the N memory controllers respectively based on the second QoS priority after the recovery of the one or more memory requests in the target task queue through the SoC bus.
In one possible implementation, the AI SoC further includes an advanced memory access agent MATA: the scheduling, by the SoC bus, the one or more memory access requests in the target task queue to corresponding memory controllers in the N memory controllers includes: and sending one or more memory access requests in the target task queue to the MATA through the SoC bus, and respectively scheduling the one or more memory access requests to corresponding memory controllers in the N memory controllers through the MATA. Optionally, in another possible implementation manner, the AI SoC further includes an advanced memory access agent MATA: the SoC bus is specifically configured to: and sending the memory access requests respectively sent by the S processing nodes to the MATA, and respectively scheduling the memory access requests respectively sent by the S processing nodes to corresponding memory controllers in the N memory controllers through the MATA, wherein the memory access requests respectively sent by the S processing nodes comprise the memory access requests.
In one possible implementation, the method further includes: receiving the access request through the MATA, and determining the second QoS priority corresponding to the QoS ID carried in the access request; and determining the first QoS priority corresponding to the QoS ID based on the second QoS priority corresponding to the QoS ID, in combination with historical memory bandwidth statistics corresponding to the QoS ID and the access policy control parameters corresponding to the QoS ID, where the access policy control parameters include one or more of the highest bandwidth, the lowest bandwidth, and the access priority with which access requests are allowed to pass.
In one possible implementation, the method further includes: presetting, through the MATA, the access policy control parameters corresponding to each QoS ID, and counting and recording the historical memory bandwidth corresponding to each QoS ID; and updating and optimizing the access policy control parameters corresponding to each QoS ID according to the real-time memory access performance monitoring information of the AI system, for example, through an optimization algorithm or an adaptive machine learning algorithm.
In one possible implementation, the method further includes: and carrying the first QoS priority in the access request through the MATA, and scheduling the access request to the target memory controller based on the first QoS priority. Optionally, in another possible implementation manner, the AI SoC further includes a MATA; the method further comprises the steps of: and carrying the determined first QoS priority in the access request through the MATA, and scheduling the access request to the target memory controller based on the first QoS priority.
In one possible implementation manner, the performing, by the target memory controller, access QoS control on the access request based on the first QoS priority includes: and performing access QoS control on the access request by the target memory controller based on the first QoS priority corresponding to the QoS ID and combining the access service condition of the target memory controller, wherein the access service condition comprises a memory access time sequence requirement or a memory bandwidth bus utilization rate.
In one possible implementation, the method further includes: when the number of access requests received by the target memory controller is greater than a preset threshold, broadcasting, by the target memory controller, a back pressure indication to the M subsystems, where the back pressure indication is used to instruct one or more of the M subsystems to delay, reduce, or stop sending access requests.
In one possible implementation, the AI system further includes a host; the method further comprises the steps of: receiving a task to be executed through the host, and splitting the task to be executed into one or more computing tasks to be executed; identifying the service flow type of the one or more calculation tasks to be executed after splitting according to a preset service flow label table, wherein the preset service flow label table comprises a mapping relation between the service flow type of the predefined calculation task and QoS ID; and carrying corresponding QoS IDs for the one or more computing tasks to be executed respectively according to the identification result.
In one possible implementation, the AI SoC further includes a system scheduler; the method further comprises the steps of: and sending, by the host, one or more computing tasks carrying the corresponding QoS IDs to the system scheduler.
In one possible implementation, the method further includes: configuring in advance, by the host or by the target processing node (Master), a corresponding second QoS priority for the QoS ID in the computing task, where the second QoS priority is the initial priority corresponding to the QoS ID.
In one possible implementation, the method further includes: updating and optimizing, by the host, the second QoS priority corresponding to each QoS ID according to the real-time memory access performance monitoring information of the AI system, for example, through an optimization algorithm or an adaptive machine learning algorithm.
In one possible implementation, the method further includes: receiving, by the system scheduler, the one or more computing tasks to be executed sent by the host; each computing task to be executed also carries a task descriptor for describing the type of the computing task; selecting a matched subsystem from the M subsystems for each computing task to be executed according to the task descriptors carried in each computing task to be executed, and selecting a matched processing node from one or more processing nodes in the matched subsystem; and dispatching each computing task to be executed to the matched processing node in the matched subsystem.
In one possible implementation, when the AI system is applied to a virtualization scenario, the AI system includes a plurality of virtual machines, where each of the plurality of virtual machines corresponds to one or more processes, and one process includes one or more computing tasks; the one or more processes run on one or more processing nodes of at least one of the M subsystems; the method further includes: allocating, by the system scheduler, a VM ID to each virtual machine; and sharing the VM ID of the corresponding virtual machine in the page table of the one or more processes corresponding to each virtual machine.
In one possible implementation, when the AI system is applied to a virtualization scenario, the target subsystem further includes a system memory management unit SMMU; the method further includes: sending the memory access request of the computing task to the SMMU through the target processing node, and updating the QoS ID carried in the memory access request of the computing task through the SMMU; receiving, through the SMMU, the memory access request of the computing task sent by the target processing node; determining the target process to which the computing task belongs according to the virtual address and the substream identifier SSID in the memory access request; and determining, according to the page table of the target process, the VM ID of the target virtual machine corresponding to the target process, and replacing the QoS ID carried in the memory access request of the computing task with the VM ID of the target virtual machine.
In one possible implementation, the AI SoC further includes an L2 Cache; the method further comprises the steps of: and receiving access requests of all computing tasks through the L2 Cache, and accessing corresponding storage areas in the L2 Cache according to QoS IDs carried in the access requests of all computing tasks, wherein the access requests carrying different QoS IDs correspond to different storage areas in the L2 Cache.
In a third aspect, the present application provides a semiconductor chip, which may include the AI system provided by any one of the implementations of the first aspect.
In a fourth aspect, the present application provides a semiconductor chip, which may include: the AI system provided by any implementation of the first aspect, an internal memory coupled to the AI system, and an external memory.
In a fifth aspect, the present application provides a semiconductor chip, which may include: a Host provided by any implementation manner of the first aspect.
In a sixth aspect, the present application provides a semiconductor chip, which may include: at least one AI SoC provided by any implementation manner of the first aspect.
In a seventh aspect, the present application provides a system-on-chip SoC chip including the AI system provided by any one of the implementations of the first aspect, an internal memory coupled to the AI system, and an external memory. The SoC chip may be formed by a chip, or may include a chip and other discrete devices.
In an eighth aspect, the present application provides a chip system, which includes the AI system provided by any implementation manner of the first aspect. In one possible design, the AI system further includes a memory for storing program instructions and data necessary or relevant to the chip system during operation. The chip system can be composed of chips, and can also comprise chips and other discrete devices.
In a ninth aspect, the present application provides an electronic device, which may include the AI system provided by any implementation manner of the first aspect.
In a tenth aspect, the present application provides an electronic device having a function of implementing the memory access control method of any one of the implementations of the second aspect described above. The function can be realized by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the function described above.
In an eleventh aspect, the present application provides an AI device having a function of implementing the memory access control method of any one of the implementations of the second aspect described above. The function can be realized by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the function described above.
In a twelfth aspect, the present application provides a computer-readable storage medium storing a computer program that, when executed, implements the memory access control method flow of any one of the implementations of the second aspect described above.
In a thirteenth aspect, an embodiment of the present application provides a computer program including instructions that, when executed, enable the memory access control method flow of any one of the implementations of the second aspect described above to be performed.
Drawings

Fig. 1A is a schematic hardware structure of an AI system according to an embodiment of the present application.
Fig. 1B is a schematic hardware structure of another AI system according to an embodiment of the present application.
Fig. 1C is a schematic hardware structure of yet another AI system according to an embodiment of the application.
Fig. 2A is a schematic diagram of the relationships among service flows, graph nodes, and computing tasks according to an embodiment of the present application.
Fig. 2B is a schematic diagram of a traffic flow direction according to an embodiment of the present application.
Fig. 2C is a schematic diagram of the relationship between traffic flow types and memory bandwidth involved in the operation of a ResNet network according to an embodiment of the present application.
Fig. 3A is a schematic diagram of a framework of a Davinci software stack according to an embodiment of the present application.
Fig. 3B is a schematic diagram of an interaction flow between each software module in a Davinci software stack according to an embodiment of the present application.
Fig. 4A is a schematic diagram of a graph compiling stage and a graph running stage according to an embodiment of the present application.
Fig. 4B is a schematic diagram of a script of the AI model ResNet according to an embodiment of the present application.
FIG. 4C is a schematic diagram of an executable computing task after graph compilation and optimization according to an embodiment of the present application.
Fig. 5A is a schematic diagram of a software architecture for QoS automatic optimization according to an embodiment of the present application.
Fig. 5B is a flowchart of a method for QoS automatic optimization according to an embodiment of the present application.
Fig. 6A is a software architecture diagram of an AI system in a virtual scenario according to an embodiment of the present application.
Fig. 6B is a schematic diagram of an interaction flow between each software module in an AI system under a virtual application scenario according to an embodiment of the present application.
Fig. 7 is a flowchart of a memory access control method according to an embodiment of the present application.
Detailed Description
Embodiments of the present application will be described below with reference to the accompanying drawings. The terms "first," "second," "third," and "fourth" and the like in the description, claims, and drawings are used to distinguish between different objects and not necessarily to describe a particular sequential or chronological order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to those listed steps or elements but may include other steps or elements not listed or inherent to such a process, method, article, or apparatus. Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification do not necessarily all refer to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art will appreciate, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
As used in this specification, the terms "component," "module," "system," and the like are intended to refer to a computer-related entity, either hardware, firmware, a combination of hardware and software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a computing device and the computing device can be a component. One or more components may reside within a process and/or thread of execution, and a component may be localized on one computer and/or distributed between two or more computers. Furthermore, these components can execute from various computer-readable media having various data structures stored thereon. The components may communicate by way of local and/or remote processes, such as in accordance with a signal having one or more data packets (e.g., data from one component interacting with another component in a local system or distributed system, and/or across a network such as the internet with other systems by way of the signal).
First, some terms used in the present application are explained to facilitate understanding by those skilled in the art.
(1) Quality of service (Quality of Service, QoS): in the present application, memory access QoS refers to applying corresponding memory access service control to various classes of computing tasks (such as various traffic flows) under limited memory bandwidth resources, for example, controlling the highest bandwidth, the lowest bandwidth, or the access priority of each class of memory access requests, thereby providing a quality-of-service guarantee of memory access for the corresponding computing tasks (such as the computing tasks under certain traffic flows).
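To make these control parameters concrete, the following minimal C sketch (illustrative only; all names, units, and field widths are assumptions, not taken from the present application) models a per-QoS-ID policy table holding the highest bandwidth, lowest bandwidth, and access priority named above.

```c
/* Minimal sketch of per-QoS-ID access policy parameters.
 * All names, units, and sizes are illustrative assumptions. */
#include <stdint.h>
#include <stdio.h>

#define MAX_QOS_IDS 64

struct qos_policy {
    uint32_t max_bandwidth_mbps; /* highest bandwidth allowed to pass */
    uint32_t min_bandwidth_mbps; /* lowest bandwidth to be guaranteed */
    uint8_t  access_priority;    /* base priority for requests with this ID */
};

static struct qos_policy qos_table[MAX_QOS_IDS];

static void qos_policy_set(uint8_t qos_id, uint32_t max_bw,
                           uint32_t min_bw, uint8_t prio)
{
    qos_table[qos_id] = (struct qos_policy){ max_bw, min_bw, prio };
}

int main(void)
{
    /* A latency-sensitive AI compute flow gets a high floor and priority;
     * a bulk preprocessing flow is capped with a lower ceiling. */
    qos_policy_set(1, 8000, 4000, 6);
    qos_policy_set(2, 2000,  500, 2);
    printf("QoS ID 1 base priority: %u\n", qos_table[1].access_priority);
    return 0;
}
```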
(2) A service set identifier (Service Set Identifier, SSID) can divide a wireless local area network into several sub-networks, each requiring independent identity verification; only users who pass the verification can enter the corresponding sub-network, preventing unauthorized users from entering the network.
(3) Tera operations per second (Tera Operations Per Second, TOPS) represents one trillion (10^12) operations per second and may be used to characterize the computing performance of a processor.
(4) Virtual machines (VMs) are complete computer systems emulated in software, with complete hardware system functionality, running in a completely isolated environment. Work that can be done on a physical computer can be done in a virtual machine. When a virtual machine is created on a computer, part of the physical machine's hard disk and memory capacity is used as the hard disk and memory capacity of the virtual machine.
(5) A vision preprocessor (Davinci Vision Pre-Processor, DVPP) mainly implements video decoding (VDEC), video encoding (VENC), JPEG encoding/decoding (JPEGD/E), PNG decoding (PNGD), the vision preprocessing core (VPC), and the like.
(6) A matrix computation core (AIC) for implementing matrix computation in AI operations.
(7) A scalar computation core (AIV) for implementing scalar computation in AI operations.
(8) A graph generation engine (Graph Engine, GE) in the AI platform is responsible for compiling the intermediate representations (IR) of AI models generated by the various mainstream AI computing frameworks into computation subgraphs that the AI platform (e.g., the Davinci platform) can understand and execute. The main functions of the GE include graph preparation, graph splitting, graph optimization, graph compilation, graph loading, graph execution, graph management, and so on (here, a graph refers to a network model topology graph).
(9) The graph fusion engine (Fusion Engine, FE) in the AI platform is responsible for interfacing the GE with tensor boost engine (Tensor Boost Engine, TBE) operators, and provides the capabilities of loading and managing the operator information base, managing fusion rules, fusing original graphs, and optimizing subgraphs. In the subgraph optimization stage, the GE passes the subgraph to the FE; the FE pre-compiles it according to the operator information base and performs FE fusion optimization, such as modifying data types and inserting conversion operators, and then passes the subgraph back to the GE for subgraph merging and subgraph optimization.
(10) A remote direct memory access controller (Remote Direct Memory Access, RDMA) transfers data over a network directly into a memory area of a computer, moving data quickly from one system into remote system memory without involving the operating system and with little demand on the computer's processing capacity. It eliminates the overhead of external memory copying and context switching, thus freeing memory bandwidth and CPU cycles and improving application system performance.
(11) A hardware implementation of the Ethernet-based RDMA protocol (RDMA over Converged Ethernet, RoCE); based on it, large amounts of data can be transferred at high speed between different machines over a standard Ethernet network.
(12) A SoC-level direct memory access controller (System Direct Memory Access, SDMA) may be used for efficient multi-channel data transfer, providing DMA access and data movement for each subsystem inside the SoC.
(13) The PCI Express direct memory access controller (Peripheral Component Interconnect Express Direct Memory Access, PCIe DMA): PCIe is a high-speed serial computer expansion bus standard, and PCIe DMA is a DMA controller implemented in accordance with the PCIe standard specification, used for efficient data transfer between Host and Device.
(14) The Huawei cache-coherent system (Huawei Cache-Coherent System, HCCS) is a proprietary protocol standard for maintaining data consistency among multiple sockets, and is the physical link of multi-socket CPU interconnection used for cross-chip communication in ARM-architecture servers.
(15) Runtime (run time) refers to the state in which a program is running (or being executed). In some programming languages, certain reusable programs or instances are packaged or rebuilt into "libraries"; these instances may be linked to or invoked by any program while it runs.
(16) The neural network computing architecture (Compute Architecture for Neural Networks, CANN) is a heterogeneous computing architecture proposed for AI scenarios. By providing multi-level programming interfaces, it supports users in quickly building applications and services based on the AI platform, thereby improving user development efficiency and the effective computing power of the AI processor.
(17) The Huawei collective communication library (Huawei Collective Communication Library, HCCL) externally provides collective communication operators, supports the RoCE transmission function between network cards on different nodes of a cluster, and provides efficient data transmission capability for different NPUs in distributed training. As an operator library, it mainly provides communication functions such as Broadcast, AllReduce, ReduceScatter, and AllGather for single-machine multi-card and multi-machine multi-card scenarios, offering efficient data transmission capability in distributed training.
(18) High bandwidth memory (High Bandwidth Memory, HBM) is a high-performance DRAM based on a 3D stacking process, i.e., a memory chip ("RAM"), characterized by high speed and high bandwidth, and suitable for applications requiring high memory bandwidth, such as graphics processors and network switching and forwarding devices (e.g., routers, switches).
(19) A DRV file is a file in a driver package that can be opened with Notepad or WordPad. A DRV file is a driver file created for the hardware devices (both external and internal) connected to and communicating with the Windows operating system. It contains the commands and parameters that specify how the operating system sets up and communicates with the device, and it can also be used to install a device driver on a computer.
(20) Input/output control (Input/Output Control, IOCTL) is, in a computer, a system call dedicated to the input/output operations of a device. The call is passed a request code associated with the device, and the function performed by the system call depends entirely on that request code.
Referring to fig. 1A, fig. 1A is a schematic diagram of a hardware structure of an AI system provided by an embodiment of the present application. The AI system 01 may be located in any electronic device, such as a computer, a mobile phone, a tablet, or a server. The hardware structure of the AI system 01 may specifically be a chip or a chipset, or a circuit board on which the chip or chipset is mounted, and the chip, chipset, or circuit board may operate under the necessary software drive. The AI system 01 shown in fig. 1A may include a host 10 and an AI SoC 20. The AI SoC 20 may include M subsystems (as shown in fig. 1A, subsystem 201-1, …, subsystem 201-M) and N memory controllers 203 (as shown in fig. 1A, memory controller 203-1, memory controller 203-2, …, memory controller 203-N) connected by a system on chip (SoC) bus 202, where each memory controller is configured to control at least one memory. The M subsystems include a target subsystem, which is any one of the M subsystems (for convenience of description, the subsequent embodiments take subsystem 201-1 as the target subsystem; it is to be understood that this example does not constitute any limitation on the target subsystem itself). The target subsystem includes S processing nodes (as shown in fig. 1A, Master 1, Master 2, …, Master S). It should be noted that, for convenience of description in the subsequent related embodiments of the present application, a processing node may be named or translated as a Master, or a processing node may be understood as another type of node including a Master; it is to be understood that equating a processing node with a Master, or taking a Master as an example, does not constitute any limitation on the processing node itself. The S Masters include a target Master, which is any one of the S Masters (for convenience of description, the subsequent embodiments take Master1 in the target subsystem 201-1 as the target Master; it is likewise to be understood that this example does not constitute any limitation on the target Master itself). It is further understood that the numbers of Masters included in different subsystems may be equal or unequal, which is not specifically limited in the embodiments of the present application. M, N, and S are integers greater than or equal to 1. Optionally, referring to fig. 1B, fig. 1B is a schematic diagram of a hardware structure of another AI system according to an embodiment of the present application; compared with the AI system in fig. 1A, the AI system in fig. 1B may further include an advanced memory access agent (MATA) 205 for overall management of the N memory controllers (203-1 to 203-N).
The various components and their functions involved in the AI system 01 or the AI system 02 in the embodiment of the present application are exemplarily described below from top to bottom in conjunction with the above-described hardware structure of the AI system 01 shown in fig. 1A:
The host 10 may include a host CPU, an internal memory, and optionally a host controller, other input/output controllers, interfaces, and other physical devices not shown in fig. 1A. A host system (Host System), such as X86 or ARM, may run on the host CPU. In the embodiment of the present application, the host 10 may serve as the center of traffic deployment and task management for the AI system 01 or the AI system 02, and is used to manage a plurality of hardware accelerators (Devices), such as various SoCs, which at least include the AI SoC 20 described in the present application. Specifically, the functions of the host 10 include managing tasks, communicating instructions, or providing specific services to the respective SoCs. Further, the host 10 can identify the type of the service flow issued by the user (including segmentation and identification of the AI computing framework and model, etc.) and allocate a proper QoS ID to the corresponding computing tasks in the service flow, that is, mark each computing task with an appropriate QoS label, and then allocate to each hardware accelerator (Device), based on its processing capability characteristics, suitable computing tasks carrying QoS IDs. Optionally, the host 10 may also attach to each computing task an initial QoS priority corresponding to the QoS ID carried by that task, i.e., a second QoS priority.
The AI SoC 20 is an artificial intelligence system on chip. As shown in fig. 1A or 1B, the AI SoC 20 may specifically include a system scheduler 200, a plurality of subsystems (such as subsystem 201-1, …, subsystem 201-M), a system-on-chip bus 202, a plurality of memory controllers (such as memory controller 203-1, …, memory controller 203-N), and further a plurality of memories (memory 204-1, memory 204-2, …, memory 204-N); that is, each memory controller is configured to control at least one memory, where any one memory may be a high bandwidth memory (High Bandwidth Memory, HBM), a double data rate (DDR) memory, or the like. Wherein,
After a computing task carrying a QoS ID is issued from the host 10 to the AI SoC 20, it may first pass through the system scheduler 200 in the AI SoC 20, which may schedule each computing task, according to its type (e.g., according to the task descriptor carried in the computing task), to a subsystem suited to executing that task, and further to an appropriate Master.
Each of the subsystems (201-1 to 201-M) may be an integrated circuit with a dedicated function or an accelerator for accelerating a function, as shown in fig. 1A or 1B. For example, a subsystem may be an artificial intelligence core (AI Core), a vision preprocessor (DVPP), an image signal processor (ISP), a sound signal processor (ASP), a SoC-level DMA controller (SDMA), a remote direct memory access controller (RDMA), a peripheral interconnect expansion bus direct memory access controller (PCIe DMA), an encryption and decryption engine, a general purpose CPU, or the like. Within each subsystem, the Masters may be interconnected by an internal connection bus (Connect bus) 211, and each subsystem may further include a sub-scheduler 212. It should be noted that not all of the M subsystems need be AI subsystems; they may include subsystems not used for AI computation but operating in cooperation with the AI subsystems. Accordingly, not all the computing tasks are AI computing tasks; there may be tasks that cooperate to complete AI computation, or general computing tasks. Wherein,
The processing nodes (Master 1 to Master S) can be understood, in the present application, as requesters capable of initiating memory access requests, i.e., sources of memory access requests or data requesters; the one or more processing nodes (such as Masters) in each subsystem may represent the multiple cores of that subsystem. When a computing or communication task needs to access the memory system, each Master sends the QoS ID carried by the task, together with information such as the data and the address of the memory to be accessed, to the system-on-chip bus 202 through the sub-scheduler 212. For example, when the subsystem is a general-purpose CPU, the multiple Masters within the subsystem may represent general-purpose CPU cores; when the subsystem is a GPU, they may represent GPU cores; when the subsystem is an NPU, they may represent NPU cores; and so on.
The sub-scheduler 212 may be configured to schedule the memory access requests generated by all the Masters in its subsystem during execution of computing tasks. For example, it establishes a queue of memory access requests for each Master, stores the requests generated by each Master in order, and schedules the requests in each queue onto the system-on-chip bus (SoC Connection Bus) 202 in the AI SoC 20 according to their order in the queue and the QoS priority corresponding to the QoS ID carried in each request.
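As an illustration of this queueing behavior, the sketch below (hypothetical; the present application prescribes no particular data structure) keeps one FIFO per Master and, at each scheduling step, forwards the queue-head request with the highest QoS priority, preserving per-Master ordering.

```c
/* Hypothetical sub-scheduler sketch: one FIFO per Master; queue heads
 * compete by QoS priority, while order within each queue is preserved. */
#include <stddef.h>
#include <stdint.h>

#define NUM_MASTERS 4
#define QUEUE_DEPTH 16

struct mem_request {
    uint64_t addr;     /* memory address to access */
    uint8_t  qos_id;   /* class of the originating computing task */
    uint8_t  qos_prio; /* QoS priority corresponding to the QoS ID */
};

struct fifo {
    struct mem_request buf[QUEUE_DEPTH];
    size_t head, count;
};

static struct fifo queues[NUM_MASTERS];

/* Store each Master's requests in arrival order. */
static void subsched_enqueue(int master, struct mem_request r)
{
    struct fifo *q = &queues[master];
    if (q->count == QUEUE_DEPTH)
        return; /* queue full: stall or drop (handling omitted) */
    q->buf[(q->head + q->count) % QUEUE_DEPTH] = r;
    q->count++;
}

/* Dequeue the queue-head request with the highest QoS priority and
 * hand it to the SoC bus; returns 0 if all queues are empty. */
static int subsched_dispatch(struct mem_request *out)
{
    int best = -1, best_prio = -1;
    for (int m = 0; m < NUM_MASTERS; m++) {
        if (queues[m].count == 0)
            continue;
        int prio = queues[m].buf[queues[m].head].qos_prio;
        if (prio > best_prio) { best_prio = prio; best = m; }
    }
    if (best < 0)
        return 0;
    *out = queues[best].buf[queues[best].head];
    queues[best].head = (queues[best].head + 1) % QUEUE_DEPTH;
    queues[best].count--;
    return 1;
}
```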
The system-on-chip bus (SoC Connection Bus) 202 connects the subsystems (201-1 to 201-M) and the memory controllers (203-1 to 203-N) in the AI SoC 20 and realizes data communication between them in a bus manner. That is, the memory access requests (access requests for short) of the subsystems (201-1 to 201-M) may be issued to the corresponding memory controllers after arbitration, address resolution, and routing by the system-on-chip bus 202. It will be appreciated that the specifications of the system-on-chip bus may also define the relationships among the modules, such as drive, timing, and policy, during initialization, arbitration, request transmission, response, transmission, and reception.
The advanced memory access agent (Memory Advanced Technology Agent, MATA) may be configured to manage the N memory controllers (203-1 to 203-N) as a whole. It may configure corresponding access policy control parameters (such as one or more of the highest bandwidth, the lowest bandwidth, and the access priority with which access requests are allowed to pass) for each QoS ID, and generate an optimized QoS priority for each access request, that is, regulate the second QoS priority (such as the initial or default priority corresponding to the QoS ID) carried in each access request to generate a corresponding first QoS priority (which may be raised or lowered relative to the original QoS priority). For example, the final QoS priority of an access request is calculated by combining the historical bandwidth statistics corresponding to its QoS ID with the access policy configured for that QoS ID. In addition, the MATA may count and record the historical memory bandwidth occupied by each QoS ID on each memory controller. In the embodiment of the application, the QoS ID carried in a given computing task may remain unchanged, while the QoS priority corresponding to that QoS ID may change, and the access policy control parameters corresponding to a given QoS ID may also change. Alternatively, the MATA may be disposed outside the N memory controllers or on a certain memory controller; that is, the MATA and the memory controllers may be independent physical entities, or the MATA may be integrated on one or more memory controllers. The embodiment of the present application does not limit the physical relationship between the MATA and the memory controllers.
The memory controllers (203-1 to 203-N) are used to control the memories and are responsible for data exchange between the memories and the subsystems. A memory controller determines, according to the address in an access request sent by a subsystem, which memory the data is written to, or which memory the data is read from before being returned to the corresponding subsystem. In addition, the memory controller may perform virtual-to-physical address mapping, memory access rights control, cache support, and the like. Specifically, after receiving an access request scheduled by the MATA, the target memory controller parses the read/write address in the request and the QoS priority carried in it (i.e., the first QoS priority), and performs memory access QoS control on the request according to that QoS priority and the service conditions of the memory controller (such as memory access timing requirements, or memory bandwidth and bus utilization), for example deciding whether to allow access immediately, to buffer the command and wait for the next scheduling round, or to arbitrate access. If the request is scheduled in the current round, it is sent to a specific memory access unit to perform the read/write operation, and the memory controller updates the bandwidth statistics corresponding to the QoS ID. It should be noted that the final QoS priority calculated by the MATA may be just one of the factors on which the memory controller bases its memory access QoS decision; other factors may include, for example, timing and a starvation-avoidance mechanism. It can be appreciated that in the AI system of the embodiment of the present application, the QoS priority is decisive in some cases but is weighed against the actual memory conditions in others, so a higher QoS priority does not necessarily yield better final memory access service quality.
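The per-request decision described above might be modeled as in the following sketch (assumed names and thresholds; as the text notes, real controllers also weigh DRAM timing and starvation-avoidance, which are omitted here).

```c
/* Hypothetical sketch of the memory controller's per-request decision:
 * admit immediately or buffer the command for the next scheduling round.
 * Thresholds are assumptions; timing and starvation logic are omitted. */
#include <stdint.h>

enum mc_action { MC_ADMIT, MC_BUFFER };

struct mc_state {
    unsigned outstanding;    /* requests currently in flight */
    unsigned high_watermark; /* admission limit under high utilization */
    uint64_t bw_used[64];    /* per-QoS-ID bandwidth statistics (bytes) */
};

static enum mc_action mc_decide(struct mc_state *mc, uint8_t qos_id,
                                uint8_t first_prio, uint32_t bytes)
{
    /* Low utilization: admit regardless of priority. */
    if (mc->outstanding < mc->high_watermark / 2) {
        mc->bw_used[qos_id] += bytes; /* update the QoS ID's statistics */
        mc->outstanding++;
        return MC_ADMIT;
    }
    /* High utilization: only higher-priority requests pass this round. */
    if (first_prio >= 4 && mc->outstanding < mc->high_watermark) {
        mc->bw_used[qos_id] += bytes;
        mc->outstanding++;
        return MC_ADMIT;
    }
    return MC_BUFFER; /* buffered command waits for the next round */
}
```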
The memories (204-1 to 204-N), which may also be referred to as internal memories, are typically volatile memories that lose their stored contents on power failure; they may also be called memory (Memory) or main memory. The internal memory in the present application includes readable and writable running memory, which is used to temporarily store operation data of the subsystems (201-1 to 201-M) and to exchange data with the external memory of the AI SoC 20, and can serve as a storage medium for temporary data of the operating system or other running programs. For example, a processing program, operating program, or operating system running on Master1 in subsystem 201-1 to perform computing tasks calls the data to be operated on from internal memory 204-2 into Master1 for operation, and when the operation is completed, Master1 transmits out the result. Alternatively, the internal memory may comprise one or more of dynamic random access memory (DRAM), static random access memory (SRAM), synchronous dynamic random access memory (SDRAM), and the like. DRAM further includes double data rate synchronous dynamic random access memory (Double Data Rate Synchronous Dynamic Random Access Memory, DDR SDRAM, abbreviated DDR), second-generation DDR (DDR2), third-generation DDR (DDR3), fourth-generation low-power double data rate synchronous dynamic random access memory (Low Power Double Data Rate, LPDDR4), fifth-generation low-power double data rate synchronous dynamic random access memory (LPDDR5), and the like.
It is to be understood that the structure illustrated in the embodiment of the present application does not constitute a specific limitation on the AI system 01 or the AI system 02. In other embodiments of the application, the AI system 01 or AI system 02 can include more or fewer components than shown, or can combine certain components, or can split certain components, or can have a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Referring to fig. 1C, fig. 1C is a schematic diagram of a hardware structure of yet another AI system according to an embodiment of the present application. The AI system 03 can be located in any electronic device, such as a computer, a mobile phone, or a tablet. The hardware structure of the AI system 03 may be a chip or a chipset, or a circuit board on which the chip or chipset is mounted, and the chip, chipset, or circuit board may operate under the necessary software drive. From the viewpoint of the implemented functions, the AI system 03 in fig. 1C differs from the AI system 01 in fig. 1A or the AI system 02 in fig. 1B in that the hardware structure of the AI system 03 may be used to support bandwidth isolation between virtual machine tenants in a virtualized scenario, including user data isolation and computing power resource isolation between different virtual machine users, so as to meet the requirement that the services of different users do not affect each other. From the viewpoint of hardware architecture, the main difference is that, in addition to the components shown in the AI system 01 or the AI system 02, the subsystem of the AI SoC in the AI system 03 further includes a system memory management unit (SMMU) 210 and a level-two cache (L2 Cache) 206; optionally, the AI SoC in the AI system 03 may further include an advanced memory access agent MATA 205.
In the following, in conjunction with the hardware structure of the AI system 03 shown in fig. 1C, the components and functions involved in the AI system 03 in the embodiment of the present application are exemplarily described from top to bottom. It will be understood that only the newly added components in fig. 1C are described below; for the components that fig. 1C shares with fig. 1A or 1B, reference may be made to the related descriptions of fig. 1A or 1B, which are not repeated herein. Wherein,
A system memory management unit (System Memory Management Unit, SMMU) 210 may be located within each subsystem, between each Master and the connection bus 211. The SMMU 210 may perform rights management (controlling the access rights of different programs, whose address spaces differ), address mapping (translating between virtual addresses and physical addresses), and physical memory management (managing the physical memory resources of the system and providing user programs with operation interfaces such as allocation and release of physical memory). Further, the SMMU may also be used to enforce isolation in a virtualized scenario, such as address isolation between different processes, isolation of physical address spaces, and isolation of memory bandwidth. Because each virtual machine corresponds to a unique VM ID in the embodiment of the application, in a virtualized scenario the address translation of the system memory management unit includes converting the QoS ID: specifically, the SMMU looks up the page table corresponding to the current process and obtains the virtual machine identifier (VM ID) in that page table. That is, the page tables of all the different processes in the same VM ultimately yield the same QoS ID. In the embodiment of the application, the QoS IDs of all the different processes in the same VM are converted into the same QoS ID, and different VMs have different VM IDs, so the QoS IDs corresponding to processes under different VMs ultimately differ. It should be noted that the VM ID in the page table is allocated by the system when the virtual machine is created and is queried only by the SMMU.
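The QoS ID conversion just described can be illustrated with the following sketch (structures are assumptions; real SMMU page-table formats differ): the SMMU reads the VM ID that the system wrote into the process's page table when the VM was created and substitutes it for the request's QoS ID, so that every process of one VM maps to the same QoS ID.

```c
/* Hypothetical sketch of SMMU-side QoS ID conversion in a virtualized
 * scenario. Structures are illustrative; real page tables differ. */
#include <stdint.h>

struct page_table {
    uint16_t vm_id; /* written by the system when the VM is created */
    /* ...translation entries omitted... */
};

struct process {
    struct page_table *pt; /* page table of the current process */
};

/* Replace the per-process QoS ID with the VM-level one, so that all
 * processes of one VM share one QoS ID and different VMs differ. */
static uint16_t smmu_convert_qos_id(const struct process *proc)
{
    return proc->pt->vm_id; /* queried only by the SMMU */
}
```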
It should be noted that, in the embodiment of the present application, the System Memory Management Unit (SMMU) is a functional unit inside each subsystem (201-1 to 201-M) and is used to "translate" program addresses into physical addresses; the memory controllers (203-1-203-N) may be external devices to the subsystems (201-1-201-M) and may be responsible for mapping physical addresses to specific memory locations.
The L2 cache (L2 Cache) 206 is a memory enabling high-speed data exchange; in the present application, it exchanges data with the subsystems (201-1 to 201-M) ahead of the memories (204-1 to 204-N) and is therefore faster. It will be appreciated that a cache internal to a subsystem (201-1 to 201-M) is generally referred to as an L1 cache, while the L2 cache shown in fig. 1C is external to the subsystems, i.e., an external cache, located between the subsystems and the memories (204-1 to 204-N). In the embodiment of the application, the L2 cache may be applied in a virtualized scenario: when each virtual machine comes online, a corresponding VM ID may be configured for it through a related management unit, and a corresponding storage area (such as an address range and a storage space size) may be configured for it in the L2 cache. That is, the memory access requests corresponding to each virtual machine are only allowed to access the storage area configured for that virtual machine, and not the storage areas of other virtual machines, so that the AI system can achieve secure cache isolation in the virtualized scenario.
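A minimal sketch of this per-VM cache isolation follows (assumed interface; the present application gives none): each VM ID is bound to an address window in the L2 cache, and an access is allowed only if it falls inside its own VM's window.

```c
/* Hypothetical sketch of L2-cache isolation between virtual machines:
 * each VM ID is bound to one storage region; accesses outside it fail. */
#include <stdbool.h>
#include <stdint.h>

#define MAX_VMS 16

struct cache_region {
    uint64_t base; /* start of the region assigned to this VM */
    uint64_t size; /* size of the region */
};

static struct cache_region l2_regions[MAX_VMS];

/* Configure a VM's region when the VM comes online. */
static void l2_region_set(uint16_t vm_id, uint64_t base, uint64_t size)
{
    l2_regions[vm_id] = (struct cache_region){ base, size };
}

/* Allow a request only if it targets its own VM's region. */
static bool l2_access_allowed(uint16_t vm_id, uint64_t addr, uint64_t len)
{
    const struct cache_region *r = &l2_regions[vm_id];
    return addr >= r->base && addr + len <= r->base + r->size;
}
```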
It is to be understood that the structure illustrated in the embodiment of the present application does not constitute a specific limitation of the AI system 03. In other embodiments of the application, the AI system 03 can include more or fewer components than shown, or can combine certain components, or split certain components, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
It should be noted that the AI system architectures shown in fig. 1A, 1B and 1C are only exemplary implementations of the embodiments of the present application, and the AI system architectures in the embodiments of the present application include, but are not limited to, the above architectures.
Based on the hardware architecture of the AI system in fig. 1A, 1B, or 1C, in the embodiment of the present application, the specific functions implemented by the AI system 01, 02, or 03 may include the following:
The target Master (Master1 in this example) among the S processing nodes (Masters) included in the target subsystem (subsystem 201-1 in this example) in the AI SoC 20 is configured to: receive a computing task to be executed, where the computing task carries a quality of service identifier (QoS ID), the target Master is any one of the S Masters, and the QoS ID indicates the class to which the computing task belongs; generate a memory access request for the computing task according to the memory address and data that the computing task needs to access, where the memory access request carries the QoS ID; and send the memory access request of the computing task to a target memory controller among the N memory controllers. The target memory controller (memory controller 203-1 in this example) is configured to: receive the access request, determine a first quality of service (QoS) priority corresponding to the QoS ID, and perform memory access QoS control on the access request based on the first QoS priority.
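The flow just specified, from a computing task carrying a QoS ID to a memory access request inheriting that ID and a controller resolving its first QoS priority, could be sketched end to end as follows (illustrative types only; the lookup table is an assumption standing in for the controller-side priority determination):

```c
/* Hypothetical end-to-end sketch: the QoS ID travels unchanged from
 * the computing task into the memory access request it generates, and
 * the memory controller resolves a first QoS priority for that ID. */
#include <stdint.h>

struct compute_task {
    uint8_t  qos_id;   /* indicates the class the task belongs to */
    uint64_t mem_addr; /* memory address the task needs to access */
    uint32_t bytes;    /* amount of data to read or write */
};

struct mem_request {
    uint8_t  qos_id;   /* inherited unchanged from the task */
    uint64_t addr;
    uint32_t bytes;
};

/* Step performed by the target Master while executing the task. */
static struct mem_request master_make_request(const struct compute_task *t)
{
    return (struct mem_request){ t->qos_id, t->mem_addr, t->bytes };
}

/* Step performed by the target memory controller: map the carried
 * QoS ID to a first QoS priority (lookup table assumed). */
static uint8_t first_prio_of[256];

static uint8_t mc_resolve_first_priority(const struct mem_request *r)
{
    return first_prio_of[r->qos_id];
}
```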
Specifically, when computing tasks are allocated to the Masters in the subsystems of the AI SoC, each computing task carries a quality of service identifier QoS ID that represents the class to which the task belongs, and the memory access QoS priority for the task can ultimately be determined from that class. This is based on the fact that, in the AI computing field, computing tasks of different classes (such as computing tasks under different service flows) place different requirements on memory access quality, and memory access competition exists between some classes of computing tasks but not between others. Therefore, setting QoS priorities matched to the class of each access request can better satisfy the access requirements of the different classes of computing tasks (it can be understood that different classes of computing tasks correspond to different QoS IDs, but different QoS IDs may correspond to either the same or different QoS priorities). Further, while executing a received computing task, each Master generates the access request of the task according to the memory address and data to be accessed, and carries in the access request the QoS ID carried by the task; that is, the QoS ID is transferred from the computing task into the corresponding access request. Thus, when a memory controller later receives the access request, it can perform memory access control at the priority corresponding to the carried QoS ID: for example, the higher the QoS priority corresponding to the QoS ID, the better the memory access service quality the memory controller can provide for access requests carrying that QoS ID. In other words, different access QoS control is applied to computing tasks with different access priority requirements, thereby avoiding the serious system performance degradation caused in the prior art by indiscriminate treatment and random preemption of critical memory bandwidth resources. Optionally, the first QoS priority corresponding to the QoS ID may be an initial QoS priority, for example one set by the target Master, or one preset by the host in the AI system; or the first QoS priority may be the final QoS priority corresponding to the QoS ID, that is, a priority obtained after the initial QoS priority is adjusted, for example through adjustments and optimizations performed while the access request circulates, such as temporarily raising or lowering the QoS priority, restoring it after a temporary change, or applying a series of temporary adjustments followed by a final one. The embodiment of the present application does not specifically limit this.
Optionally, in the embodiment of the present application, different service flows may correspond to different QoS IDs; that is, the computing tasks split from different service flows carry different QoS IDs, while computing tasks split from the same service flow correspond to the same QoS ID. The QoS ID is then used to indicate the service flow type. In other words, in the embodiment of the present application, computing tasks may be classified according to service flow type, with tasks of the same class corresponding to the same QoS ID and tasks of different classes corresponding to different QoS IDs. Alternatively, the QoS ID may indicate other types or classification manners (such as the computation type of a task, the importance of a task, the time period in which a task executes, the memory access latency requirement of a task, or the purpose for which a task accesses memory). That is, whenever it can be determined in the AI system that a memory access competition relationship exists between two classes of computing tasks, the tasks may be classified in that manner, with tasks of the same class assigned the same QoS ID and tasks of different classes assigned different QoS IDs.
In the AI computing field, an on-chip memory access quality of service (QoS) control technology is introduced: each computing task to be allocated to the AI SoC in the AI system is marked with a QoS ID, and different classes of computing tasks carry different QoS IDs (classified, for example, by the service flow to which the task belongs, or by the tasks' different access latency requirements). The QoS priority corresponding to the access request of each computing task can then be determined from the QoS ID it carries, and QoS control is finally performed on each access request based on that priority. This achieves memory access QoS control over the memory of the AI system at the granularity of computing tasks, provides different memory access service guarantees for different classes of computing tasks (such as different service flows), and ultimately yields better AI computing performance on the basis of the existing computing power and memory bandwidth resources of the AI system. By contrast, the prior art performs access control only at the Master level in the SoC (i.e., all access requests of the same Master are controlled under a uniform memory access service quality), which cannot satisfy the actual access requirements of the various computing tasks in the AI system (such as computing tasks belonging to different service flows) and ultimately results in poor computing performance of the AI system.
In summary, the embodiment of the application realizes memory access service quality control at the granularity of computing tasks, and solves the problem of insufficient memory bandwidth caused by concurrent competition of various computing tasks (such as different types of service flows) for memory bandwidth in AI training and reasoning tasks. Computing tasks with stricter latency requirements in AI training and reasoning can be guaranteed preferentially, so that memory bandwidth resources are utilized more fully and efficiently, memory access load balancing is finally achieved across the whole AI system, and the overall execution performance and efficiency of the AI system are improved. In addition, when computing tasks are classified by service flow type, the embodiment of the application addresses the performance jitter of AI training and reasoning tasks that arises from the lack of means to identify, control, and optimize the priorities of service flows; such jitter (for example, latency jitter during training) greatly hinders improvement of the scaling linearity of AI clusters, so that the computing power of large AI clusters cannot be utilized to the fullest, wasting precious AI computing resources and increasing customers' model training cost and time overhead. The embodiment of the application can avoid such performance jitter (e.g., latency jitter) and improve the linearity of the AI system.
In a possible implementation manner, the computing task further carries a second QoS priority corresponding to the QoS ID, where the second QoS priority is the initial QoS priority (such as a base or default priority) corresponding to the QoS ID in the computing task. In the embodiment of the present application, in addition to the QoS ID, a computing task allocated to a Master in a subsystem of the AI system may also carry the initial QoS priority (i.e., the second QoS priority) corresponding to its QoS ID. That is, the QoS ID and the corresponding initial QoS priority may be configured for the computing task at the moment of allocation, so that subsequent adjustment and control of the QoS priority, and the corresponding memory access QoS control, can proceed from the QoS ID and this initial priority. Optionally, the QoS ID carried in the memory access request of a computing task may remain unchanged during its transfer from the Master to the target memory controller, while the corresponding QoS priority may be adjusted and optimized according to the differing requirements encountered during scheduling. The reason is that the class to which a computing task belongs does not change during its execution (at least not within a given period), so the QoS ID indicating that class naturally need not change; however, as the access request is scheduled, many other factors and conditions must be considered, so the QoS priority corresponding to a given QoS ID may change.
In the present application, since each subsystem in the AI SoC 20 of the AI system 01 or 02 includes one or more Masters, the question arises, when multiple Masters in a subsystem execute computing tasks in parallel, of the rule by which the memory access requests of those tasks are scheduled onto the system-on-chip bus 202 and finally sent to the corresponding memory controllers. Therefore, the AI system 01 or 02 in the embodiment of the present application may further provide: scheduling, inside each subsystem, the access requests generated by the multiple Masters of that subsystem. In connection with some embodiments of the present application, the following describes in detail how the AI system 01 or 02 schedules, within a subsystem, the access requests generated by the Masters during execution of computing tasks.
In a possible implementation manner, the target subsystem further comprises a sub-scheduler, and the target Master is specifically configured to send the access request to the sub-scheduler, which schedules it to the target memory controller among the N memory controllers. The sub-scheduler is configured to: receive the access requests sent by the S Masters in the target subsystem; and schedule the access requests sent by the S Masters onto the SoC bus according to the second QoS priorities corresponding to the QoS IDs carried in those requests, where the second QoS priority is the initial QoS priority corresponding to the QoS ID and indicates the priority with which the corresponding access request is scheduled onto the SoC bus.
In the embodiment of the application, each subsystem of the AI SoC in the AI system further comprises a sub-scheduler, which can be used to schedule the memory access requests of the computing tasks being executed on all the Masters in the subsystem; the access requests generated by the Masters are first scheduled by the internal sub-scheduler, then sent to the SoC bus for arbitration, address resolution, and routing, and finally issued to the corresponding memory controllers for memory access. Because each access request carries the QoS ID of its computing task, the sub-scheduler in each subsystem can, during scheduling, preferentially dispatch to the SoC bus the access requests whose QoS IDs correspond to higher QoS priorities, and delay those with lower ones. QoS priorities are thus taken into account already when access requests are issued to the SoC bus, so that, from the source of the whole AI system onward, each computing task is provided with access control service matched to the QoS ID of its access requests.
In one possible implementation manner, the sub-scheduler is specifically configured to: establish a task queue for each of the S Masters, where each task queue contains the access requests sent by the corresponding Master, and the target Master corresponds to a target task queue; when a target access request is currently inserted into the target task queue, raise the second QoS priorities corresponding to the QoS IDs carried in all the access requests in the target task queue to third QoS priorities, where the target access request is an access request whose second QoS priority exceeds a preset priority; and send the access requests in the task queues of the S Masters to the SoC bus in order, according to the second or third QoS priorities corresponding to the QoS IDs carried in those requests. That is, the second QoS priority may be understood as the original QoS priority corresponding to the QoS ID, and the third QoS priority as an associated (escalated) priority corresponding to the QoS ID.
In the embodiment of the application, when the sub-scheduler of a subsystem schedules access requests, a task queue is created for the computing tasks in each Master, all access requests generated by a Master are placed in one task queue, and the requests are sent to the SoC bus in order of the QoS priorities corresponding to the QoS IDs they carry. When an access request of higher QoS priority appears in a task queue, the situation must be avoided in which all access requests in the queue are blocked because the request at the front of the queue has low QoS priority (head-of-line blocking). Therefore, in the embodiment of the present application, the sub-scheduler raises the QoS priority of all access requests in that task queue (i.e., from the second QoS priority to the third QoS priority), so that no access request in the queue (in particular the high-priority request mentioned above) is blocked by a low-priority request at the front (or head) of the queue, thereby optimizing the efficiency and effect of QoS access control as a whole.
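A sketch of this escalation rule follows (the threshold and the escalated value are assumptions): inserting a request whose initial (second) priority exceeds a preset level raises every request already in the queue to a third priority, so low-priority queue heads cannot block it.

```c
/* Hypothetical sketch of head-of-line-blocking avoidance: inserting a
 * request above a preset priority escalates the whole queue to a
 * "third" QoS priority; the SoC bus later restores the second one. */
#include <stddef.h>
#include <stdint.h>

#define DEPTH           16
#define PRESET_PRIORITY 5 /* assumed escalation threshold */
#define THIRD_PRIORITY  7 /* assumed escalated priority value */

struct mem_request {
    uint8_t qos_id;
    uint8_t second_prio; /* initial QoS priority of the QoS ID */
    uint8_t sched_prio;  /* priority actually used for scheduling */
};

struct task_queue {
    struct mem_request buf[DEPTH];
    size_t count;
};

static void queue_insert(struct task_queue *q, struct mem_request r)
{
    if (q->count == DEPTH)
        return; /* queue full: stall or drop (handling omitted) */
    r.sched_prio = r.second_prio;
    if (r.second_prio > PRESET_PRIORITY) {
        /* Escalate everything already queued so this urgent request is
         * not stuck behind low-priority requests at the queue front. */
        for (size_t i = 0; i < q->count; i++)
            q->buf[i].sched_prio = THIRD_PRIORITY;
    }
    q->buf[q->count++] = r;
}
```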
In the present application, when each subsystem in the AI SoC 20 sends the access requests of its Masters to the SoC bus, the question arises of how the SoC bus should schedule the access requests sent by the multiple subsystems. Therefore, the AI system in the embodiment of the present application may further provide: scheduling the access requests sent by the subsystems to the corresponding memory controllers. In connection with some embodiments of the present application, the following describes how the AI system 01 or 02 schedules the access requests from the subsystems to the corresponding memory controllers.
In one possible implementation, the SoC bus is configured to: receiving one or more access requests in the target task queue sent by the sub-scheduler, wherein the one or more access requests comprise the access request; and restoring the third QoS priority corresponding to the QoS ID carried in one or more access requests in the target task queue to the corresponding second QoS priority.
In the embodiment of the application, the access requests in each task queue have their QoS priorities adjusted by the sub-scheduler of each Master's subsystem and are scheduled onto the SoC bus according to the adjusted priorities. At that point, the risk that low-QoS-priority requests block a task queue has already been eliminated by the priority adjustment inside the subsystem, so once the access requests have been scheduled from the task queues onto the SoC bus, they can be restored to their previous QoS priorities, i.e., the third QoS priorities are restored to the corresponding second QoS priorities. This facilitates performing memory access QoS control according to the QoS priorities initially allocated by the AI system to the QoS ID of each computing task.
In one possible implementation, the SoC bus is further configured to: schedule the one or more access requests in the target task queue to the corresponding memory controllers among the N memory controllers, based on the restored second QoS priorities of those requests.
In the embodiment of the application, after the SoC bus restores the QoS priority of the QoS ID of each access request scheduled from a subsystem to the initial second QoS priority, it can schedule the access requests according to the restored second QoS priority, that is, dispatch each access request to the corresponding memory controller, which then performs subsequent memory access QoS control and memory access.
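On the bus side, the corresponding restore-and-dispatch step might look like the following (continuing the assumptions of the previous sketch; arbitration and routing details are omitted):

```c
/* Hypothetical SoC-bus sketch: on receiving a request from a
 * sub-scheduler, drop any temporary third priority and schedule by
 * the restored second QoS priority tied to the QoS ID. */
#include <stdint.h>

struct mem_request {
    uint8_t qos_id;
    uint8_t second_prio; /* initial priority for the QoS ID */
    uint8_t sched_prio;  /* may hold an escalated third priority */
};

static void soc_bus_receive(struct mem_request *r)
{
    r->sched_prio = r->second_prio; /* restore: third -> second */
    /* The bus then arbitrates and routes the request, in sched_prio
     * order, to the memory controller selected by its address
     * (details omitted). */
}
```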
In one possible implementation, the AI SoC further includes an advanced memory access agent MATA, and the SoC bus is specifically configured to: send one or more access requests in the target task queue to the MATA, which schedules them to the corresponding memory controllers among the N memory controllers. Optionally, in another possible implementation manner, the SoC bus is specifically configured to: send the access requests sent by the S Masters, respectively, to the MATA, which schedules them to the corresponding memory controllers among the N memory controllers, where the access requests sent by the S Masters include the access request.
In the embodiment of the present application, the AI SoC may further include a memory access agent MATA for performing memory access control. When the SoC bus schedules the access requests of the subsystems to the corresponding memory controllers, the requests may specifically be routed through the MATA, which comprehensively controls and manages the memory controllers and may further regulate the received access requests, for example by further optimizing the second QoS priority corresponding to the QoS ID in each access request.
In one possible implementation, the MATA is configured to: receive the access request and determine the second QoS priority corresponding to the QoS ID carried in it; and determine the first QoS priority corresponding to the QoS ID based on that second QoS priority, in combination with the historical memory bandwidth statistics corresponding to the QoS ID and the access policy control parameters corresponding to the QoS ID, where the access policy control parameters include one or more of the highest bandwidth, the lowest bandwidth, and the access priority with which access requests are allowed to pass. For example, the MATA compares the historical memory bandwidth statistics corresponding to the QoS ID (such as the sum of the bandwidths occupied on all N memory controllers by all access requests carrying that QoS ID) with the access policy control parameters (such as the highest bandwidth actually configured for that QoS ID), calculates a floating priority corresponding to the QoS ID, and then adds the floating priority to, or subtracts it from, the second QoS priority corresponding to the QoS ID, finally obtaining the first QoS priority of the QoS ID.
In the embodiment of the present application, after the MATA receives each access request scheduled by the SoC bus, it may further optimize and adjust the initial priority (i.e., the second QoS priority) carried in the request. A specific adjustment principle may be as follows: before the access request is scheduled to a memory controller, the MATA generates the final QoS priority of the QoS ID (i.e., the first QoS priority) from the initial QoS priority (the second QoS priority) carried in the request, combined with the historical memory bandwidth statistics corresponding to the QoS ID and the access policy control parameters for the QoS ID currently recorded and stored by the MATA, so that the target memory controller can finally perform access control according to this final priority. That is, when performing access control, the MATA considers not only the QoS priority initially configured for each QoS ID by the AI system, but also the historical bandwidth statistics corresponding to each QoS ID (such as the memory bandwidth currently obtained by the class of computing tasks carrying the same QoS ID) and the access policy control parameters configured for it (such as the highest bandwidth, the lowest bandwidth, and the access priority with which access requests are allowed to pass), so as to comprehensively decide what access QoS control service to provide to the current access request and obtain a QoS priority matched to it, thereby performing more accurate memory access QoS control and further optimizing the computing performance of the AI system. For example, if the access requests corresponding to a certain QoS ID already occupy a large amount of memory bandwidth, the QoS priority of that QoS ID may, for reasons of balance, be lowered to even out the bandwidth occupied by each QoS ID; if the access requests corresponding to a QoS ID currently occupy little memory bandwidth, the QoS priority of that QoS ID may be raised to make up its memory access bandwidth share.
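The regulation described in the preceding two paragraphs can be condensed into the following sketch (the thresholding rule and the 3-bit priority range are assumptions; the exact formula is left open here): the recorded bandwidth of a QoS ID is compared with its configured ceiling and floor, a floating adjustment is derived, and it is added to or subtracted from the second QoS priority to yield the first.

```c
/* Hypothetical sketch of MATA priority regulation: derive a floating
 * adjustment from history-versus-policy and apply it to the second QoS
 * priority to obtain the first. Widths and limits are assumptions. */
#include <stdint.h>

struct qos_policy {
    uint32_t max_bandwidth_mbps; /* configured ceiling for this QoS ID */
    uint32_t min_bandwidth_mbps; /* configured floor for this QoS ID */
};

static struct qos_policy policy[64];
static uint32_t hist_bw_mbps[64]; /* history summed over all N controllers */

static uint8_t mata_first_priority(uint8_t qos_id, uint8_t second_prio)
{
    int floating = 0;
    if (hist_bw_mbps[qos_id] > policy[qos_id].max_bandwidth_mbps)
        floating = -1; /* over its ceiling: demote to rebalance */
    else if (hist_bw_mbps[qos_id] < policy[qos_id].min_bandwidth_mbps)
        floating = +1; /* under its floor: promote to compensate */

    int prio = (int)second_prio + floating;
    if (prio < 0) prio = 0;
    if (prio > 7) prio = 7; /* assume a 3-bit priority field */
    return (uint8_t)prio;
}
```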
In one possible implementation, the MATA is further configured to: preset the access policy control parameters corresponding to each QoS ID, and count and record the historical memory bandwidth corresponding to each QoS ID; and update and optimize the access policy control parameters corresponding to each QoS ID according to real-time monitoring information on the access performance of the AI system. Optionally, the updating and optimization may be performed through an optimization algorithm or an adaptive machine learning algorithm.
In the embodiment of the application, the MATA configures corresponding access policy control parameters for each QoS ID and also counts and records the historical memory bandwidth of each QoS ID, so that it can decide, from these two pieces of information, whether to raise or lower the QoS priority relative to the initial priority of a given QoS ID, and finally determine the final QoS priority of the QoS ID carried in the access request, allowing the memory controller to perform specific access QoS control according to that final priority. For example, the MATA may set the highest bandwidth, lowest bandwidth, access priority, etc., with which access requests carrying a certain QoS ID are allowed to pass.
In one possible implementation, the MATA is further configured to: carry the first QoS priority in the access request, and schedule the access request to the target memory controller based on the first QoS priority. Optionally, in addition to the first QoS priority carried in the access request, the MATA may also send the QoS ID to the memory controller.
In the embodiment of the present application, after the MATA determines the final priority of the access request, it may carry that final priority (i.e., the first QoS priority) in the access request and send it to the corresponding memory controller, so that the memory controller can perform access QoS control according to the first QoS priority. In addition, the MATA may schedule the access request to the target memory controller based on the first QoS priority. Optionally, if the MATA carries both the first QoS priority and the QoS ID in the access request sent to the memory controller, the memory controller can make its access QoS control decision jointly from the two: for example, it may use the QoS ID to tally the historical memory bandwidth that requests carrying that QoS ID have occupied on itself, and further optimize the access QoS control accordingly.
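For concreteness, such a request can be pictured as carrying these QoS-related fields; the structure and field names below are assumptions of this sketch, not the actual bus transaction format:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class AccessRequest:
        """Illustrative memory access request as forwarded by the MATA."""
        address: int
        is_write: bool
        qos_id: int                   # traffic-flow label carried by the task
        first_qos_priority: int       # final priority determined by the MATA
        data: Optional[bytes] = None  # payload for write requests

    # The MATA stamps the final priority (and, optionally, the QoS ID) into
    # the request before scheduling it to the target memory controller:
    req = AccessRequest(address=0x8000_0000, is_write=False,
                        qos_id=4, first_qos_priority=2)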
Optionally, in another possible implementation, the AI SoC further includes a MATA, where the MATA is configured to: carry the determined first QoS priority in the access request, and schedule the access request to the target memory controller based on the first QoS priority. That is, the MATA may carry, in the access request, a first QoS priority determined either by itself or by another module in the AI system, and schedule the request to the corresponding target memory controller based on that priority. For example, the first QoS priority may be the final QoS priority obtained by the MATA after adjusting and optimizing the initial QoS priority carried in the access request; an initial priority configured by the MATA for the QoS ID in the request; or a QoS priority configured for that QoS ID by another module (such as the host or the relevant Master) and notified to the MATA, which then stores it and carries it in the access request.
In the present application, after the access requests in each subsystem of the AI SoC 20 are scheduled by the SoC bus to the corresponding memory controllers, the question becomes how a memory controller specifically performs access control on a received access request. The AI system in the embodiment of the present application therefore further covers how the memory controller performs the access QoS control function. In connection with some embodiments of the present application, the following describes in detail how the AI system 01 or 02 provides appropriate access QoS control for different access requests.
In one possible implementation, the target memory controller is specifically configured to: receive the access request and determine the first QoS priority corresponding to the QoS ID; and perform access QoS control on the access request based on the first QoS priority corresponding to the QoS ID, in combination with the access service condition of the target memory controller, where the access service condition includes a memory access timing requirement or the memory bandwidth bus utilization. For example, the access service condition may include the read/write timing requirements of the DDR controller (when multiple access requests need to access the same memory controller simultaneously, the corresponding timing requirements must be satisfied); or the DDR bus bandwidth utilization, i.e., the access efficiency: for instance, to keep DDR bus utilization high, the memory controller may, among multiple concurrent requests, preferentially process data in the same bank and same row; or certain read/write rules; or the read/write condition of the memory controller itself, and so on.
In the embodiment of the application, after an access request passes through the MATA and is scheduled by the SoC bus to a memory controller, the memory controller can perform access QoS control on it according to the final QoS priority (i.e., the first QoS priority) carried in the request, in combination with the controller's current service condition. That is, when performing access QoS control, the memory controller considers not only the QoS priority finally generated by the MATA for each QoS ID but also its own current service condition (for example, the access timing requirements on the controller, the memory bandwidth bus utilization, etc.), enabling more accurate access QoS control and further optimizing and improving the computing performance of the AI system. Optionally, when the received access request also carries a QoS ID, the memory controller may use the QoS ID to tally the historical memory bandwidth that requests carrying that QoS ID have occupied on itself, and further optimize the access QoS control according to that historical memory bandwidth.
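A toy model of such an arbitration decision is sketched below, assuming the controller orders requests primarily by the first QoS priority and secondarily by row-buffer hits (same bank, same row) to keep DDR bus utilization high; the function and data layout are illustrative only:

    def pick_next(pending, open_rows):
        """Select the next request to service (illustrative only).

        pending:   list of (first_qos_priority, bank, row, request) tuples
        open_rows: dict mapping bank -> currently open row

        Primary key: smaller priority value wins (higher QoS priority).
        Secondary key: prefer requests hitting an already-open row, since
        avoiding a precharge/activate cycle raises DDR bus utilization.
        """
        def key(entry):
            prio, bank, row, _ = entry
            row_hit = 0 if open_rows.get(bank) == row else 1
            return (prio, row_hit)
        return min(pending, key=key)

    pending = [(3, 0, 12, "reqA"), (3, 0, 7, "reqB"), (5, 1, 7, "reqC")]
    print(pick_next(pending, open_rows={0: 7}))
    # -> (3, 0, 7, 'reqB'): same priority as reqA, but a row-buffer hit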
In one possible implementation, the target memory controller is further configured to: when the number of access requests received by the target memory controller is greater than a preset threshold, broadcast a back-pressure indication to the M subsystems, where the back-pressure indication instructs one or more of the M subsystems to delay, reduce, or stop sending access requests.
In the embodiment of the present application, when the number of access requests received by a certain memory controller becomes excessive, the relevant subsystems may be instructed to reduce, delay, or even stop the access requests currently being sent. After receiving the indication, a subsystem may adjust its sending of access requests according to its own situation, for example by suspending the sending of access requests to the SoC bus, or stopping it altogether.
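The back-pressure behaviour can be sketched as follows; the queue threshold, class names, and callback are assumptions for illustration:

    class MemoryController:
        """Illustrative back-pressure behaviour; names and threshold assumed."""
        def __init__(self, subsystems, threshold=256):
            self.queue = []
            self.subsystems = subsystems
            self.threshold = threshold

        def accept(self, request):
            self.queue.append(request)
            if len(self.queue) > self.threshold:
                # Broadcast a back-pressure indication to all M subsystems;
                # each one decides whether to delay, reduce, or stop sending.
                for sub in self.subsystems:
                    sub.on_back_pressure()

    class Subsystem:
        def __init__(self, name):
            self.name = name
            self.paused = False

        def on_back_pressure(self):
            # A subsystem reacts according to its own situation, e.g. by
            # suspending further requests to the SoC bus for a while.
            self.paused = True

    subs = [Subsystem("AI"), Subsystem("CPU")]
    mc = MemoryController(subs, threshold=2)
    for i in range(3):
        mc.accept(("req", i))
    print([s.paused for s in subs])  # [True, True] once the threshold is crossed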
In the present application, when a Master in an AI SoC 20 subsystem receives a computing task to be executed, the task already carries a corresponding QoS ID. The AI system in the embodiment of the present application therefore further covers identifying computing tasks and allocating and attaching a corresponding QoS ID to each of them. In connection with some embodiments of the present application, the following describes in detail how the AI system 01 or 02 assigns and attaches a corresponding QoS ID to each computing task before distributing the task to be executed to the target Master.
In one possible implementation, the AI system in the embodiment of the present application further includes a host, where the host is configured to:
receive a task to be executed and split it into one or more computing tasks to be executed; identify, according to a preset traffic label table, the traffic flow type of each of the split computing tasks, where the preset traffic label table includes a mapping between predefined traffic flow types of computing tasks and QoS IDs; and attach the corresponding QoS ID to each of the one or more computing tasks according to the identification result.
Specifically, the host 10 first splits a complete task to be executed into computing tasks that the individual hardware units (Masters) can understand, for example splitting an entire AI task into matrix computing tasks, scalar computing tasks, vector computing tasks, and so on. The host 10 side then marks all of the split computing tasks with access QoS tags, i.e., matches each with an appropriate QoS ID, for example through an identification module in the host system (Host System) that assigns a suitable QoS ID to each computing task handed to the AI system. In the embodiment of the present application, the host 10 may allocate QoS IDs to computing tasks according to the traffic flow types to which they belong (for the classification of traffic flows, refer to the description of the embodiment corresponding to Table 2, not repeated here); that is, computing tasks belonging to the same traffic flow carry the same QoS ID, and computing tasks in different traffic flows carry different QoS IDs. It should be noted that when the computing tasks are subsequently allocated, the system scheduler 200 looks at the task type rather than the QoS ID carried in the task: by identifying the task descriptor carried in the computing task (what operation it specifically performs, e.g., ordinary computation or image-processing computation), the system scheduler 200 selects a suitable subsystem and a suitable Master for each computing task according to a preset scheduling principle (for example, when multiple Masters are selectable, a relatively idle Master may be chosen). In summary, when the host 10 allocates a QoS ID to a computing task, it relies on the traffic flow type to which the task belongs; when the system scheduler 200 assigns computing tasks to the subsystems, it relies on the task descriptor, i.e., the task type, carried by the task. Both processes involve identifying the computing task, but their identification rules and criteria differ. In the embodiment of the application, the AI system comprises a plurality of subsystems and memory controllers that can execute the computing tasks, and a host that uniformly receives the various computing tasks issued by users. The host can identify and mark the traffic flow types in the AI network model, i.e., give computing tasks under different traffic flows different access QoS tags (QoS IDs), so that the rest of the AI system can subsequently apply reasonable, matched access QoS control to the computing tasks according to their QoS IDs, finally achieving load balance of memory access across the whole AI system and improving its overall execution performance and efficiency.
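As an illustration of this tagging step, the sketch below uses a fragment of a Table-2-style traffic label table; only the QoS ID/priority pairs explicitly given in the text are included, and the helper classify() is an assumed stand-in for the host's identification module:

    # Fragment of a Table-2-style traffic label table:
    # traffic-flow type -> (QoS ID, initial/second QoS priority).
    TRAFFIC_LABEL_TABLE = {
        "control_flow":                 (1, 1),
        "intra_layer_feature_map_comm": (2, 2),
        "data_parallel_param_prefetch": (3, 3),
        "feature_map_sharing":          (4, 3),
        "feature_map_prefetch":         (5, 3),
        "embedding_read_write":         (6, 3),
    }

    def tag_computing_tasks(tasks, classify):
        """Attach a QoS ID (and initial priority) to each split computing task.

        classify(task) is an assumed helper returning the traffic-flow type
        a task belongs to; tasks in the same traffic flow get the same QoS ID,
        tasks in different flows get different ones.
        """
        for task in tasks:
            flow = classify(task)
            task["qos_id"], task["second_qos_priority"] = TRAFFIC_LABEL_TABLE[flow]
        return tasks

    tasks = [{"name": "conv_fwd", "flow": "feature_map_sharing"},
             {"name": "param_prefetch", "flow": "data_parallel_param_prefetch"}]
    print(tag_computing_tasks(tasks, classify=lambda t: t["flow"]))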
Further, in one possible implementation, the AI SoC further includes a system scheduler, and the host is further configured to: send the one or more computing tasks carrying their corresponding QoS IDs to the system scheduler. In the embodiment of the present application, after the host in the AI system identifies the traffic flows and attaches the QoS IDs, it may send the computing tasks carrying the QoS IDs to the system scheduler on the AI SoC for subsequent allocation. That is, after the host splits, identifies, and labels the tasks to be executed, it issues the processed computing tasks to the system scheduler, which can then conveniently schedule and distribute the computing tasks that are already labeled (i.e., that carry matched QoS IDs).
Optionally, the preset traffic label table used by the host 10 in identifying the traffic flow type of a computing task can distinguish priorities between traffic flows that compete with one another. For example, the traffic flow types corresponding to different AI network models may differ, so the identification process may differ as well. In addition, the traffic flows running concurrently in different AI network models may differ, so each AI network model may have its own dedicated traffic flow classification and its own matching between traffic flow (QoS ID) and QoS priority, formed according to that classification and the concurrency situation. A computing task (subtask) in this application may be a piece of code or a thread; a thread may have several steps, and each step may invoke different functions. For example, some computing tasks normalize pictures, some perform matrix operations, and some do additions and subtractions, targeted respectively at a GPU, an NPU, and a communication module; all of these computing tasks may belong to the same traffic flow, carry the same QoS ID, and have the same access QoS priority. It can be understood that before a given AI network model is run, one can identify from the model which traffic flows it divides into and whether they execute concurrently, and then, according to the influence of the different traffic flows on the system, decide which traffic flows should correspond to high memory QoS priorities and which to low ones. As shown in fig. 2A, a schematic diagram of the relationship between traffic flows, graph nodes, and computing tasks according to an embodiment of the present application: one execution task may involve multiple traffic flows; one traffic flow may involve multiple graph nodes in the graph running stage; one graph node may in turn include multiple computing tasks; and each computing task is formed from multiple operators. Each computing task is ultimately distributed, according to its task type (i.e., the task type described by the task descriptor in the present application), to the respective Masters in the same or different subsystems for execution.
The following illustrates a preset traffic label table (Table 2) provided by the embodiment of the present application. The table covers the traffic classification of the access data flows sent to the storage device during AI model training or reasoning, together with the corresponding flow direction, QoS ID, initial QoS priority (i.e., second QoS priority), and traffic description. It should be noted that Table 2 shows one possible traffic classification and one possible matching between QoS IDs and QoS priorities for presently known AI network models; for unknown AI network models it may be adapted and adjusted accordingly, which is not enumerated here. The details are shown in Table 2 below:
TABLE 2
In the above Table 2, the traffic flows are classified into the following types: control flow, intra-layer model-parallel feature map communication data flow, data-parallel parameter prefetch flow, feature map sharing flow, feature map prefetch flow, embedded (Embedding) read/write flow, data-parallel parameter global reduction (All Reduce) flow, AI CORE computation flow, CMO operation flow, general CPU computation flow, image/video accelerator sample flow, and the like. In addition, according to their flow direction, the traffic flows are divided into two directions, D2D and H2D; please refer to fig. 2B, a schematic diagram of traffic flow directions provided by an embodiment of the present application. As shown in fig. 2B, H2D flows are traffic data transmitted between a Host and a Device (accelerator card), while D2D flows are traffic data transmitted between different Devices (accelerator cards), which also includes the traffic of each accelerator accessing local memory in the SoC on the same Device. These traffic data in different flow directions eventually reach the memory controller of each SoC via complex on-chip buses or via interconnect technologies such as RDMA, PCIE, or the HCCS bus, and become individual read/write access request transactions to the memory units (DDR/HBM). Studying different AI network models shows that the read/write request types generated toward the storage system by AI networks of different types and deployment scales (such as resnet, yolo, Wide & Deep, GPT2, GPT3, etc.) are basically covered by the above table. It can be understood that the traffic models of different AI network models may differ during operation, but their traffic flows can still be classified in the above manner, and each traffic flow has a specific role, so the above classification method, or its core idea, can be applied to any network model. The division of traffic flows may be performed before the corresponding AI network model is run; for example, the traffic flow to which a computing task belongs may be identified beforehand by determining which memory access operations its computation involves, which functions it calls, which APIs it calls, and so on.
Further, as can be seen in Table 2, different traffic flows may be allocated different class labels, i.e., QoS IDs: the control flow, intra-layer model-parallel feature map communication data flow, data-parallel parameter prefetch flow, feature map sharing flow, feature map prefetch flow, embedded (Embedding) read/write flow, data-parallel parameter global reduction (All Reduce) flow, AI CORE computation flow, CMO operation flow, general CPU computation flow, and image/video accelerator sample flow correspond to the QoS IDs 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, and 11 respectively. The QoS priorities corresponding to different QoS IDs may be the same or different, depending on whether there is memory access contention between the flows. For example, in Table 2, the control flow has QoS ID 1 and QoS priority 1; the intra-layer model-parallel feature map communication data flow has QoS ID 2 and QoS priority 2; the data-parallel parameter prefetch flow has QoS ID 3, the feature map sharing flow QoS ID 4, the feature map prefetch flow QoS ID 5, and the Embedding read/write flow QoS ID 6, yet the QoS priorities of these four different traffic flows are all 3 (in Table 2, the smaller the QoS priority value, the higher the level). It can thus be seen that different traffic flows have different QoS IDs, but different QoS IDs may correspond to the same QoS priority or to different ones, depending on whether the traffic flows contend for memory access: if they contend, different traffic flows can be given different QoS priorities so that computing tasks under traffic flows with different requirements receive different QoS service and excessive contention is avoided; if they do not contend, different traffic flows can also share the same QoS priority.
The following illustrates how the corresponding traffic classes can be obtained by analyzing an AI network model in the present application. Referring to fig. 2C, a schematic diagram of the relationship between traffic type and memory bandwidth during the operation of a resnet network according to an embodiment of the present application: fig. 2C shows the various traffic flows of the resnet network at run time, including forward computation (FP), backward computation (BP), gradient aggregation, parameter updating, and application traffic flows. As can be seen from fig. 2C, when the stage1 gradient aggregation flow (1671 GB) overlaps with the backward computation flow (672 GB), the total exceeds the 2000 GB (2 TB) of bandwidth that the HBM can provide. In this case, to ensure the best performance of the whole system, different bandwidths and access QoS priorities need to be allocated to the gradient aggregation and backward computation flows respectively; for example, in the case shown in fig. 2C, the embodiment of the present application prioritizes memory access bandwidth for gradient aggregation, so as to reduce the tail time of the AI computation and keep the delay of each iteration round as short as possible. By contrast, without effective management and control of the memory access behavior and priority of these traffic flows, competition among them for precious memory bandwidth is hard to avoid, and the resulting performance jitter is difficult to control and optimize.
The embodiment of the application can classify and identify the traffic flows of an AI network model and then assign different traffic flow labels (i.e., QoS IDs) to the different traffic flow types. On this basis, it offers customers a simple technical means to tune, for their actual AI network's traffic model and based on the hardware capability of the AI SoC in the AI system, a set of QoS configuration parameters that fits their actual AI network model for the different traffic flows, ensuring that customers obtain the best model training performance on the AI platform and helping them release the maximum computing power of the hardware.
Based on the hardware architecture of the AI system and the functions of each of its components, the embodiment of the application also provides a software stack framework (the Davinci software stack) running on that hardware architecture, which can be used to concretely implement the corresponding functions of the AI system in the present application. It should be understood that the Davinci software stack is only one possible AI platform or software architecture (software stack) for implementing the AI system of the present application, and is not intended to limit the AI platforms or software stacks to which the AI system of any embodiment of the present application is applicable. The following describes the software modules in the software stack framework, their functions, and the software flows involved, by way of example.
Referring to fig. 3A, a schematic diagram of the framework of the Davinci software stack according to an embodiment of the present application: the framework is mainly divided into a HOST side (corresponding to the host 10 side in the present application) and a DEVICE side (corresponding to the AI SoC 20 side in the present application). The software modules inside the HOST side and the DEVICE side and their corresponding functions are described in turn below. The software modules related to the QoS access control function in the AI system of the application may specifically include the following:
(I) HOST side
1. Graph generation engine_model deployment subsystem_TensorFlow adaptation layer (GE_MDS_TFD)
The graph generation engine (GE) modules can, based on model graph context information, the labels in training scripts, and the operator types used by each sub-graph, identify the different traffic flows in the training process, such as forward computation, backward computation, and collective communication, label them with the traffic flow labels predefined by the framework, and then pass them to GE/HCCL.
2. Graph generation engine/graph fusion engine/collective communication library (GE/FE/HCCL)
GE/FE/HCCL calls the API interface provided by the QoS Manager module according to the traffic flow label passed in by the graph switching and model deployment subsystem (Model Distribution Subsystem, MDS), obtains the QoS ID and QoS priority corresponding to that label, and writes this information into the task queue descriptors (Send Queue Element, SQE) of the tasks assembled by RUNTIME; finally, RUNTIME delivers the SQEs of the tasks into the runtime send queue (RunTime Send Queue, RTSQ).
3. Quality of service management library (libQoSManager)
LibQoSManager is a dynamic library that, based on the global QoS planning table, provides GE/HCCL/DVPP and other services with an application interface to the QoS resources of the QoS configuration items for their different data flows; GE/HCCL supplies a traffic flow label and calls this interface to obtain the QoS ID and QoS information corresponding to that label in the global planning table.
4. Quality of service auto-tuning tool (QoSAutoAdjustTools)
QoSAutoAdjustTools is a command-line tool running on the Host side. Its main functions include:
(1) Query the QoS configuration information of all Devices managed by the Host server, or of a designated Device; this configuration information is consistent with the QoS registers of the MATA. The QoS priority, bandwidth high/low waterlines, and hardware limit (hardlimit) corresponding to each PARTID, together with the name and label of the traffic flow corresponding to each QoS ID, and so on, are displayed as a list. This information can be shown on the command line or saved to a designated file.
(2) Query the actual statistical traffic of all QoS IDs of a designated Device or of all Devices; a thread must be started to periodically issue commands to the QoS Monitor driver on the Device side to obtain real-time data. After the data is acquired, it can be saved to a designated file or displayed on the command line.
(3) This function is mainly used in the debugging stage and can help a user quickly tune the QoS configuration and try out the benefits of different QoS schemes. After QoS tuning experience has been accumulated, an automatic QoS tuning algorithm is also implemented in the tool.
(4) Support issuing, in one operation, the bandwidth and QoS priority corresponding to all QoS IDs in the QoS global planning table to the QoS driver on the Device side, which then configures them into the MATA registers.
(5) Based on a GA/RL automatic optimization algorithm, adapt to the various NN training networks and the actual HBM bandwidth of the equipment purchased by users, and automatically search for the optimal QoS priority and bandwidth configuration of each traffic flow.
The implementation of the tool's functions depends on the newly added QoS API interface provided by the Device State Management Interface (DSMI) on the Host side; through the forwarding of this interface and the Device's DSMI driver framework, the corresponding functions are finally implemented by the QoS driver on the Device side.
5. QoS device state management interface (QoS_DSMI)
DSMI is a Device management application programming interface (Application Programming Interface, API) common to the D-series chips. Its bottom layer communicates with the DSMI driver framework on the Device side through the communication channel between the Host and the Device (Host Device Communication, HDC), and the DSMI driver framework invokes the QoS driver on the Device side to implement the related configuration-issuing and status-query functions. With this mechanism, the management of Device-side QoS-related functions can be greatly simplified. Since QoS-related configuration issuing and status querying are new functions, new interfaces and function support need to be added in DSMI.
(II) DEVICE side
1. Device state management interface driver framework (DSMI_DRV_FRAMEWORK)
The DSMI driver framework is a set of general-purpose driver modules running in kernel mode. It provides an easily extensible implementation of DSMI command forwarding and a kernel-mode command registration interface, making it convenient for newly added driver modules to implement DSMI interfaces. The QoS driver only needs to register with the framework the handler functions for the new commands QoS requires; after receiving QoS-related configuration and query commands from the HOST side, the framework automatically forwards them to the callback handler functions registered in advance by the QoS module.
2. Quality of service Driver (QoS Driver) module
The QoS Driver module is a kernel-mode QoS driver deployed on the Device side. It establishes a communication channel with the QoS Host Driver on the Host side through the kernel-mode DSMI interface to complete the main QoS management functions, and mainly comprises the following four sub-modules:
(1) Device state management interface hook (DSMI_hook) module
The DSMI_hook module is responsible for implementing the various QoS configuration and query commands from the Host side and registering these command implementation interfaces into the DSMI driver framework. When the DSMI driver framework receives a QoS-related command, it automatically calls the callback functions registered in advance by QoS to complete the query or configuration.
Of particular note is the command that configures bandwidth for a given QoS ID: in a virtualized scenario, it must return "command not supported". This is because in the virtualization scenario the bandwidth configuration is computed automatically by the QoS driver from the computing power of each virtual function unit (Virtual Function, VF), not configured by the user through the DSMI interface.
For the bandwidth query command in the virtualization scenario, the virtual function unit number (Virtual Function ID, VFID) carried in the DSMI command must first be converted into the QoS ID corresponding to that VFID, and the bandwidth corresponding to the QoS ID is then looked up in the monitor maintained by the QoS driver, because the physical QoS ID is not visible to the end virtual machine user in the virtualized scenario.
(2) Input-output control (IOCTL) module
The IOCTL module is responsible for encapsulating the QoS driver as a character platform device driver and providing an IOCTL interface to handle user-mode calls from the virtual machine on the DEVICE side in the virtualized scenario, for example when the QoS ID of a process needs to be configured.
(3) Quality of service configuration (QoS Config) module
The QoS Config module is mainly responsible for directly configuring the QoS corresponding to each QoS ID applied on the Host side into the MATA registers. It also handles the QoS query commands of the QoS tools on the Host side, returning the QoS configuration of all QoS IDs on the Device to the Host-side QoS tools through the DSMI interface for display to the user.
(4) Quality of service Monitor (QoS Monitor)
This module mainly implements two functions:
① Collecting the actual HBM bandwidth traffic of each QoS ID issued to the Device side;
② Performing statistical calculation on the collected bandwidth traffic data and providing a query interface for the Device-side adaptive adjustment algorithm and the Host-side QoS tools, so that the bandwidth statistics can be queried periodically.
Based on the above introduction of the HOST-side and DEVICE-side software modules in the framework of the Davinci software stack and their functions, the following describes the interaction flow of the software modules related to the QoS access control function in the AI system of the present application. Referring to fig. 3B, a schematic diagram of the interaction flow between the software modules in the Davinci software stack provided by an embodiment of the present application, the flow may specifically include the following:
(1) After the QoS driver module is loaded, it creates a character device to provide an IOCTL interface for the user-mode device management protocol (Device Manage Protocol, DMP) program or virtual machine processes, through which the QoS driver is controlled to execute QoS configuration and query commands. Note that in a virtualized scenario it must be possible for multiple processes to open the QoS device driver simultaneously, and there may be concurrent IOCTL commands;
(2) The kernel DSMI driver framework performs its internal initialization;
(3) The QoS driver module registers its QoS-related command handler hook functions with the DSMI driver framework;
(4) The libQoSManager module provides an initialization interface, which the NPUTOOL tool calls to perform QoS configuration initialization;
(5) In its initialization function, the libQoSManager module parses the configuration values of each QoS ID in the QoS global configuration table and loads them into memory;
(6) The QoS configuration of each QoS ID is then packaged into a command message and issued to the Device side through the DSMI interface;
(7) The DSMI module implements the transmission interface for the QoS configuration information; the bottom layer is implemented over HDC communication, and after the HDC transfer the message reaches the DSMI driver framework on the DEVICE side;
(8) The DSMI driver framework on the DEVICE side parses the message, recognizes it as a QoS-related command, and calls the callback hook function provided by the QoS device driver;
(9) The QoSHook module interprets the QoS IDs in the configuration message and invokes the MataConfig interface one by one to configure the bandwidth and QoS value of each QoS ID into the hardware registers of the MATA (see the sketch after this flow);
(10) Each QoS ID is added into the Monitor of the MATA so that the actual HBM bandwidth of each QoS ID can be counted;
(11) In the initialization function of libQoSManager, the DSMI interface must also be invoked to configure the QoS ID and QoS for the RoCE engine; RoCE supports configuring one QoS ID and QoS per PF, and the RoCE of the 1981 supports at most two PFs;
(12) In the initialization function of libQoSManager, the DSMI interface must likewise be invoked to configure the QoS ID and QoS of each PCIE lane. PCIE supports 48 independent lanes, each of which can be configured with a QoS ID and QoS; the QoS IDs and QoS of the PCIE lanes used by the data plane and the management plane of the 1981 must be configured separately, and the multiple PCIE lanes of the data plane use the same QoS ID and QoS configuration;
(13) The MDS module in GE is responsible for identifying the different traffic flows and passes the identified traffic flow labels (i.e., QoS IDs), together with the information on which DEVICE and which DIE each traffic flow will run, to the GE execution framework during graph loading;
(14) The GE execution framework applies for different QoS IDs for the different traffic flows (collective communication, AI computation, DVPP, etc.) according to the traffic flow labels and DEVICE ID information passed in by the MDS, and stores the QoS IDs in its internal context;
(15) GE loads operators through the interface provided by RUNTIME, carrying the QoS IDs and QoS information in the tasks; after RUNTIME receives a task, it calls the relevant operator interfaces based on the operator type, QoS ID and QoS priority value, the SQE format of the hardware, and other task information, and constructs the SQE;
(16) After the SQE is constructed, the task is issued to the RTSQ, the doorbell is rung, and STARS is triggered to execute task scheduling;
(17) When the QoS-related configuration needs to be reloaded, the UninitQoSLib interface provided by libQoSManager must be invoked, for example by the NPU TOOL;
(18) In the UninitQoSLib interface, libQoSManager parses the standard Memory System Resource Partitioning and Monitoring (MPAM) configuration of each destination DEVICE in the QoS global configuration table, packages it into a clear-QoS-configuration message, and passes it to the DSMI interface;
(19) DSMI passes the message, by way of its generic mechanism, to the IOCTL sub-module of the QoS Driver on the DEVICE side;
(20) The IOCTL sub-module parses the message, deletes each QoS ID from the Monitor, and clears the statistical data of the QoS IDs in memory;
(21) The IOCTL sub-module parses the message and clears each QoS ID from the MATA's configuration registers;
(22)-(23) For the RoCE and PCIE lanes, a similar DSMI interface implementation must be provided; the difference is that, finally, it is the QoS ID and QoS configuration information for these Masters that must be cleared in the RoCE and PCIE drivers themselves, rather than the bandwidth waterline and QoS priority configuration assigned in the MATA to the QoS IDs corresponding to these Masters.
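To make the configuration path of steps (5) through (10) concrete, here is a compressed, purely illustrative model in Python; the message format and function names (e.g. qos_hook_config) are assumptions, and the MATA registers and monitor are mimicked with ordinary containers:

    MATA_REGISTERS = {}   # stand-in for the MATA hardware configuration registers
    MATA_MONITOR = set()  # stand-in for the MATA bandwidth monitor

    def host_issue_qos_config(global_table):
        """libQoSManager side: parse the global table, package a command message."""
        message = [{"qos_id": qid, "bandwidth": bw, "qos": prio}
                   for qid, (bw, prio) in global_table.items()]
        device_dsmi_dispatch({"cmd": "QOS_CONFIG", "body": message})

    def device_dsmi_dispatch(message):
        """DSMI driver framework side: recognize a QoS command, call the hook."""
        if message["cmd"] == "QOS_CONFIG":
            qos_hook_config(message["body"])

    def qos_hook_config(body):
        """QoS hook side: configure each QoS ID into the MATA, add to the monitor."""
        for entry in body:
            MATA_REGISTERS[entry["qos_id"]] = (entry["bandwidth"], entry["qos"])
            MATA_MONITOR.add(entry["qos_id"])

    host_issue_qos_config({1: (100.0, 1), 2: (200.0, 2)})
    print(MATA_REGISTERS)  # {1: (100.0, 1), 2: (200.0, 2)}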
Based on the above classification of traffic flows and the corresponding QoS matching method, the present application also provides a method for identifying traffic flow types built on that classification, which can cope with the endless variety of AI network models and the different framework platforms in industry (such as the TensorFlow, Pytorch, and MindSpore frameworks). Accurate and reliable identification of the various traffic flow types in an AI network model is the basis for configuring appropriate QoS parameters (i.e., QoS IDs) for the traffic flows in the subsequent stage. Below, the embodiment of the present application takes the TensorFlow framework as an example and describes the identification of the various traffic types of an AI network model in combination with the Davinci computing platform software architecture described above.
Traffic flow type identification in the embodiment of the application means marking and classifying, during the graph compiling/graph optimizing stage of the AI network model, the types of data accesses that each computation or communication node on the graph will issue in the graph execution stage. The embodiment of the application separates the compiling stage from the running stage: as shown in fig. 4A, a schematic diagram of the graph compiling and running stages provided by an embodiment of the application, abstract traffic flow type labels are used to classify and label the different traffic flows during graph generation, graph optimization, and compilation of the AI network, and these abstract labels are converted into QoS IDs recognizable by the physical hardware, via a table lookup, before the graph is loaded onto the device for execution. For example, in fig. 4A, in the graph compiling stage the task is formed from graph nodes A, B, C, D, and E, each of which may carry a corresponding abstract traffic flow type label (QoS Label) at compile time, the QoS value being the QoS priority of the corresponding traffic flow; in the graph running stage, after the table lookup, the traffic flow type label (QoS Label) carried in each graph node is replaced with a QoS ID recognizable by the AI system.
This approach conveniently achieves the goal of separating model compilation from model execution, since model compilation is usually completed on the user's general-purpose CPU system (such as the host in the present application) while model execution is completed on a dedicated AI computing SoC (such as the AI SoC 20 in the present application). Moreover, by implementing this architecture that separates abstract labels from physical QoS IDs, users can be presented with easily understood traffic flow categories without being aware of the hardware QoS IDs tied to the physical implementation, and the platform framework need not expose its low-level physical implementation to ordinary users. This improves usability and system security, and lets the platform framework evolve later without users noticing. In the traffic flow identification method of the embodiment of the application, after the model is constructed, the GE module converts it into generic AI graph nodes, and several modules of the AI graph compiling stage then cooperate to complete the identification; the modules involved are mainly the TensorFlow framework, the graph generation engine (GE), the graph fusion engine (FE), and the graph switching and model deployment subsystem (Model Distribution Subsystem, MDS) in the Davinci platform.
When the Python scripting language is used to construct an AI network model, the TensorFlow framework provides a feature called the user-defined scope attribute: when writing a script, the user can have the framework automatically add a given attribute value to all graph nodes created in the computation within a specified range. For example, fig. 4B shows a script for constructing the AI model resnet according to an embodiment of the present application, where line 106 of the script is computed with the QoS Label of 1 specified at line 105, and lines 119 to 124 use the QoS Label of 2 specified at line 118. These QoS Label values are predefined enumerated values; for example, on the Davinci platform, the enumerated values for the various traffic types are defined as follows:
The corresponding values of QoSServiceLableType are defined by QoSManager.
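The concrete definition is platform-internal; purely as an illustration, such an enumeration might look as follows, reusing the traffic-flow types of Table 2 (the member names and exact values are assumptions of this sketch):

    from enum import IntEnum

    class QoSServiceLableType(IntEnum):
        """Assumed shape of the QoSManager-defined traffic-flow enumeration."""
        CONTROL_FLOW = 1
        INTRA_LAYER_FEATURE_MAP_COMM = 2
        DATA_PARALLEL_PARAM_PREFETCH = 3
        FEATURE_MAP_SHARING = 4
        FEATURE_MAP_PREFETCH = 5
        EMBEDDING_READ_WRITE = 6
        ALL_REDUCE = 7
        AI_CORE_COMPUTE = 8
        CMO_OPERATION = 9
        GENERAL_CPU_COMPUTE = 10
        IMAGE_VIDEO_SAMPLE = 11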
In the graph execution stage, the Davinci platform framework invokes an internal table-lookup function to convert the enumerated value into a QoS ID that the hardware can identify and process; all memory access requests then carry this QoS ID, so that the bottom-most memory access controller (i.e., the memory controller in the present application) determines the final execution policy of each access request according to the software-configured QoS policy and other statistical information.
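The table-lookup conversion can be pictured as follows; the mapping fragment and function name are assumptions of this sketch:

    # Illustrative table-lookup conversion from an abstract traffic label to a
    # hardware-recognizable QoS ID at graph load time.
    LABEL_TO_QOS_ID = {"CONTROL_FLOW": 1, "INTRA_LAYER_FEATURE_MAP_COMM": 2,
                       "FEATURE_MAP_PREFETCH": 5}

    def resolve_qos_id(qos_label: str) -> int:
        """Convert an abstract QoS Label into the physical QoS ID.

        Keeping the abstract label separate from the physical QoS ID is what
        lets the compile stage run on the Host while execution happens on the
        AI SoC, without exposing the hardware mapping to ordinary users.
        """
        return LABEL_TO_QOS_ID[qos_label]

    # Every memory access request issued by the node then carries this QoS ID.
    print(resolve_qos_id("FEATURE_MAP_PREFETCH"))  # -> 5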
Note that marking QoS types for different traffic flows in the script language via scope tags is only applicable to the TensorFlow framework and not necessarily to other frameworks. To solve this problem and provide a more general approach, the Davinci platform offers users, through a Python high-level API, the ability to specify a scope and a QoS label within a designated computation. Moreover, in the Davinci platform framework, whatever the AI computing framework, once the model is built it must be parsed into a front-end expression through an intermediate parser and then compiled and optimized through the graph before becoming concrete computing tasks that can finally be executed on the Davinci platform. As shown in fig. 4C, a schematic diagram of the executable computing tasks obtained after graph compilation and optimization provided by an embodiment of the present application: an AI network model is first constructed by a deep-learning development framework such as TensorFlow/Pytorch/ME (corresponding to ① in fig. 4C), then converted into generic AI graph nodes by a graph generation engine in the AI platform, such as the GE module (corresponding to ④ in fig. 4C); the conversion may need optimization by compute engine plugins (Compute Engine Plugins) (corresponding to ② or ③ in fig. 4C), such as the neural network computing architecture CANN, the collective communication library HCCL, and the preprocessing module, and the AI network model is finally resolved into computing tasks that the runtime (Runtime) can execute in the AI system. In compiling and converting the original front-end expression into computing operators tied to the Davinci platform hardware, internal modules of the Davinci platform may merge nodes in the graph, split them, or add communication nodes and other operations, such as Cache flushing. For nodes generated by operations the Davinci platform performs in the background, the platform knows the types of the various operations: newly added nodes inherit the QoS labels of their parent nodes, and for specific operations, such as Cache operations or the model-parallel data communication traffic generated automatically by cross-slice, cross-DIE model deployment, the platform likewise automatically inserts the predefined QoS labels.
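As one way to picture the scope mechanism described above (and in fig. 4B), the following Python sketch defines a hypothetical qos_label_scope context manager; it is not a real TensorFlow or Davinci API, only an illustration of how nodes created within a scope inherit its QoS Label:

    import contextlib

    _current_label = [0]   # default QoS Label, kept as a stack
    _graph_nodes = []      # nodes recorded during graph construction

    @contextlib.contextmanager
    def qos_label_scope(label):
        """Hypothetical scope: nodes created inside get this QoS Label."""
        _current_label.append(label)
        try:
            yield
        finally:
            _current_label.pop()

    def add_node(op_name):
        _graph_nodes.append({"op": op_name, "qos_label": _current_label[-1]})

    with qos_label_scope(1):      # like line 105 in the fig. 4B script
        add_node("conv2d")        # computed with QoS Label 1
    with qos_label_scope(2):      # like line 118
        add_node("all_reduce")    # lines 119-124: QoS Label 2

    print(_graph_nodes)
    # [{'op': 'conv2d', 'qos_label': 1}, {'op': 'all_reduce', 'qos_label': 2}]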
For certain hardware accelerators, such as the ISP, DVPP, ASP, and GPU, the QoS ID is carried directly by the accelerator when it issues a memory access, by way of direct configuration of hardware registers. For the memory access traffic issued by the general-purpose CPU on the AI computing SoC, the ARM CPU's standard Memory System Resource Partitioning and Monitoring (MPAM) mechanism can be adopted, with the Linux operating system providing standard API function interfaces to configure the QoS ID and QoS information of the CPU and its processes.
In the present application, when a Master in an AI SoC 20 subsystem receives a computing task to be executed, the task already carries a corresponding QoS ID, and an initial priority (i.e., a second QoS priority) corresponding to that QoS ID has already been configured. The AI system in the embodiment of the present application may therefore further continuously update and optimize the access parameters matched to the QoS priorities of the different QoS IDs. Some embodiments of the present application below describe how the AI system 01 or 02 updates and optimizes the access parameters of each QoS ID during the operation of the whole system.
In one possible implementation, the host or the target Master is further configured to: configure in advance a corresponding second QoS priority for the QoS ID in the computing task, where the second QoS priority is the initial priority corresponding to the QoS ID. For example, the initial QoS priority matched to each QoS ID may be configured by the host or by a register in the target Master.
In the embodiment of the application, the host side or the target Master also configures an initial, or source, QoS priority (i.e., the second QoS priority) for each computing task, that is, a matched QoS priority for each QoS ID. The relevant modules in the AI SoC can then adjust the subsequent or final QoS priority based on this initial QoS priority.
In one possible implementation, the host is further configured to: update and optimize the second QoS priority corresponding to each QoS ID according to real-time monitoring information on the access performance of the AI system. That is, the initial priority corresponding to a QoS ID can be constantly optimized. Optionally, the updating and optimization may be performed through an optimization algorithm or an adaptive machine learning algorithm.
In the embodiment of the application, the host side of the AI system can also update and optimize the initial QoS priority corresponding to each QoS ID in the system according to the real-time monitoring information on access performance, performing QoS auto-tuning adaptively, for example through an optimization algorithm or an adaptive machine learning algorithm.
The Davinci AI computing platform provided by the above embodiments is mainly applicable to AI training and reasoning scenarios and provides memory access QoS control capability through the platform. In practical application, the QoS control function mainly faces two difficulties. One is how to accurately identify and classify the traffic flows, which is addressed in the embodiments above on the classification and identification of traffic flow types. The other is how to efficiently obtain a set of optimal QoS configuration parameters adapted to the needs of the actual application: the method must provide a general solution for the various traffic flows and hardware accelerators across training networks of different scales, individualized machine computing power configurations, memory bandwidth configurations, and inconsistent AI server cluster scales. To solve this technical problem, the application provides a QoS automatic tuning algorithm based on machine learning. Referring to fig. 5A, a schematic diagram of the software architecture for QoS automatic tuning according to an embodiment of the present application: as shown in fig. 5A, the architecture mainly includes a system performance real-time monitoring module 501, a system working environment and matched parameter input module 502, a QoS optimizing algorithm module 503, a QoS configuration interface and driver 504, an optimizing algorithm termination indicating module 505, and an optimizing algorithm output module 506. Further, referring to fig. 5B, a flow chart of a method for QoS automatic tuning according to an embodiment of the present application, the method flow of fig. 5B is described below in combination with the modules of fig. 5A.
1. The system performance real-time monitoring module 501 can collect, in real time, various key performance data during the running of the AI network on the device, which may specifically include:
(1) The duration of each iteration of the AI network in the training process;
(2) Utilization of key hardware resources, such as AI CORE utilization;
(3) Communication data stream latency in the training process;
(4) Real-time bandwidth of the memory access controllers.
2. The system working environment and matched parameter input module 502 can provide the algorithm with the current working environment parameters via a configuration file, which may specifically include:
(1) The current maximum theoretical bandwidth of the hardware platform and the actually achievable utilization;
(2) The default bandwidth high/low waterlines and QoS values corresponding to the traffic flow types of all QoS IDs of the training network;
(3) The default bandwidth high/low waterlines and QoS values of the CPU, DVPP, ROCE, PCIE, ISP, and GPU.
3. The QoS optimizing algorithm module 503 is configured to perform multiple rounds of iterative search based on a Bayesian machine learning algorithm, collect the system's performance feedback data after each correction of the QoS tuning parameters, and finally output a set of optimal QoS operating parameters for the given operating environment, recorded in a file to serve as the QoS configuration parameters when the system subsequently runs formally.
4. The QoS configuration interface and driver 504 provide a user-mode API interface for the QoS optimizing algorithm, enabling the algorithm to adjust online, in real time, the bandwidth high/low waterlines and QoS priorities configured on the memory access controllers for each QoS ID.
5. The optimizing algorithm termination indicating module 505 is configured to indicate termination of the optimizing algorithm in the following cases:
(1) The iteration-time jitter over N rounds meets the design requirement: jitter within a server does not exceed 0.1 ms, and jitter between servers does not exceed 0.5 ms;
(2) The utilization of key computing resources reaches the expected index, for example 95%;
(3) The throughput of the system reaches the set index, for example resnet reaching 9200 fps.
6. The optimizing algorithm output module 506 is used, after the optimal QoS configuration has been reached through continuous iteration, to output that set of configurations to the QoS global planning configuration file, for example written to the configuration file in a table format.
For example, in connection with the software architecture for QoS automatic tuning provided in fig. 5A, the following describes the interaction between the software modules of that architecture along the method flow of fig. 5B. Referring to the flow chart of the QoS automatic tuning method provided in fig. 5B, the interaction flow may include at least the following steps S501 to S508. Step S501: the QoS optimizing algorithm module 503 reads the environment parameters from the system working environment and matched parameter input module 502 and performs algorithm initialization. Step S502: the QoS parameters of each QoS ID (e.g., the initial QoS priority corresponding to the QoS ID, i.e., the second QoS priority) are issued to hardware (e.g., a memory controller) through the QoS configuration interface and driver 504. Step S503: the system performance real-time monitoring module 501 collects system performance data according to the configured collection interval and other parameters. Step S504: the QoS optimizing algorithm module 503 performs noise filtering (such as Gaussian filtering or median filtering) on the collected performance data. Step S505: judge whether the system performance indices meet the stop condition (for example, the mean square error of the iteration duration over N rounds reaching a minimum or entering a set threshold). Step S506: if the stop condition is met, the optimizing algorithm output module 506 stores the obtained optimal QoS parameters into a result file, and the optimizing algorithm termination indicating module 505 indicates termination of the optimizing algorithm. Step S507: if the stop condition is not met, solutions outside the realistic problem domain are removed (for example, the QoS value range is 0-7, and the bandwidth waterline of any QoS ID cannot exceed the total bandwidth). Step S508: the QoS optimizing algorithm module 503 computes the next set of candidate QoS configuration parameters using a Bayesian prediction algorithm, and the updated QoS parameters of each QoS ID are re-issued to hardware (such as a memory controller) through the QoS configuration interface and driver 504; that is, steps S502 to S505 are executed again, and the relevant QoS parameters of the embodiment of the present application are continuously optimized over the iterations. It can be understood that the above principle and flow of QoS automatic tuning also apply to the process in which the MATA updates and optimizes the access policy control parameters corresponding to each QoS ID in the embodiment of the present application, which is not repeated here.
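A compressed sketch of the S501 to S508 loop is given below; the Bayesian prediction step is left as a caller-supplied function, and all names and the trimming-based noise filter are assumptions of this sketch rather than the actual algorithm:

    import statistics

    def qos_auto_tune(env_params, propose_next, apply_config, sample_perf,
                      max_rounds=100, jitter_ms=0.1):
        """Illustrative S501-S508 tuning loop (not the actual algorithm).

        propose_next(history) stands in for the Bayesian prediction step that
        computes the next candidate QoS configuration (and removes infeasible
        solutions); apply_config issues the per-QoS-ID parameters to hardware;
        sample_perf collects iteration durations from the monitoring module
        (assumed to return at least three samples per round).
        """
        config = env_params["initial_config"]       # S501: initialize
        history = []
        for _ in range(max_rounds):
            apply_config(config)                    # S502: issue QoS params
            durations = sample_perf()               # S503: collect performance
            trimmed = sorted(durations)[1:-1]       # S504: crude noise filter
            score = statistics.pstdev(trimmed)      # iteration-time jitter
            history.append((config, score))
            if score <= jitter_ms:                  # S505/S506: stop condition
                return config                       # best config -> result file
            config = propose_next(history)          # S507/S508: next candidate
        return min(history, key=lambda h: h[1])[0]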
In the present application, when a computing task to be executed is scheduled into the AI SoC 20 by the host 10 side in the AI system 01, the question arises of how each computing task to be executed is allocated to a target Master suitable for executing it inside the AI SoC 20. The AI system in the embodiment of the present application may therefore further include the function of scheduling and assigning each computing task. With reference to some embodiments of the present application, the following describes in detail how the individual computing tasks are scheduled to the appropriate Masters on the appropriate subsystems in AI system 01 or 02.
In one possible implementation, the system scheduler is configured to: receive one or more computing tasks to be executed sent by the host, where each computing task to be executed further carries a task descriptor describing the type of the computing task; select, for each computing task to be executed, a matching subsystem from the M subsystems according to the task descriptor carried in that task, and select a matching Master from the one or more Masters in the matching subsystem; and dispatch each computing task to be executed to the matching Master in the matching subsystem.
In the embodiment of the application, after the Host in the AI system identifies the service flows and attaches the QoS IDs, the computing tasks carrying the QoS IDs are sent to the system scheduler on the AI SoC. The system scheduler can reasonably distribute all the computing tasks sent by the Host; the specific distribution principle may be to distribute according to the task descriptor carried in each computing task, so as to assign a suitable subsystem and Master to each computing task according to the task type described by the descriptor, thereby better completing the execution or acceleration of each computing task. For example, an AI matrix computing task is assigned to the appropriate AI subsystem and to an idle Master on that AI subsystem, as sketched below.
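A minimal sketch of such a dispatch decision follows, under the assumptions (of this sketch, not details disclosed by the embodiment) that each task descriptor reduces to a task-type enum and that each subsystem advertises the type it accelerates.

```c
/* Hypothetical task and topology types; the real descriptor format is
 * not specified in this embodiment. */
typedef enum { TASK_AI_MATRIX, TASK_VISION, TASK_CPU_CTRL } task_type_t;

typedef struct {
    task_type_t accepted_type;  /* task type this subsystem accelerates */
    int         master_busy[4]; /* 1 = Master is busy, 0 = idle         */
    int         num_masters;
} subsystem_t;

typedef struct { int subsystem; int master; } placement_t;

/* Pick the first subsystem whose type matches the task descriptor, then
 * an idle Master inside it; fall back to Master 0 if all are busy. */
placement_t dispatch(const subsystem_t *subs, int m, task_type_t type)
{
    placement_t p = { -1, -1 };
    for (int s = 0; s < m; s++) {
        if (subs[s].accepted_type != type)
            continue;
        p.subsystem = s;
        p.master = 0;
        for (int k = 0; k < subs[s].num_masters; k++)
            if (!subs[s].master_busy[k]) { p.master = k; break; }
        return p;
    }
    return p;  /* no matching subsystem found */
}
```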
In the present application, the AI system 03 in fig. 1C may also be applied to bandwidth isolation between virtual machine tenants in a virtualization scenario; that is, the AI system in the embodiment of the present application may further include the functions of isolating memory bandwidth between different virtual machine tenants, committing bandwidth, and the like. The following describes how to implement the bandwidth isolation and access control functions between virtual machine tenants in the virtualization scenario in the AI system 03.
In one possible implementation, when the AI system is applied to a virtualization scenario, the AI system includes a plurality of virtual machines, where each of the plurality of virtual machines corresponds to one or more processes, and one process includes one or more computing tasks; the one or more processes run on one or more Masters of at least one of the M subsystems; the system scheduler is further configured to: assign a VM ID to each virtual machine; and share the VM ID of the corresponding virtual machine in the page tables of the one or more processes corresponding to each virtual machine.
In the embodiment of the application, when the AI system is applied to a virtualization scenario, a VM ID is allocated per virtual machine, and all processes under a virtual machine are set to correspond to the same VM ID, so as to isolate different virtual machines from each other, thereby ensuring security isolation and non-interference between the users corresponding to different virtual machines.
In one possible implementation, when the AI system in the embodiment of the present application runs in a virtualization scenario, one process includes one or more computing tasks, and the target subsystem further comprises a system memory management unit (SMMU). The target Master is further configured to: send the memory access request of the computing task to the SMMU, and update the QoS ID carried in the memory access request through the SMMU. The SMMU is configured to: receive the memory access request of the computing task sent by the target Master; determine the target process to which the computing task belongs according to the virtual address and the substream identifier (SSID) in the access request; and determine the VM ID of the target virtual machine corresponding to the target process according to the page table of the target process, replacing the QoS ID carried in the access request with the VM ID of the target virtual machine. Specifically, each virtual machine contains many processes (typically 32), and each process may in turn include many computing tasks. The QoS ID carried by a computing task can therefore be replaced with the VM ID by first determining which process the computing task belongs to, and then looking up which virtual machine that process corresponds to. It should be noted that, when the AI system is applied to the virtualization scenario, the QoS ID carried by each computing task is replaced inside the subsystem; after the replacement is completed, all the processing from the subsystem to the SoC bus, from the SoC bus to the MATA, and from the MATA to the memory controller can remain consistent with the non-virtualized scenario. That is, once the initial QoS ID carried in the computing task has been replaced with the VM ID of the virtual machine to which it belongs, the processing of the non-virtualized scenario still applies: for example, the QoS priority corresponding to a QoS ID may be temporarily adjusted by a sub-scheduler in the subsystem, the temporarily adjusted QoS priority may be restored by the SoC bus, and the final optimization of the QoS priority corresponding to the QoS ID carried in the access request may be performed by the MATA, as described in the related embodiments of fig. 1A to 5B, which are not repeated here.
In the embodiment of the application, when the AI system runs in a virtualization scenario, the original QoS ID allocation and flow are replaced: QoS IDs are uniformly replaced with the ID of the virtual machine to which the process belongs. That is, each Master, through the SMMU inside it, replaces the QoS ID carried in each received memory access request with the VM ID of the virtual machine corresponding to the process to which the computing task belongs. The primary aim in this scenario is bandwidth security isolation, meeting the basic requirements of data isolation, computing resource isolation, and non-interference between virtual machine users. Furthermore, the memory bandwidth isolation and bandwidth commitment problems between users of different virtual machines can be solved. A minimal sketch of this QoS ID rewrite follows.
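The sketch below assumes an illustrative CD-table layout keyed by substream ID; the real SMMU CD format is considerably richer, so the structures here are assumptions for illustration only.

```c
#include <stdint.h>

#define MAX_SSID 32  /* processes per virtual machine (typical) */

typedef struct {
    uint8_t vm_id[MAX_SSID];  /* CD table: substream ID -> owning VM ID */
} smmu_cd_table_t;

typedef struct {
    uint64_t virt_addr;
    uint16_t ssid;    /* substream ID identifying the issuing process */
    uint8_t  qos_id;  /* initial QoS ID carried by the computing task */
} mem_request_t;

/* Replace the untrusted QoS ID with the VM ID configured in the CD table. */
void smmu_rewrite_qos_id(const smmu_cd_table_t *cd, mem_request_t *req)
{
    if (req->ssid < MAX_SSID)
        req->qos_id = cd->vm_id[req->ssid];
}
```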
In one possible implementation, the AI SoC further includes an L2 Cache. The L2 Cache is configured to: receive the access requests of the computing tasks, and access the corresponding storage areas in the L2 Cache according to the QoS IDs carried in those access requests, where access requests carrying different QoS IDs correspond to different storage areas in the L2 Cache.
In the embodiment of the application, the storage area of the cache that each access request may touch can be controlled through the QoS ID carried in the request; that is, the corresponding storage areas in the cache are securely isolated by the QoS ID. Because the processes under each virtual machine carry the corresponding virtual machine ID (the VM ID) as the QoS ID in their access requests, the Cache can be partitioned based on the QoS ID to achieve the security isolation effect in the virtual machine scenario.
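The embodiment does not state which partitioning mechanism the L2 Cache uses; one common realization is way partitioning, sketched below with illustrative way masks.

```c
#include <stdint.h>

#define L2_WAYS 16

/* Hypothetical way-mask table: each QoS ID may only allocate into its own
 * subset of cache ways, so requests with different QoS IDs never evict
 * each other's data. The masks below are illustrative. */
static const uint16_t l2_way_mask[8] = {
    0x000F, 0x00F0, 0x0F00, 0xF000,  /* QoS IDs 0-3: 4 ways each   */
    0x00FF, 0xFF00, 0xFFFF, 0xFFFF,  /* QoS IDs 4-7: larger shares */
};

/* Return 1 if a request with the given QoS ID may allocate into `way`. */
int l2_way_allowed(uint8_t qos_id, int way)
{
    return (l2_way_mask[qos_id & 7] >> way) & 1;
}
```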
In the above virtualization scenario, the computing power of the DAVINCI AI platform may be split into multiple independent computing functional units, generally called virtual functions (VFs), each with independent computing power and resources. Through the hardware virtualization specification SR-IOV (Single Root I/O Virtualization) and mature software virtualization technologies such as QEMU + KVM, a virtual machine with AI computing power may be provided to a client through cloud services. User data isolation and computing resource isolation between different virtual machine users are basic requirements, and anomalies in one virtual machine must not affect another. The embodiment of the application introduces the memory access QoS control technology into the virtualization scenario, which can effectively solve the problems of memory bandwidth isolation and bandwidth commitment between different virtual machine tenants.
In the virtual machine scenario, in order to prevent malicious attacks, the QoS ID information carried in the SQE by an application on the virtual machine is not trusted; the application may still carry different QoS information per service flow in the SQE, but after the SQE reaches the DEVICE side, the QoS ID is replaced with the one configured in the SMMU for that virtual machine, that is, the VM ID.
Referring to fig. 6A, fig. 6A is a schematic software architecture diagram of an AI system in a virtualization scenario provided by an embodiment of the present application. The framework of the software stack shown in fig. 6A is mainly divided into a HOST side (i.e., the host 10 side in the present application) and a DEVICE side (i.e., the AI SoC 20 side in the present application); the software modules inside the HOST side and the DEVICE side and their corresponding functions are described respectively below. The software modules related to the QoS access control function of the AI system in the virtualized application scenario specifically include the following:
(I) HOST side
1. Graph switching and model deployment subsystem/graph generation engine/quality of service management library (MDS/GE/HCCL/libQoSManager)
The functions of these modules are completely consistent with the bare-metal scenario; the only difference is that each virtual machine has its own QoS global configuration table, and the virtual machines do not affect each other.
2. Quality of service adaptation Tools (QoS Adjust Tools)
(1) The tool is provided to the virtual machine user to support adjusting the QoS priorities of different service flows autonomously. After the adjustment is finished, the result is stored in the QoS configuration table of the virtual machine; the QoS values in the table are queried by libQoSManager and returned to GE/HCCL for use.
(2) The user can use the tool to query the bandwidth usage of this virtual machine's user.
(3) The tool does not support QoS ID based bandwidth and QoS configuration delivery functions.
(II) DEVICE side
3. Virtual machine PROCESS (VM PROCESS)
After each virtual machine user process is started, a corresponding process is started on the DEVICE side. After that process starts, it can obtain the SSID on the accelerator to which the process is bound and the VM ID of the virtual machine it belongs to, and it notifies QoSDriver of the SSID and VM ID through the IOCTL interface provided by QoSDriver.
4. Quality of service Driver (QoS Driver)
QoSDriver is basically consistent with the bare-metal scenario, but in the virtualization scenario the implementation of the IOCTL commands it provides differs from bare metal. The main differences lie in the QoS ID configuration flow, described in detail as follows (a simplified sketch of this flow follows the list below):
(1) When a virtual machine process is started, a QoS ID is allocated for the virtual machine, and all processes of the virtual machine share the same QoS ID;
(2) According to the resource configuration set when the virtual machine was created, the interface of the device management driver is called to obtain the maximum bandwidth of the virtual machine, and the QoSConfig interface is called to configure the bandwidth of the QoS ID into a register of the MATA;
(3) QoSDriver adds the QoS ID of the virtual machine into the Monitor to collect actual bandwidth usage data;
(4) When the virtual machine process exits, the reverse QoS resource release and cleanup work needs to be performed by calling QoSDriver through the IOCTL interface: the QoS ID is reclaimed, its statistics are deleted from the Monitor, and the bandwidth and priority allocated to the QoS ID in the MATA are cleared.
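The following minimal C sketch summarizes flows (1) to (4). Every function name is a hypothetical placeholder for the corresponding QoSDriver, Devmm, QoSConfig, and Monitor interfaces; the actual driver interfaces are not disclosed here.

```c
#include <stdint.h>

/* Hypothetical driver-side interfaces for one virtual machine. */
extern uint8_t  qos_id_alloc(uint16_t vm_id);           /* shared per VM   */
extern uint64_t devmm_query_max_bandwidth(uint16_t vm_id);
extern void     qos_config_mata_bandwidth(uint8_t qos_id, uint64_t bw);
extern void     monitor_add(uint8_t qos_id);
extern void     monitor_del(uint8_t qos_id);
extern void     qos_config_mata_clear(uint8_t qos_id);
extern void     qos_id_free(uint8_t qos_id);

/* Flows (1)-(3): called when a virtual machine process starts. */
uint8_t qos_vm_process_start(uint16_t vm_id)
{
    uint8_t  qos_id = qos_id_alloc(vm_id);           /* one ID per VM   */
    uint64_t bw = devmm_query_max_bandwidth(vm_id);  /* from VM config  */
    qos_config_mata_bandwidth(qos_id, bw);           /* MATA register   */
    monitor_add(qos_id);                             /* bandwidth stats */
    return qos_id;
}

/* Flow (4): called when the last process of the virtual machine exits. */
void qos_vm_process_exit(uint8_t qos_id)
{
    monitor_del(qos_id);
    qos_config_mata_clear(qos_id);
    qos_id_free(qos_id);
}
```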
5. Shared virtual memory (Shared Virtual Memory, SVM0) module
SVM0 is responsible for unified, centralized management of the accelerator device drivers used by the virtual machine. The module mainly implements the sharing of virtual addresses between the kernel and user-mode processes and is implemented in the kernel. By calling the interfaces provided by this module, QoSDriver traverses the device drivers of all accelerators used by the virtual machine, calls the QoS ID setting interface provided by SMMU DRV, and configures the QoS ID in the CD table of the SMMU for each Master that accesses HBM memory through virtual addresses.
6. System memory management unit driver (SMMU DRV)
The SMMU DRV is the SMMU driver provided by the kernel and offers the function of configuring the QoS ID in the CD entries of the SMMU. In this scheme, each virtual machine has exactly one QoS ID on each SoC; no matter how many processes the virtual machine has, the SSIDs corresponding to all of its processes are configured with the same QoS ID in the CD table of the SMMU.
7. High bandwidth memory driver (HBM DRV)
The HBM DRV may provide an interface for the QoSDriver module through which the current effective theoretical total HBM bandwidth of the SoC may be obtained. To compute an accurate current theoretical bandwidth, the HBM driver may need to take into account different HBM capacity configurations, different HBM channel enablement, different operating frequencies, and the actual conditions (such as PG and FG) obtained after chip screening.
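As a hedged illustration of such a computation, the sketch below derives a theoretical bandwidth from channel count, data rate, and bus width; the formula and field names are assumptions of this sketch, not the actual HBM DRV logic.

```c
#include <stdint.h>

typedef struct {
    int      enabled_channels;   /* after capacity/screening constraints  */
    uint64_t data_rate_mtps;     /* effective transfers per second (MT/s) */
    int      bus_width_bits;     /* per-channel data bus width            */
} hbm_config_t;

/* Theoretical bandwidth in MB/s = channels * rate (MT/s) * width / 8. */
uint64_t hbm_theoretical_bandwidth_mbps(const hbm_config_t *c)
{
    return (uint64_t)c->enabled_channels * c->data_rate_mtps *
           (uint64_t)c->bus_width_bits / 8u;
}
```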
8. Device-side management module (Device Manage Module, Devmm)
This module is the device-side device management driver (Devmm). It can provide a query interface for QoSDriver to query the computing power ratio of a given VF. Since a VM may have multiple VFs, QoSDriver needs to add up the computing power ratios of all the VFs of the VM to calculate the HBM bandwidth share that should be allocated to the VM, and then configure the bandwidth waterline of the QoS ID corresponding to the VM in the MATA according to the total bandwidth returned by the HBM driver, as sketched below.
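A minimal sketch of this computation, representing compute-power ratios in per-mille (an assumption of the sketch; the real representation is not specified):

```c
#include <stdint.h>

/* Sum the compute-power ratios of all VFs of a VM, then scale the total
 * HBM bandwidth to obtain the VM's bandwidth waterline. */
uint64_t vm_bandwidth_waterline(const uint32_t *vf_ratio_permille,
                                int num_vfs, uint64_t total_hbm_bw)
{
    uint64_t ratio = 0;
    for (int i = 0; i < num_vfs; i++)
        ratio += vf_ratio_permille[i];
    if (ratio > 1000)
        ratio = 1000;  /* a VM cannot be promised more than the total */
    return total_hbm_bw * ratio / 1000u;
}
```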
9. Resource Control (RESCTRL) module
For Load/Store accesses to the HBM initiated directly by the AI CPU, the MPAM-based resource isolation function provided by the Linux OS kernel, i.e., the RESCTRL function, is required. Through this functional interface, the Linux OS can configure QoS IDs for different processes or process groups; when the OS scheduler switches a process in for execution, the CPU MPAM register is configured according to the QoS ID allocated to that process, so that read/write requests to the HBM issued by the CPU carry the QoS ID allocated in advance. The RESCTRL module may provide a QoS ID setup interface for the corresponding virtual machine process on the DEVICE side. Before a DEVICE-side process of the virtual machine calls this interface, it needs to call the interface provided by the QoSDriver module in advance to obtain the QoS ID of the process.
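For illustration, the following sketch binds the calling process to a resctrl group, assuming the standard Linux resctrl filesystem mounted at /sys/fs/resctrl (backed by MPAM on this platform). Naming the group after the QoS ID is an assumption of the sketch, since resctrl itself manages the underlying partition IDs.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Bind the current DEVICE-side process to a per-QoS-ID resctrl group,
 * so the kernel tags its memory accesses via MPAM on context switch. */
int resctrl_bind_self(int qos_id)
{
    char path[128];
    snprintf(path, sizeof(path), "/sys/fs/resctrl/qos%d", qos_id);
    if (mkdir(path, 0755) != 0 && errno != EEXIST)
        return -1;                       /* group may already exist */
    snprintf(path, sizeof(path), "/sys/fs/resctrl/qos%d/tasks", qos_id);
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%d\n", (int)getpid());   /* kernel now tags this process */
    return fclose(f);
}
```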
Referring to fig. 6B, fig. 6B is a schematic diagram of the interaction flow between the software modules of the AI system in a virtualized application scenario according to an embodiment of the present application. Based on the software architecture in fig. 6A, the specific execution flow of each module in the virtualization scenario is described, which may include the following:
(1) After the quality of service driver (QoSDriver) is started, the interface provided by the HBM driver is called to obtain the current total bandwidth of the SoC.
(2) Multiple processes may be started in one VM. Each time the HOST-side virtual machine starts a process, a corresponding virtual machine process is started on the DEVICE side. After the process starts, it needs to open the QoS driver device and invoke an IOCTL command, passing the SSID, VM ID, and VF ID to QoSDriver in order to apply for an available QoS ID for the process; note that all processes belonging to the same VM share one QoS ID.
(3) The IOCTL handler in QoSDriver looks up the device management driver (Devmm) according to the VM ID of the virtual machine and obtains the computing power ratio configuration of the virtual machine. If the virtual machine has multiple VFs, the computing power ratios of all its VFs are added up, and then the bandwidth allocated to the virtual machine is calculated from the total computing power ratio of the virtual machine and the total bandwidth obtained from the HBM driver.
(4) The virtual machine process also needs to call the RESCTRL interface provided by the operating system and set the QoS ID obtained from QoSDriver into the operating system through that interface, so as to inform the OS of the QoS ID of the process. When the OS scheduler schedules the task for execution, the QoS ID of the process is configured into the MPAM registers of the CPU, so that the LOAD/STORE operations of the AI CPU process on the HBM carry the correct QoS ID.
(5) QoSDriver assigns a QoS ID to the virtual machine.
(6) After obtaining the SSID and the QoS ID, QoSDriver calls the interface of SVM0 to find the kernel-mode driver device data structure instances of all the Master devices used by the virtual machine.
(7) Then, the configuration interface provided by the SMMU DRV is called to configure the QoS ID in the CD table entries of the SMMUs of these Masters.
(8) QoSDriver also adds the QoS ID of the virtual machine to the Monitor to collect the actual bandwidth usage data.
(9) Finally, the interface of QoSConfig is called to configure the bandwidth of the QoS ID into the register of the MATA.
(10) When the virtual machine process exits, the reverse QoS resource release and cleanup work is performed by calling QoSDriver through the IOCTL interface.
(11) The QoS ID is reclaimed: the statistics of the QoS ID are deleted from the Monitor, and the bandwidth and priority allocated to the QoS ID in the MATA are cleared.
(12) As for the QoS ID configuration in the CD entries of each Master's SMMU, the SMMU driver automatically perceives the exit of the virtual machine process and automatically completes the related cleanup and release work.
Referring to fig. 7, fig. 7 is a flow chart of a memory access control method provided by an embodiment of the present application. The memory access control method is applied to an artificial intelligence (AI) system, where the AI system includes an AI system on a chip (SoC), the AI SoC includes M subsystems and N memory controllers, and the M subsystems and the N memory controllers are interconnected through an SoC bus; the M subsystems include a target subsystem, the target subsystem is any one of the M subsystems, the target subsystem includes S Masters, and M, N, and S are integers greater than or equal to 1. The memory access control method is applicable to any one of the AI systems shown in fig. 1A-1C and to devices (such as mobile phones, computers, servers, etc.) including such AI systems. The method may include the following steps S701 and S702.
Step S701: receiving a computing task to be executed through a target Master among the S Masters, where the computing task carries a quality of service identifier (QoS ID); generating a memory access request of the computing task, where the memory access request carries the QoS ID; and sending the access request to a target memory controller among the N memory controllers.
Step S702: receiving, by the target memory controller, the access request, and determining a first quality of service QoS priority corresponding to the QoS ID; and performing access QoS control on the access request based on the first QoS priority.
In a possible implementation manner, the computing task further carries a second QoS priority corresponding to the QoS ID, where the second QoS priority is an initial QoS priority corresponding to the QoS ID in the computing task.
In a possible implementation manner, the target subsystem further comprises a sub-scheduler; the sending, by the target Master, the access request to the target memory controller of the N memory controllers includes: sending the access request to the sub-scheduler through the target Master, and scheduling it to a target memory controller among the N memory controllers through the sub-scheduler. The method further comprises the steps of: receiving, through the sub-scheduler, the access requests sent by the S Masters in the target subsystem respectively; and scheduling the access requests sent by the S Masters to the SoC bus according to the second QoS priorities corresponding to the QoS IDs carried in the access requests sent by the S Masters respectively, where the second QoS priority is the initial QoS priority corresponding to the QoS ID and is used to indicate the priority with which the corresponding access request is scheduled to the SoC bus.
In a possible implementation manner, the scheduling, by the sub-scheduler, the access requests sent by the S masters respectively to the SoC bus according to the second QoS priorities corresponding to the QoS IDs carried in the access requests sent by the S masters respectively includes: respectively establishing task queues for the S masters through the sub-schedulers, wherein each task queue comprises a memory access request sent by the corresponding Master; wherein, the target Master corresponds to a target task queue; when a target access request is currently inserted into the target task queue, respectively lifting the second QoS priorities corresponding to the QoS IDs carried in all the access requests in the target task queue to third QoS priorities, wherein the target access request is an access request with the second QoS priorities corresponding to the QoS IDs exceeding a preset priority; and according to the second QoS priority or the third QoS priority corresponding to the QoS ID carried in the memory access requests in the task queues of the S masters, the memory access requests in the task queues of the S masters are sequentially sent to the SoC bus.
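As an illustration of this escalation mechanism, the following minimal C sketch models one per-Master task queue; the queue layout, the watermark parameter, and the priority encoding (larger value = higher priority) are assumptions of the sketch, not the embodiment's actual hardware design.

```c
#include <stdint.h>

#define QUEUE_DEPTH 64

typedef struct {
    uint8_t qos_prio[QUEUE_DEPTH];  /* second QoS priority per request */
    int     count;
} task_queue_t;

/* When a request whose priority exceeds `high_watermark` is inserted,
 * promote every pending request in the same Master's queue to the third
 * (elevated) priority, so the urgent request is not blocked behind them
 * (head-of-line blocking). The SoC bus later restores the second
 * priorities, per the implementation described in the text. */
void queue_insert(task_queue_t *q, uint8_t prio,
                  uint8_t high_watermark, uint8_t elevated_prio)
{
    if (prio > high_watermark)
        for (int i = 0; i < q->count; i++)
            if (q->qos_prio[i] < elevated_prio)
                q->qos_prio[i] = elevated_prio;  /* third QoS priority */
    if (q->count < QUEUE_DEPTH)
        q->qos_prio[q->count++] = prio;
}
```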
In one possible implementation, the method further includes: receiving one or more access requests in the target task queue sent by the sub-scheduler through the SoC bus, wherein the one or more access requests comprise the access request; and restoring the third QoS priority corresponding to the QoS ID carried in one or more access requests in the target task queue to the corresponding second QoS priority.
In one possible implementation, the method further includes: and scheduling one or more memory requests in the target task queue to corresponding memory controllers in the N memory controllers respectively based on the second QoS priority after the recovery of the one or more memory requests in the target task queue through the SoC bus.
In one possible implementation, the AI SoC further includes an advanced memory access agent MATA: the scheduling, by the SoC bus, the one or more memory access requests in the target task queue to corresponding memory controllers in the N memory controllers includes: and sending one or more memory access requests in the target task queue to the MATA through the SoC bus, and respectively scheduling the one or more memory access requests to corresponding memory controllers in the N memory controllers through the MATA. Optionally, in another possible implementation manner, the AI SoC further includes an advanced memory access agent MATA: the SoC bus is specifically configured to: and sending the memory access requests respectively sent by the S masters to the MATA, and respectively dispatching the memory access requests respectively sent by the S masters to corresponding memory controllers in the N memory controllers through the MATA, wherein the memory access requests respectively sent by the S masters comprise the memory access requests.
In one possible implementation, the method further includes: receiving the access request through the MATA, and determining the second QoS priority corresponding to the QoS ID carried in the access request; determining the first QoS priority corresponding to the QoS ID based on the second QoS priority corresponding to the QoS ID, in combination with historical memory bandwidth statistics corresponding to the QoS ID, and a memory policy control parameter corresponding to the QoS ID, where the memory policy control parameter includes one or more of a highest bandwidth, a lowest bandwidth, and an access priority that allow an access request to pass.
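The arbitration the MATA performs can be pictured with the following minimal sketch: the final (first) QoS priority is derived from the initial (second) priority, the measured bandwidth history, and the per-QoS-ID policy parameters. The thresholding logic and field names are illustrative assumptions, not the actual MATA microarchitecture.

```c
#include <stdint.h>

typedef struct {
    uint64_t max_bw;    /* highest bandwidth allowed to pass */
    uint64_t min_bw;    /* lowest (promised) bandwidth       */
    uint8_t  base_prio; /* configured access priority        */
} mata_policy_t;

/* Priorities 0-7, larger = higher (an assumption of this sketch). */
uint8_t mata_first_priority(uint8_t second_prio, uint64_t measured_bw,
                            const mata_policy_t *p)
{
    if (measured_bw > p->max_bw)
        return 0;                         /* over budget: demote hard     */
    if (measured_bw < p->min_bw)          /* under promised bandwidth:    */
        return second_prio < 7 ? second_prio + 1 : 7;  /* boost priority */
    return second_prio > p->base_prio ? p->base_prio : second_prio;
}
```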
In one possible implementation, the method further includes: presetting access policy control parameters corresponding to each QoS ID through the MATA, and counting and recording historical memory bandwidths corresponding to each QoS ID; and updating and optimizing the access policy control parameters corresponding to each QoS ID according to the access performance real-time monitoring information of the AI system.
In one possible implementation, the method further includes: and carrying the first QoS priority in the access request through the MATA, and scheduling the access request to the target memory controller based on the first QoS priority. Optionally, in another possible implementation manner, the AI SoC further includes a MATA; the method further comprises the steps of: and carrying the determined first QoS priority in the access request through the MATA, and scheduling the access request to the target memory controller based on the first QoS priority.
In one possible implementation manner, the performing, by the target memory controller, access QoS control on the access request based on the first QoS priority includes: and performing access QoS control on the access request by the target memory controller based on the first QoS priority corresponding to the QoS ID and combining the access service condition of the target memory controller, wherein the access service condition comprises a memory access time sequence requirement or a memory bandwidth bus utilization rate.
In one possible implementation, the method further includes: when the number of access requests received by the target memory controller is greater than a preset threshold, broadcasting a back-pressure indication to the M subsystems through the target memory controller, where the back-pressure indication instructs one or more of the M subsystems to delay, reduce, or stop sending access requests.
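A minimal sketch of such a back-pressure broadcast follows; the signal representation and the threshold check are assumptions made for illustration.

```c
#include <stdbool.h>

#define M_SUBSYSTEMS 8

/* One flag per subsystem; in hardware this would be a sideband signal. */
typedef struct { volatile bool backpressure[M_SUBSYSTEMS]; } soc_signals_t;

/* When the memory controller's pending-request count crosses the
 * threshold, broadcast the indication so subsystems delay, reduce,
 * or stop issuing new requests; clear it once pressure subsides. */
void memctrl_check_pressure(soc_signals_t *sig, int pending, int threshold)
{
    bool on = pending > threshold;
    for (int s = 0; s < M_SUBSYSTEMS; s++)
        sig->backpressure[s] = on;   /* broadcast to all M subsystems */
}
```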
In one possible implementation, the AI system further includes a host; the method further comprises the steps of: receiving a task to be executed through the host, and splitting it into one or more computing tasks to be executed; identifying the service flow type of each of the split computing tasks according to a preset service flow label table, where the preset service flow label table includes a mapping between predefined service flow types of computing tasks and QoS IDs; and attaching the corresponding QoS ID to each of the one or more computing tasks to be executed according to the identification result.
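For illustration, a minimal sketch of a preset service-flow label table lookup follows; the flow types and QoS ID values are hypothetical, since the table contents are deployment-specific.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical service-flow types and their QoS ID mapping. */
typedef enum { FLOW_AI_COMPUTE, FLOW_COLLECTIVE_COMM, FLOW_PREPROCESS } flow_t;

typedef struct { flow_t flow; uint8_t qos_id; } flow_label_t;

static const flow_label_t label_table[] = {
    { FLOW_AI_COMPUTE,      5 },
    { FLOW_COLLECTIVE_COMM, 3 },
    { FLOW_PREPROCESS,      1 },
};

/* Return the QoS ID for a flow type, or a default best-effort ID. */
uint8_t qos_id_for_flow(flow_t flow)
{
    for (size_t i = 0; i < sizeof label_table / sizeof label_table[0]; i++)
        if (label_table[i].flow == flow)
            return label_table[i].qos_id;
    return 0;
}
```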
In one possible implementation, the AI SoC further includes a system scheduler; the method further comprises the steps of: and sending, by the host, one or more computing tasks carrying the corresponding QoS IDs to the system scheduler.
In one possible implementation, the method further includes: configuring, in advance, a corresponding second QoS priority for the QoS ID in the computing task by the host or by the target Master, where the second QoS priority is the initial priority corresponding to the QoS ID.
In one possible implementation, the method further includes: and updating and optimizing the second QoS priority corresponding to each QoS ID through the host according to the access performance real-time monitoring information of the AI system.
In one possible implementation, the method further includes: receiving, by the system scheduler, the one or more computing tasks to be executed sent by the host, where each computing task to be executed further carries a task descriptor describing the type of the computing task; selecting, for each computing task to be executed, a matching subsystem from the M subsystems according to the task descriptor carried in that task, and selecting a matching Master from the one or more Masters in the matching subsystem; and dispatching each computing task to be executed to the matching Master in the matching subsystem.
In one possible implementation, when the AI system is applied to a virtualization scenario, the AI system includes a plurality of virtual machines, where each of the plurality of virtual machines corresponds to one or more processes, and one process includes one or more computing tasks; the one or more processes run on one or more Masters of at least one of the M subsystems; the method further comprises the steps of: allocating a VM ID for each virtual machine through the system scheduler; and sharing the VM ID of the corresponding virtual machine in the page tables of the one or more processes corresponding to each virtual machine.
In one possible implementation, when the AI system runs in a virtualization scenario, the target subsystem further includes a system memory management unit (SMMU); the method further comprises the steps of: sending the memory access request of the computing task to the SMMU through the target Master, and updating the QoS ID carried in the memory access request of the computing task through the SMMU; receiving, through the SMMU, the memory access request of the computing task sent by the target Master; determining the target process to which the computing task belongs according to the virtual address and the substream identifier (SSID) in the access request; and determining the VM ID of the target virtual machine corresponding to the target process according to the page table of the target process, and replacing the QoS ID carried in the access request of the computing task with the VM ID of the target virtual machine.
In one possible implementation, the AI SoC further includes an L2 Cache; the method further comprises the steps of: and receiving access requests of all computing tasks through the L2 Cache, and accessing corresponding storage areas in the L2 Cache according to QoS IDs carried in the access requests of all computing tasks, wherein the access requests carrying different QoS IDs correspond to different storage areas in the L2 Cache.
It should be noted that, for a specific flow of the memory access control method described in the embodiment of the present application, reference may be made to the related description in the application embodiment described in fig. 1A to fig. 6B, which is not repeated herein.
The embodiment of the present application also provides a computer-readable storage medium, where the computer-readable storage medium may store a program which, when executed by an AI system, performs some or all of the steps described in any one of the above method embodiments.
The embodiment of the application also provides a computer program comprising instructions which, when executed by the AI system, cause the AI system to execute some or all of the steps of any of the above memory access control methods.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present application is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present application.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, such as the above-described division of units, merely a division of logic functions, and there may be additional manners of dividing in actual implementation, such as multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, or may be in electrical or other forms.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units described above, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on this understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server or a network device, etc., in particular may be a processor in the computer device) to perform all or part of the steps of the above-mentioned method of the various embodiments of the present application. Wherein the aforementioned storage medium may comprise: various media capable of storing program codes, such as a U disk, a removable hard disk, a magnetic disk, a compact disk, a Read-Only Memory (abbreviated as ROM), or a random access Memory (Random Access Memory, abbreviated as RAM), are provided.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.