Detailed Description
To further illustrate the technical means adopted by the present invention to achieve the predetermined objects, and the effects thereof, the packet processing method based on a heterogeneous structure system according to the present invention is described in detail below with reference to the accompanying drawings and preferred embodiments, as to its specific implementation, structure, features and effects. In the following description, different references to "one embodiment" or "another embodiment" do not necessarily refer to the same embodiment. Furthermore, particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
The following describes a specific scheme of the packet processing method based on the heterogeneous structure system provided by the present invention in detail with reference to the accompanying drawings.
Referring to fig. 1, a structure of a heterogeneous structure system applied in an embodiment of the present invention is shown, where the heterogeneous structure system includes two major parts: a central processing unit and a plurality of asynchronous devices. The Central Processing Unit (CPU) is responsible for control and is used for dispatching tasks and performing complex control processing; the asynchronous device performs calculations according to the commands of the CPU.
The central processing unit comprises a CPU system main memory and an address management unit. A program initiated by the CPU may be broken down into tens of thousands of command queues, which are stored in the CPU main memory or the device memory. Each command queue is a packet queue made up of several packets and is mapped onto a command queue in a queue group in the command control unit of one of the asynchronous devices. The address management unit is responsible for the management and interpretation of the final physical address.
Each asynchronous device includes a command control unit, a plurality of parallel computing execution units, a virtual address translation secondary cache, a multi-level cache, a PCIE interface, an on-chip high-bandwidth memory, and a resource allocation module (not shown in the figure). The command control unit is responsible for receiving and splitting commands and issuing them to the parallel computing execution units; the parallel computing execution units execute the corresponding tasks; the virtual address translation secondary cache temporarily exchanges data between the virtual address translation primary cache and the address management unit on the CPU side; the multi-level cache is the cache of the data reading channel; the on-chip high-bandwidth memory is the memory on the current asynchronous device side and is also called the device memory; PCIE is a high-speed serial computer expansion bus standard, and the central processing unit and the asynchronous devices communicate through the PCIE interface.
The command control unit comprises a plurality of microprocessors and a hardware processing module; the microprocessors are functional modules realized by software programming, and the hardware processing module is formed by hardware circuits. The hardware processing module comprises a pre-fetching scheduling module, a virtual address translation first-level cache, a reading module, a plurality of work group dispatching units, and a reordering cache unit for storing a command queue group. Because a microprocessor can serve several processes at the same time, and each process is decomposed into several command queues, a pre-fetching scheduling module is needed to schedule the corresponding command packets. Meanwhile, to avoid occupying the microprocessor for a long time, the embodiment of the invention also pre-fetches the process object in advance, so the pre-fetching scheduling module likewise covers the pre-fetching and scheduling of process objects. The virtual address translation primary cache, the virtual address translation secondary cache and the address management unit together form an address translation system for translating virtual addresses into physical addresses. The reading module reads out the packet or the process object through the multi-level cache according to the obtained physical address. The reordering cache unit temporarily stores the command queues and the process object queues. The work group dispatching unit configures the corresponding registers and dispatches the grouped tasks.
Specifically, referring to fig. 2, after the CPU notifies the pre-fetch scheduling module of the corresponding packet queue information, the pre-fetch scheduling module generates a packet-read virtual address, the address translation system translates the packet virtual address into a physical address, and the read module issues a packet read command to the multi-level cache; the multi-level cache fetches the packet from the system main memory on the CPU side and returns it to the reordering cache unit, where the read command packet is added to the corresponding command queue. After the command packet is returned, when the reordering cache unit determines that the returned command packet is a dispatch packet, it parses the dispatch packet to obtain the virtual address of the process object and sends that virtual address to the pre-fetch scheduling module; the address translation system then translates the virtual address of the process object into its physical address. The read module issues a read command for the process object to the multi-level cache, and the multi-level cache reads the process object from the system main memory or the device memory and returns it to the reordering cache unit. The reordering cache unit adds the process object to the process object queue and informs the microprocessor that the packet is ready. The microprocessor parses the command packet and, when the identification carried in the command packet indicates a dispatch packet, parses the corresponding process object. Because the dispatch packet and its process object are read before the microprocessor parses them, once the microprocessor has acquired the dispatch packet and its corresponding process object, it configures the dispatch registers for the workgroup dispatch unit; the workgroup dispatch unit divides the process into a plurality of workgroups and issues them to idle computing execution units, and the computing execution units read the process instructions and data from the system main memory and the device memory according to the dispatch registers and execute the corresponding instructions. The resource allocation module selects an idle computing execution unit to receive a workgroup.
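The flow above can be condensed into a short, purely illustrative model. The dictionary-based "memory", the virtual address strings, and all function and field names below are hypothetical stand-ins for the hardware stages (address translation and the multi-level cache read are modeled as a single lookup); this is a sketch of the ordering of the steps, not of the actual circuits.

```python
from collections import deque

# Hypothetical backing store: packet and process-object contents keyed by
# virtual address. A real system would translate the address and read
# through the multi-level cache instead.
MEMORY = {
    "vaddr_pkt": {"type": "dispatch", "obj_vaddr": "vaddr_obj"},
    "vaddr_obj": {"start_addr": 0x4000, "hw_config": "default"},
}

def hardware_prefetch(packet_vaddrs):
    """Model of the hardware side: prefetch each packet and, for dispatch
    packets, prefetch its process object, before the microprocessor runs."""
    command_queue, object_queue = deque(), deque()
    for va in packet_vaddrs:
        pkt = MEMORY[va]                  # translate + read, modeled as a lookup
        command_queue.append(pkt)         # reordering cache unit: command queue
        if pkt["type"] == "dispatch":     # unit parses the returned packet
            # prefetch the process object named by the dispatch packet
            object_queue.append(MEMORY[pkt["obj_vaddr"]])
    return command_queue, object_queue    # microprocessor notified: ready

cmds, objs = hardware_prefetch(["vaddr_pkt"])
```

By the time the microprocessor is notified, both queues are populated, so parsing a dispatch packet never blocks on an address translation or a memory read.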
Referring to fig. 3, a flowchart of a packet processing method based on a heterogeneous structure system according to an embodiment of the present invention is shown, where the packet processing method is applied to a command control unit of an asynchronous device, where the command control unit includes a microprocessor and a hardware processing module, and the method includes the following steps:
Step S001: after a read command packet is returned, the returned command packet is added to a command queue of the reordering cache unit; the reordering cache unit parses the command packet and, when the command packet is a dispatch packet, sends the virtual address of the process object carried in the dispatch packet to the pre-fetching scheduling module. The pre-fetching scheduling module sends the virtual address of the process object to the address translation system to obtain the physical address of the process object and then fetches the process object. The reordering cache unit adds the acquired process object to its process object queue and informs the microprocessor that the command packet is ready.
The AQL packets in an HSA heterogeneous system take various forms, such as dispatch packets, custom packets, and synchronization packets. The tasks issued by the CPU are mainly completed by dispatch packets, which account for the largest proportion of the command packets in a command queue; among the command packets, only a dispatch packet includes the virtual address of a process object, and the actual process object is stored in the device memory or the system main memory.
In the conventional packet processing method, the microprocessor must wait for the return of a request to read the process object, and that read requires a virtual-to-physical address translation followed by a long read from the device memory through the multi-level cache. In the embodiment of the invention, by contrast, the parsing of the dispatch packet and the acquisition of the process object are handled by the hardware processing module, so that while the microprocessor is dispatching the previous command packet, the hardware processing module can parse the next dispatch packet and acquire its process object in advance. After the microprocessor finishes dispatching the previous dispatch packet, it can directly parse the next dispatch packet, whose process object the hardware processing module has already obtained, and perform a new dispatch. Because the packets and process objects are pre-fetched by the hardware processing module, the microprocessor does not need to wait for a process object and can process dispatch packets directly. This greatly improves the efficiency with which the computing execution units complete small-task dispatch packets, fully exploits the speed of the hardware processing module at parsing and processing packets, and greatly shortens the processing time required for packet parsing and packet processing.
Referring to fig. 6, the process of reading a command packet is almost the same as the process of reading a process object: the virtual address of the packet or process object is translated into a physical address, and the packet or process object is then read according to that physical address. Translating the virtual address into the physical address comprises the following steps: the generated virtual address is looked up in the virtual address translation first-level cache, and if the match succeeds, the physical address is returned; if the match fails, the virtual address translation second-level cache is queried, and if that match also fails, the request is forwarded to the address management unit on the CPU side, which yields the final physical address. Reading a packet or process object proceeds as follows: after the physical address is obtained, the device memory is accessed through the multi-level cache to read and return the process object, or the system main memory is accessed through the multi-level cache to read and return the packet.
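The three-level translation path just described can be sketched as a chain of lookups. The cache contents, the fill policy on a miss, and the `address_management_unit` callback below are illustrative assumptions, not the actual hardware tables.

```python
def translate(vaddr, l1_tlb, l2_tlb, address_management_unit):
    """Return the physical address for vaddr, probing each level in order."""
    if vaddr in l1_tlb:                      # hit in translation first-level cache
        return l1_tlb[vaddr]
    if vaddr in l2_tlb:                      # L1 miss, hit in second-level cache
        l1_tlb[vaddr] = l2_tlb[vaddr]        # fill L1 so later lookups hit faster
        return l2_tlb[vaddr]
    paddr = address_management_unit(vaddr)   # final authority on the CPU side
    l2_tlb[vaddr] = paddr                    # fill both cache levels on the way back
    l1_tlb[vaddr] = paddr
    return paddr
```

A miss at every level is the slow path described above: it must reach the address management unit on the CPU side, which is why translations for different packets can return out of order.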
The reordering cache unit comprises a plurality of command queues and process object queues; each command queue comprises a plurality of command packets, and each process object queue comprises a plurality of process objects. Each queue follows the first-in first-out principle: the microprocessor reads the command packets in a command queue in their natural order, and when a command packet is a dispatch packet, reads the process objects in the process object queue in the same natural order. For example, the command queue is filled in the order in which the packet virtual addresses are generated. If the virtual addresses are generated in the order packet 1 virtual address, packet 2 virtual address, packet 3 virtual address, ..., packet n virtual address, then the command queue is: packet 1, packet 2, packet 3, ..., packet n. Correspondingly, if packet 1, packet 3, ..., packet m are dispatch packets, where m is less than n, then the process object queue holds the process objects of packet 1, packet 3, ..., packet m, and the order of the process objects in the process object queue is the same as the order of the dispatch packets in the command queue. Even if the packets and the process objects are returned out of order, the reordering cache unit adds them to the corresponding queues according to the natural order.
After the dispatch package and the corresponding process objects are respectively placed into the corresponding queues, the microprocessor is informed that the package is ready, so that the occupied time of the microprocessor can be reduced.
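The reordering behavior can be illustrated with a small slot buffer: returns are placed into the slot matching their issue order, and the consumer only drains from the head, so out-of-order completions never reach the microprocessor out of sequence. The class and method names are hypothetical; this is a minimal sketch of the ordering discipline, not of the hardware buffer.

```python
class ReorderBuffer:
    """Slots indexed by issue (natural) order. Out-of-order returns land
    in their own slot; the consumer pops strictly in natural order."""

    def __init__(self, depth):
        self.slots = [None] * depth   # one slot per outstanding request
        self.head = 0                 # next slot the consumer may take

    def complete(self, seq, packet):
        """A read has returned, possibly out of order."""
        self.slots[seq] = packet

    def pop_ready(self):
        """Return the contiguous run of ready packets, in natural order."""
        out = []
        while self.head < len(self.slots) and self.slots[self.head] is not None:
            out.append(self.slots[self.head])
            self.head += 1
        return out
```

If packet 2 returns before packet 1, `pop_ready` yields nothing until packet 1 arrives, matching the natural-order guarantee described above.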
Step S002, the microprocessor analyzes the command packet, and analyzes the corresponding process object when the command packet is a dispatch packet, wherein each process comprises a plurality of working groups; the microprocessor configures a register for the work group dispatching unit, the work group dispatching unit divides the process into a plurality of work groups and then sends the work groups to the idle calculation execution unit, and the calculation execution unit reads corresponding process instructions and data according to the register and then executes the process instructions.
Specifically, because the command packets sent by the CPU come in multiple formats, and the packet header in each command packet carries the format definition, the process object in the process object queue needs to be fetched only when the command packet is parsed as a dispatch packet; otherwise, no process object needs to be taken out. Therefore, each time the microprocessor reads a command packet, it judges whether the packet is a dispatch packet; if so, the corresponding process object is taken out of the process object queue, and both the process object queue and the command queue are filled and drained in order. At this point the microprocessor can acquire the process object at the same time as it acquires the corresponding dispatch packet, so the address translation and reading of the process object no longer occupy the microprocessor, and the idle waiting caused by the microprocessor being occupied is eliminated. Moreover, because the microprocessor is realized by software programming, although it runs fast, the latency of its read-write interaction with external hardware is large, whereas the hardware processing module is a hardware circuit whose processing speed is far higher than that of software. Completing the pre-fetching of packets and of the process objects of dispatch packets in hardware in advance means the microprocessor software does not have to wait for the process object and can process the dispatch packet directly, which greatly improves the execution efficiency of the computing execution units on small-task dispatch packets.
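One consumption step of the microprocessor can be sketched as follows. The `type` field standing in for the packet-header format definition, and the queue contents, are hypothetical; the point is that only a dispatch packet pops a process object, and both queues are drained in the same FIFO order.

```python
from collections import deque

def consume(command_queue, process_object_queue):
    """One microprocessor step: pop the next command packet in FIFO order
    and, only if the header marks it as a dispatch packet, pop the
    matching process object (queued in the same relative order)."""
    pkt = command_queue.popleft()
    if pkt["type"] == "dispatch":
        return pkt, process_object_queue.popleft()
    return pkt, None                 # non-dispatch packets carry no object
```

Because both queues preserve the natural order, the i-th dispatch packet always pairs with the i-th process object, with no bookkeeping needed in the microprocessor.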
Each dispatch packet corresponds to one process object, each process comprises a plurality of work groups, each work group comprises a plurality of waves, each wave comprises a plurality of threads, and the threads are the minimum units for task execution, so that each work group comprises a plurality of tasks. The dispatch package includes information such as standard task amount of the workgroup, task amount of the current dispatch package, and the like, in addition to the virtual address of the process object. The process object includes a start address of the process command execution, an offset address of the start of the process command execution, and configuration parameters necessary for the process execution, such as hardware resource configuration.
The step of dividing the process into a plurality of workgroups specifically comprises: obtaining, from the dispatch packet, the standard task quantity of a workgroup and the task quantity of the current dispatch packet, and grouping the tasks of the current dispatch packet according to the standard workgroup task quantity. For example, when the standard task quantity of a workgroup is A and the current task quantity is B, the tasks are divided into B/A groups.
Referring to fig. 4 and fig. 5, to further illustrate the beneficial effects of the present invention, the prior art is compared with the present solution. Fig. 4 shows the conventional processing procedure, whose difference from the present solution is this: the packet is pre-fetched before the microprocessor parses it, but when the command packet is a dispatch packet, the microprocessor itself must obtain the corresponding process object according to the virtual address of the process object carried in the dispatch packet. In fig. 4, the first line shows the processing of the packet prefetch scheduling module, the second line shows the processing of the reorder buffer unit, and the third line shows the processing of the microprocessor. Specifically, the processing of the packet prefetch scheduling module includes: generating the virtual address of packet 1, translating the virtual address of packet 1, and reading packet 1; generating the virtual address of packet 2, translating the virtual address of packet 2, and reading packet 2; and so on, up to translating the virtual address of packet n and reading packet n, all executed in sequence.
For the processing of the reordering cache unit: the latency of address translation may cause the physical address of packet 2 to return earlier than that of packet 1. For example, if a page table entry (Entry) for packet 2 exists in the virtual address translation first-level cache but the entry for packet 1 does not, the physical address of packet 2 returns quickly while packet 1 must spend a long time walking the multi-level page table, so its translation returns later than that of packet 2; the return order of virtual address translations is therefore out of order. The latency of reading a packet may likewise cause packet 2 to return earlier than packet 1: a packet may reside in the device memory or in the system main memory, and if packet 2 is in the device memory while packet 1 is in the system main memory, packet 2 returns quickly while packet 1 must access the system main memory on the CPU side, so packet 1 returns later than packet 2 and the packet return order is also out of order. The reordering cache unit, however, reorders the returned packets and adds them to the corresponding command queue in the natural order.
For the processing of the microprocessor: the microprocessor also reads the command packets in their natural order and executes them in sequence, and only when it reads a dispatch packet does it use the virtual address of the process object in the dispatch packet for address translation and reading. Assuming that task 2 of packet 1 is reading the process object, the process object of packet 1 is returned to the reordering cache unit, and the microprocessor can proceed to the next work only after acquiring the process object of packet 1. Fig. 5 shows the packet processing procedure after the improvement provided by the embodiment of the present invention. In fig. 5, the first line represents the processing of the pre-fetch scheduling module, the second line the processing of the reorder buffer unit, the third line the processing of the microprocessor, and the fourth line the processing of the execution unit. For the processing of the prefetch module: unlike fig. 4, the improved method also reads the process object of a dispatch packet in advance; that is, after packet 1 and packet 2 return to the reorder buffer unit, the reorder buffer unit parses packet 1 and packet 2 and judges whether each is a dispatch packet, and if so, the prefetch scheduling module initiates a further request to read the process object according to the virtual address of the process object carried in the dispatch packet.
For the process of the reordering buffer unit: because reading the packet and reading the process object require a certain time, the returned result sequence to the reordering buffer unit is disordered, but the reordering buffer unit puts the packet and the process object returned out of sequence into the corresponding queue according to the natural sequence for the microprocessor to read. For the processing of the microprocessor: because the package and the process object are ready, the corresponding process object can be obtained while the dispatch package is analyzed, and the task corresponding to the dispatch package can be distributed to the corresponding parallel computing execution unit for execution without waiting.
In summary, the embodiment of the present invention provides a packet processing method based on a heterogeneous structure system. Before the microprocessor parses a packet, the command packet, and the process object corresponding to a dispatch packet, are prefetched in advance: the command packet is added to a command queue and the process object is added to the corresponding process object queue, so that when the microprocessor parses a dispatch packet it can obtain the corresponding process object immediately and dispatch the corresponding task. This solves the prior art problem that, when parsing a dispatch packet, the microprocessor must wait for the translation of the process object's physical address and for the lengthy read of the process object according to that physical address, thereby improving the operating efficiency of the microprocessor.
It should be noted that: the precedence order of the above embodiments of the present invention is only for description, and does not represent the merits of the embodiments. And specific embodiments thereof have been described above. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.