Detailed Description
To further illustrate the technical means adopted by the present invention to achieve the predetermined objects, and the effects thereof, the packet processing method based on a heterogeneous structure system according to the present invention is described in detail below with reference to the accompanying drawings and preferred embodiments, as to its specific implementation, structure, features and effects. In the following description, different references to "one embodiment" or "another embodiment" do not necessarily refer to the same embodiment. Furthermore, particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
The following describes a specific scheme of the packet processing method based on the heterogeneous structure system provided by the present invention in detail with reference to the accompanying drawings.
Referring to fig. 1, a structure of a heterogeneous structure system applied in an embodiment of the present invention is shown, where the heterogeneous structure system includes two major parts: a central processing unit and a plurality of asynchronous devices. The Central Processing Unit (CPU) is responsible for control and is used for dispatching tasks and performing complex control processing; the asynchronous device performs calculations according to the commands of the CPU.
The central processing unit comprises a CPU system main memory and an address management unit. A program initiated by the CPU may be broken down into tens of thousands of command queues, which are stored in the CPU main memory or the device memory. Each command queue is a packet queue made up of several packets and is mapped onto a command queue in a queue group in the command control unit of one of the asynchronous devices. The address management unit is responsible for the management and interpretation of the final physical address.
Each asynchronous device includes a command control unit, a plurality of parallel computing execution units, a virtual address translation secondary cache, a multi-level cache, a PCIE interface, an on-chip high-bandwidth memory, and a resource allocation module (not shown in the figure). The command control unit is responsible for receiving and splitting commands and issuing them to the parallel computing execution units; the parallel computing execution units execute the corresponding tasks; the virtual address translation secondary cache temporarily exchanges data between the virtual address translation primary cache and the address management unit on the CPU side; the multi-level cache is the cache of the data reading channel; the on-chip high-bandwidth memory is the memory on the current asynchronous device side and is also called the device memory; PCIE is a high-speed serial computer expansion bus standard, and the central processing unit and the asynchronous devices communicate through the PCIE interface.
The command control unit comprises a plurality of microprocessors and a hardware processing module; the microprocessors are functional modules realized by software programming, and the hardware processing module is formed by hardware circuits. The hardware processing module comprises a pre-fetching scheduling module, a virtual address translation first-level cache, a reading module, a plurality of work group dispatching units, and a reordering cache unit for storing a command queue group. Because a microprocessor can serve several processes at the same time, and each process is decomposed into several command queues, a pre-fetching scheduling module is needed to schedule the corresponding command packets. Meanwhile, to avoid occupying the microprocessor for a long time, the embodiment of the invention also pre-fetches the process object in advance, so the pre-fetching scheduling module likewise covers the pre-fetching and scheduling of process objects. The virtual address translation primary cache, the virtual address translation secondary cache and the address management unit together form an address translation system for translating virtual addresses into physical addresses. The reading module reads out the packet or the process object through the multi-level cache according to the obtained physical address. The reordering cache unit temporarily stores the command queues and the process object queues. The work group dispatching unit configures the corresponding registers and dispatches the grouped tasks.
Specifically, referring to fig. 2, after the CPU notifies the pre-fetch scheduling module of the corresponding packet queue information, the pre-fetch scheduling module generates a packet-read virtual address, the address translation system translates the packet virtual address into a physical address, and the read module issues a packet read command to the multi-level cache; the multi-level cache fetches the packet from the system main memory on the CPU side and returns it to the reordering cache unit, where the read command packet is added to the corresponding command queue. After the command packet is returned, when the reordering cache unit determines that the returned command packet is a dispatch packet, it parses the dispatch packet to obtain the virtual address of the process object and sends that virtual address to the pre-fetch scheduling module; the address translation system then translates the virtual address of the process object into its physical address. The read module issues a read command for the process object to the multi-level cache, and the multi-level cache reads the process object from the system main memory or the device memory and returns it to the reordering cache unit. The reordering cache unit adds the process object to the process object queue and informs the microprocessor that the packet is ready. The microprocessor parses the command packet and, when the identification carried in the command packet indicates a dispatch packet, parses the corresponding process object. Because the dispatch packet and its process object are read before the microprocessor parses them, once the microprocessor has acquired the dispatch packet and its corresponding process object, it configures the dispatch registers for the workgroup dispatch unit; the workgroup dispatch unit divides the process into a plurality of workgroups and issues them to idle computing execution units, and the computing execution units read the process instructions and data from the system main memory and the device memory according to the dispatch registers and execute the corresponding instructions. The resource allocation module selects an idle computing execution unit to receive a workgroup.
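The flow above can be condensed into a short, purely illustrative model. The dictionary-based "memory", the virtual address strings, and all function and field names below are hypothetical stand-ins for the hardware stages (address translation and the multi-level cache read are modeled as a single lookup); this is a sketch of the ordering of the steps, not of the actual circuits.

```python
from collections import deque

# Hypothetical backing store: packet and process-object contents keyed by
# virtual address. A real system would translate the address and read
# through the multi-level cache instead.
MEMORY = {
    "vaddr_pkt": {"type": "dispatch", "obj_vaddr": "vaddr_obj"},
    "vaddr_obj": {"start_addr": 0x4000, "hw_config": "default"},
}

def hardware_prefetch(packet_vaddrs):
    """Model of the hardware side: prefetch each packet and, for dispatch
    packets, prefetch its process object, before the microprocessor runs."""
    command_queue, object_queue = deque(), deque()
    for va in packet_vaddrs:
        pkt = MEMORY[va]                  # translate + read, modeled as a lookup
        command_queue.append(pkt)         # reordering cache unit: command queue
        if pkt["type"] == "dispatch":     # unit parses the returned packet
            # prefetch the process object named by the dispatch packet
            object_queue.append(MEMORY[pkt["obj_vaddr"]])
    return command_queue, object_queue    # microprocessor notified: ready

cmds, objs = hardware_prefetch(["vaddr_pkt"])
```

By the time the microprocessor is notified, both queues are populated, so parsing a dispatch packet never blocks on an address translation or a memory read.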
Referring to fig. 3, a flowchart of a packet processing method based on a heterogeneous structure system according to an embodiment of the present invention is shown, where the packet processing method is applied to a command control unit of an asynchronous device, where the command control unit includes a microprocessor and a hardware processing module, and the method includes the following steps:
Step S001: after a read command packet is returned, the returned command packet is added to a command queue of the reordering cache unit; the reordering cache unit parses the command packet and, when the command packet is a dispatch packet, sends the virtual address of the process object carried in the dispatch packet to the pre-fetching scheduling module. The pre-fetching scheduling module sends the virtual address of the process object to the address translation system to obtain the physical address of the process object and then fetches the process object. The reordering cache unit adds the acquired process object to its process object queue and informs the microprocessor that the command packet is ready.
The AQL packets in an HSA heterogeneous system take various forms, such as dispatch packets, custom packets, and synchronization packets. The tasks issued by the CPU are mainly completed by dispatch packets, which account for the largest proportion of the command packets in a command queue; among the command packets, only a dispatch packet includes the virtual address of a process object, and the actual process object is stored in the device memory or the system main memory.
In the conventional packet processing method, the microprocessor must wait for the return of a request to read the process object, and that read requires a virtual-to-physical address translation followed by a long read from the device memory through the multi-level cache. In the embodiment of the invention, by contrast, the parsing of the dispatch packet and the acquisition of the process object are handled by the hardware processing module, so that while the microprocessor is dispatching the previous command packet, the hardware processing module can parse the next dispatch packet and acquire its process object in advance. After the microprocessor finishes dispatching the previous dispatch packet, it can directly parse the next dispatch packet, whose process object the hardware processing module has already obtained, and perform a new dispatch. Because the packets and process objects are pre-fetched by the hardware processing module, the microprocessor does not need to wait for a process object and can process dispatch packets directly. This greatly improves the efficiency with which the computing execution units complete small-task dispatch packets, fully exploits the speed of the hardware processing module at parsing and processing packets, and greatly shortens the processing time required for packet parsing and packet processing.
Referring to fig. 6, the process of reading a command packet is almost the same as the process of reading a process object: the virtual address of the packet or process object is translated into a physical address, and the packet or process object is then read according to that physical address. Translating the virtual address into the physical address comprises the following steps: the generated virtual address is looked up in the virtual address translation first-level cache, and if the match succeeds, the physical address is returned; if the match fails, the virtual address translation second-level cache is queried, and if that match also fails, the request is forwarded to the address management unit on the CPU side, which yields the final physical address. Reading a packet or process object proceeds as follows: after the physical address is obtained, the device memory is accessed through the multi-level cache to read and return the process object, or the system main memory is accessed through the multi-level cache to read and return the packet.
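The three-level translation path just described can be sketched as a chain of lookups. The cache contents, the fill policy on a miss, and the `address_management_unit` callback below are illustrative assumptions, not the actual hardware tables.

```python
def translate(vaddr, l1_tlb, l2_tlb, address_management_unit):
    """Return the physical address for vaddr, probing each level in order."""
    if vaddr in l1_tlb:                      # hit in translation first-level cache
        return l1_tlb[vaddr]
    if vaddr in l2_tlb:                      # L1 miss, hit in second-level cache
        l1_tlb[vaddr] = l2_tlb[vaddr]        # fill L1 so later lookups hit faster
        return l2_tlb[vaddr]
    paddr = address_management_unit(vaddr)   # final authority on the CPU side
    l2_tlb[vaddr] = paddr                    # fill both cache levels on the way back
    l1_tlb[vaddr] = paddr
    return paddr
```

A miss at every level is the slow path described above: it must reach the address management unit on the CPU side, which is why translations for different packets can return out of order.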
The reordering cache unit comprises a plurality of command queues and process object queues; each command queue comprises a plurality of command packets, and each process object queue comprises a plurality of process objects. Each queue follows the first-in first-out principle: the microprocessor reads the command packets in a command queue in their natural order, and when a command packet is a dispatch packet, reads the process objects in the process object queue in the same natural order. For example, the command queue is filled in the order in which the packet virtual addresses are generated. If the virtual addresses are generated in the order packet 1 virtual address, packet 2 virtual address, packet 3 virtual address, ..., packet n virtual address, then the command queue is: packet 1, packet 2, packet 3, ..., packet n. Correspondingly, if packet 1, packet 3, ..., packet m are dispatch packets, where m is less than n, then the process object queue holds the process objects of packet 1, packet 3, ..., packet m, and the order of the process objects in the process object queue is the same as the order of the dispatch packets in the command queue. Even if the packets and the process objects are returned out of order, the reordering cache unit adds them to the corresponding queues according to the natural order.
After the dispatch package and the corresponding process objects are respectively placed into the corresponding queues, the microprocessor is informed that the package is ready, so that the occupied time of the microprocessor can be reduced.
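The reordering behavior can be illustrated with a small slot buffer: returns are placed into the slot matching their issue order, and the consumer only drains from the head, so out-of-order completions never reach the microprocessor out of sequence. The class and method names are hypothetical; this is a minimal sketch of the ordering discipline, not of the hardware buffer.

```python
class ReorderBuffer:
    """Slots indexed by issue (natural) order. Out-of-order returns land
    in their own slot; the consumer pops strictly in natural order."""

    def __init__(self, depth):
        self.slots = [None] * depth   # one slot per outstanding request
        self.head = 0                 # next slot the consumer may take

    def complete(self, seq, packet):
        """A read has returned, possibly out of order."""
        self.slots[seq] = packet

    def pop_ready(self):
        """Return the contiguous run of ready packets, in natural order."""
        out = []
        while self.head < len(self.slots) and self.slots[self.head] is not None:
            out.append(self.slots[self.head])
            self.head += 1
        return out
```

If packet 2 returns before packet 1, `pop_ready` yields nothing until packet 1 arrives, matching the natural-order guarantee described above.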
Step S002, the microprocessor analyzes the command packet, and analyzes the corresponding process object when the command packet is a dispatch packet, wherein each process comprises a plurality of working groups; the microprocessor configures a register for the work group dispatching unit, the work group dispatching unit divides the process into a plurality of work groups and then sends the work groups to the idle calculation execution unit, and the calculation execution unit reads corresponding process instructions and data according to the register and then executes the process instructions.
Specifically, because the command packets sent by the CPU come in multiple formats, and the packet header in each command packet carries the format definition, the process object in the process object queue needs to be fetched only when the command packet is parsed as a dispatch packet; otherwise, no process object needs to be taken out. Therefore, each time the microprocessor reads a command packet, it judges whether the packet is a dispatch packet; if so, the corresponding process object is taken out of the process object queue, and both the process object queue and the command queue are filled and drained in order. At this point the microprocessor can acquire the process object at the same time as it acquires the corresponding dispatch packet, so the address translation and reading of the process object no longer occupy the microprocessor, and the idle waiting caused by the microprocessor being occupied is eliminated. Moreover, because the microprocessor is realized by software programming, although it runs fast, the latency of its read-write interaction with external hardware is large, whereas the hardware processing module is a hardware circuit whose processing speed is far higher than that of software. Completing the pre-fetching of packets and of the process objects of dispatch packets in hardware in advance means the microprocessor software does not have to wait for the process object and can process the dispatch packet directly, which greatly improves the execution efficiency of the computing execution units on small-task dispatch packets.
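One consumption step of the microprocessor can be sketched as follows. The `type` field standing in for the packet-header format definition, and the queue contents, are hypothetical; the point is that only a dispatch packet pops a process object, and both queues are drained in the same FIFO order.

```python
from collections import deque

def consume(command_queue, process_object_queue):
    """One microprocessor step: pop the next command packet in FIFO order
    and, only if the header marks it as a dispatch packet, pop the
    matching process object (queued in the same relative order)."""
    pkt = command_queue.popleft()
    if pkt["type"] == "dispatch":
        return pkt, process_object_queue.popleft()
    return pkt, None                 # non-dispatch packets carry no object
```

Because both queues preserve the natural order, the i-th dispatch packet always pairs with the i-th process object, with no bookkeeping needed in the microprocessor.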
Each dispatch packet corresponds to one process object, each process comprises a plurality of work groups, each work group comprises a plurality of waves, each wave comprises a plurality of threads, and the threads are the minimum units for task execution, so that each work group comprises a plurality of tasks. The dispatch package includes information such as standard task amount of the workgroup, task amount of the current dispatch package, and the like, in addition to the virtual address of the process object. The process object includes a start address of the process command execution, an offset address of the start of the process command execution, and configuration parameters necessary for the process execution, such as hardware resource configuration.
The step of dividing the process into a plurality of workgroups specifically comprises: obtaining, from the dispatch packet, the standard task quantity of a workgroup and the task quantity of the current dispatch packet, and grouping the tasks of the current dispatch packet according to the standard workgroup task quantity. For example, when the standard task quantity of a workgroup is A and the current task quantity is B, the tasks are divided into B/A groups.
Referring to fig. 4 and fig. 5, to further illustrate the beneficial effects of the present invention, the prior art is compared with the present solution. Fig. 4 shows the conventional processing procedure, whose difference from the present solution is this: the packet is pre-fetched before the microprocessor parses it, but when the command packet is a dispatch packet, the microprocessor itself must obtain the corresponding process object according to the virtual address of the process object carried in the dispatch packet. In fig. 4, the first line shows the processing of the packet prefetch scheduling module, the second line shows the processing of the reorder buffer unit, and the third line shows the processing of the microprocessor. Specifically, the processing of the packet prefetch scheduling module includes: generating the virtual address of packet 1, translating the virtual address of packet 1, and reading packet 1; generating the virtual address of packet 2, translating the virtual address of packet 2, and reading packet 2; and so on, up to translating the virtual address of packet n and reading packet n, all executed in sequence.
For the processing of the reordering cache unit: the latency of address translation may cause the physical address of packet 2 to return earlier than that of packet 1. For example, if a page table entry (Entry) for packet 2 exists in the virtual address translation first-level cache but the entry for packet 1 does not, the physical address of packet 2 returns quickly while packet 1 must spend a long time walking the multi-level page table, so its translation returns later than that of packet 2; the return order of virtual address translations is therefore out of order. The latency of reading a packet may likewise cause packet 2 to return earlier than packet 1: a packet may reside in the device memory or in the system main memory, and if packet 2 is in the device memory while packet 1 is in the system main memory, packet 2 returns quickly while packet 1 must access the system main memory on the CPU side, so packet 1 returns later than packet 2 and the packet return order is also out of order. The reordering cache unit, however, reorders the returned packets and adds them to the corresponding command queue in the natural order.
For the processing of the microprocessor: the microprocessor also reads the command packets in their natural order and executes them in sequence, and only when it reads a dispatch packet does it use the virtual address of the process object in the dispatch packet for address translation and reading. Assuming that task 2 of packet 1 is reading the process object, the process object of packet 1 is returned to the reordering cache unit, and the microprocessor can proceed to the next work only after acquiring the process object of packet 1. Fig. 5 shows the packet processing procedure after the improvement provided by the embodiment of the present invention. In fig. 5, the first line represents the processing of the pre-fetch scheduling module, the second line the processing of the reorder buffer unit, the third line the processing of the microprocessor, and the fourth line the processing of the execution unit. For the processing of the prefetch module: unlike fig. 4, the improved method also reads the process object of a dispatch packet in advance; that is, after packet 1 and packet 2 return to the reorder buffer unit, the reorder buffer unit parses packet 1 and packet 2 and judges whether each is a dispatch packet, and if so, the prefetch scheduling module initiates a further request to read the process object according to the virtual address of the process object carried in the dispatch packet.
For the process of the reordering buffer unit: because reading the packet and reading the process object require a certain time, the returned result sequence to the reordering buffer unit is disordered, but the reordering buffer unit puts the packet and the process object returned out of sequence into the corresponding queue according to the natural sequence for the microprocessor to read. For the processing of the microprocessor: because the package and the process object are ready, the corresponding process object can be obtained while the dispatch package is analyzed, and the task corresponding to the dispatch package can be distributed to the corresponding parallel computing execution unit for execution without waiting.
In summary, the embodiment of the present invention provides a packet processing method based on a heterogeneous structure system. Before the microprocessor parses a packet, the command packet, and the process object corresponding to a dispatch packet, are prefetched in advance: the command packet is added to a command queue and the process object is added to the corresponding process object queue, so that when the microprocessor parses a dispatch packet it can obtain the corresponding process object immediately and dispatch the corresponding task. This solves the prior art problem that, when parsing a dispatch packet, the microprocessor must wait for the translation of the process object's physical address and for the lengthy read of the process object according to that physical address, thereby improving the operating efficiency of the microprocessor.
It should be noted that: the precedence order of the above embodiments of the present invention is only for description, and does not represent the merits of the embodiments. And specific embodiments thereof have been described above. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.