Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present disclosure more apparent, the technical solutions of the present disclosure are further elaborated below in conjunction with the drawings and the embodiments. The described embodiments should not be construed as limiting the present disclosure, and all other embodiments obtained by those skilled in the art without making inventive efforts fall within the scope of protection of the present disclosure.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
The terms "first/second/third" are merely used to distinguish similar objects and do not denote a particular ordering of objects. It is to be understood that "first/second/third" may be interchanged in a particular order or sequence where permitted, so that the embodiments of the disclosure described herein can be implemented in an order other than that illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. The terminology used herein is for the purpose of describing the present disclosure only and is not intended to be limiting of the present disclosure.
The embodiment of the disclosure provides a thread bundle scheduling method. The method may be performed by a graphics processor. Fig. 1 is a schematic implementation flow chart of a thread bundle scheduling method according to an embodiment of the present disclosure. As shown in Fig. 1, the method may include the following steps S101 to S103:
step S101, obtaining current hardware resource use information of the graphics processor.
Here, the hardware resource usage information may include usage information of any suitable hardware resource in the graphics processor. The hardware resources may include, but are not limited to, at least one of memory access resources, computing resources, communication resources, and the like.
The hardware resource usage information may include, but is not limited to, at least one of a utilization rate of hardware resources, an idleness rate of hardware resources, an idle amount of hardware resources, a busyness of hardware resources, and the like.
Step S102, determining a target thread bundle from the thread bundle queue based on the hardware resource use information and the task type of each thread bundle in the thread bundle queue to be scheduled, wherein the task type of each thread bundle corresponds to the hardware resource requirement of the thread bundle.
Here, the thread bundle queue may include a plurality of thread bundles currently to be scheduled in the graphics processor.
Each thread bundle in the thread bundle queue has a corresponding task type, which may reflect the hardware resource requirement for executing the thread bundle.
In practice, one skilled in the art may determine a task type corresponding to a hardware resource requirement for a thread bundle in any suitable manner according to practical situations, and the embodiments of the present disclosure are not limited in this regard.
For example, in the case where executing a thread bundle requires relatively more memory access resources, the task type of the thread bundle may be memory access intensive, i.e., the thread bundle performs memory access intensive tasks.
For another example, in the case where executing a thread bundle requires relatively more computing resources, the task type of the thread bundle may be computation intensive, i.e., the thread bundle performs computation intensive tasks.
For another example, in the case where executing a thread bundle requires relatively more communication resources, the task type of the thread bundle may be the communication type, i.e., the thread bundle performs communication-type tasks.
In some embodiments, the task type of each thread bundle may be manually set by a developer in advance according to the hardware resource requirements of the thread bundle. For example, when a task execution instruction is triggered, a task type of each thread bundle corresponding to the task execution instruction may be set in a software interface or a function that generates the task execution instruction.
In some implementations, the task type of each thread bundle may be determined by the graphics processor after analyzing the hardware resource requirements of each thread bundle.
The target thread bundle may be determined from the thread bundle queue after comprehensively considering the hardware resource requirements of the thread bundle and the current hardware resource usage of the graphics processor.
Step S103, scheduling and executing the target thread bundle.
Here, after the target thread bundle is determined, the target thread bundle may be scheduled to allocate corresponding hardware resources to execute the target thread bundle.
In the embodiment of the disclosure, current hardware resource usage information of a graphics processor is acquired; a target thread bundle is determined from a thread bundle queue based on the hardware resource usage information and a task type of each thread bundle in the thread bundle queue to be scheduled, where the task type of each thread bundle corresponds to the hardware resource requirement of the thread bundle; and the target thread bundle is scheduled and executed. In this way, by comprehensively considering the hardware resource requirements of the thread bundles and the current hardware resource usage of the graphics processor, the hardware resources in the graphics processor can be utilized more fully and reasonably, the situation that a scheduled target thread bundle delays execution while waiting for resources to be released is reduced, and the processing performance of the graphics processor and the overall execution efficiency of each thread bundle can be further improved. In some embodiments, the task type is obtained through a software interface, the current hardware resource usage information of the graphics processor is obtained through hardware monitoring, and the target thread bundle is scheduled based on both of these, so that more reasonable scheduling is achieved.
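The overall flow of steps S101 to S103 can be illustrated with a minimal Python sketch. All names here (`pick_target_warp`, the dictionary-based thread bundle records, the resource names) are hypothetical illustrations; the disclosure does not fix a concrete data structure or API.

```python
# Hypothetical sketch: S101 yields per-resource utilization; S102 picks a
# thread bundle whose task type matches the idlest resource; S103 would then
# schedule and execute it (not modeled here).

def pick_target_warp(warp_queue, hw_usage):
    """Step S102: pick a thread bundle matching the idlest hardware resource."""
    if not warp_queue:
        return None
    # Step S101 result: hw_usage maps resource name -> utilization in [0, 1].
    idlest = min(hw_usage, key=hw_usage.get)
    for warp in warp_queue:            # queue order acts as the tie-breaker
        if warp["task_type"] == idlest:
            return warp
    return warp_queue[0]               # fall back to the head of the queue

usage = {"memory_access": 0.8, "compute": 0.3, "communication": 0.6}
queue = [{"id": 0, "task_type": "memory_access"},
         {"id": 1, "task_type": "compute"}]
target = pick_target_warp(queue, usage)   # compute is the idlest resource
```

Under these assumed figures the compute resource has the lowest utilization, so the compute-type thread bundle is selected rather than the one at the head of the queue.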
In some embodiments, the step S102 may include the following steps S111 and S112:
step S111, determining a target task type to be scheduled preferentially from a plurality of task types based on the hardware resource usage information.
Here, based on the hardware resource usage information of the graphics processor, a relatively idle hardware resource in the graphics processor may be determined, so that a target task type to be preferentially scheduled may be determined from a plurality of task types according to the relatively idle hardware resource.
In some embodiments, in the case that the utilization of a plurality of hardware resources is included in the hardware resource usage information, the target task type to be preferentially scheduled may be determined based on at least one hardware resource whose utilization is lower than the first threshold. The first threshold may be predetermined by a person skilled in the art according to the actual situation, which is not limited by the embodiment of the present disclosure.
In some embodiments, in the case that the hardware resource usage information includes the idle amounts of a plurality of hardware resources, the target task type to be preferentially scheduled may be determined based on at least one hardware resource whose idle amount is higher than the second threshold. The second threshold may be predetermined by a person skilled in the art according to the actual situation, and the embodiment of the present disclosure is not limited thereto.
Step S112, determining a target thread bundle from at least one candidate thread bundle in the thread bundle queue, where a task type of the candidate thread bundle is the target task type.
Here, the manner of determining the target thread bundle from the at least one candidate thread bundle may be determined according to an actual situation, which is not limited by the embodiment of the present disclosure.
In some implementations, the candidate thread bundles that are ranked first in the thread bundle queue may be determined as target thread bundles in the order of the candidate thread bundles in the thread bundle queue.
In some implementations, the candidate thread bundle that consumes the least amount of hardware resources among the candidate thread bundles may be determined as the target thread bundle.
In the above embodiment, based on the hardware resource usage information, the target task type to be preferentially scheduled is determined from a plurality of task types, and the target thread bundle is determined from at least one candidate thread bundle in the thread bundle queue, where the task type of the candidate thread bundle is the target task type. In this way, the target task type to be preferentially scheduled can be simply and quickly determined.
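The two candidate-selection policies described above (queue order, and least hardware resource consumption) can be sketched as follows. The function name, record fields, and policy labels are assumptions for illustration only.

```python
# Hypothetical sketch of step S112: among candidates whose task type is the
# target type, pick either the first-ranked one or the least-consuming one.

def select_target_warp(queue, target_type, policy="queue_order"):
    candidates = [w for w in queue if w["task_type"] == target_type]
    if not candidates:
        return None
    if policy == "queue_order":
        return candidates[0]           # first-ranked candidate in the queue
    # Otherwise: the candidate consuming the least hardware resources.
    return min(candidates, key=lambda w: w["resource_cost"])

queue = [{"id": 0, "task_type": "compute", "resource_cost": 5},
         {"id": 1, "task_type": "memory_access", "resource_cost": 2},
         {"id": 2, "task_type": "compute", "resource_cost": 1}]
first = select_target_warp(queue, "compute")                          # id 0
cheapest = select_target_warp(queue, "compute", policy="least_cost")  # id 2
```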
In some embodiments, the hardware resource usage information includes utilization of a plurality of hardware resources;
The step S111 may include the following steps S121 and S122:
step S121, determining a target hardware resource with the lowest utilization rate from the plurality of hardware resources.
Step S122, determining a target task type to be scheduled preferentially from a plurality of task types based on the target hardware resource, where a hardware resource requirement corresponding to the target task type is matched with the target hardware resource.
It can be understood that the target hardware resource is the hardware resource with the lowest utilization rate, i.e., the highest idle degree, among the various hardware resources. Therefore, based on the target hardware resource, the task type whose hardware resource requirement matches the target hardware resource can be selected from the various task types as the target task type to be preferentially scheduled.
Therefore, the determined target thread bundles are more adaptive to the current use condition of hardware resources of the graphic processor, so that the hardware resources in the graphic processor can be more reasonably utilized, and the processing performance of the graphic processor and the overall execution efficiency of each thread bundle are further improved.
In some embodiments, the plurality of hardware resources includes memory access resources, computing resources, and communication resources, and the plurality of task types includes memory access intensive, computation intensive, and communication types.
The above step S122 may include at least one of the following steps S131 to S133:
In step S131, in the case that the target hardware resource is the memory access resource, it is determined that the target task type to be preferentially scheduled is memory access intensive.
In this way, the target hardware resource being the memory access resource indicates that the utilization rate of the memory access resource in the graphics processor is currently the lowest. By determining the target task type to be preferentially scheduled as memory access intensive, the graphics processor can preferentially schedule memory access intensive thread bundles, which have a larger demand for memory access resources, so that the utilization rate of the memory access resources can be improved. In addition, because the demand of memory access intensive thread bundles on hardware resources other than memory access resources is relatively small, the situation that the scheduled target thread bundle delays execution while waiting for hardware resources other than memory access resources to be released can be reduced, further improving the processing performance of the graphics processor.
In step S132, in the case that the target hardware resource is the operation resource, it is determined that the target task type to be preferentially scheduled is computationally intensive.
In this way, the target hardware resource being the operation resource indicates that the utilization rate of the operation resource in the graphics processor is currently the lowest. By determining the target task type to be preferentially scheduled as computation intensive, the graphics processor can preferentially schedule computation intensive thread bundles, which have a larger demand for operation resources, so that the utilization rate of the operation resources can be improved. In addition, because the demand of computation intensive thread bundles on hardware resources other than operation resources is relatively small, the situation that the scheduled target thread bundle delays execution while waiting for hardware resources other than operation resources to be released can be reduced, further improving the processing performance of the graphics processor.
Step S133, determining that the target task type to be scheduled preferentially is a communication type in the case that the target hardware resource is the communication resource.
In this way, the target hardware resource being the communication resource indicates that the utilization rate of the communication resource in the graphics processor is currently the lowest. By determining the target task type to be preferentially scheduled as the communication type, the graphics processor can preferentially schedule communication-type thread bundles, which have a larger demand for communication resources, so that the utilization rate of the communication resources can be improved. In addition, because the demand of communication-type thread bundles on hardware resources other than communication resources is relatively small, the situation that the scheduled target thread bundle delays execution while waiting for hardware resources other than communication resources to be released can be reduced, further improving the processing performance of the graphics processor.
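Steps S121 and S122, together with the three cases S131 to S133, amount to a fixed mapping from the lowest-utilization resource to a task type. A minimal sketch, with assumed resource and type names:

```python
# Hypothetical sketch of steps S131-S133: the idlest hardware resource
# determines the target task type to be preferentially scheduled.

RESOURCE_TO_TASK_TYPE = {
    "memory_access": "memory_access_intensive",   # step S131
    "compute": "compute_intensive",               # step S132
    "communication": "communication",             # step S133
}

def target_task_type(utilizations):
    """Steps S121-S122: pick the lowest-utilization resource, map to a type."""
    target_resource = min(utilizations, key=utilizations.get)
    return RESOURCE_TO_TASK_TYPE[target_resource]

t = target_task_type({"memory_access": 0.2, "compute": 0.9,
                      "communication": 0.5})
```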
In some embodiments, the step S101 may include at least one of the following steps S141 to S143:
Step S141, in the case where the plurality of hardware resources include a memory access resource, determining the utilization rate of the memory access resource based on the maximum memory access bandwidth supported by the graphics processor and the current memory access bandwidth usage of the graphics processor.
Here, the maximum memory access bandwidth refers to a maximum memory access bandwidth supported by a memory in the graphics processor, where the memory may be a main memory and/or other storage modules in the graphics processor, which is not limited by the embodiments of the present disclosure.
The maximum memory bandwidth characterizes a maximum number of memory request bytes receivable within a single clock cycle. For example, the maximum memory bandwidth may be a theoretical bandwidth of the memory, which may be an inherent hardware parameter of the memory.
The current access bandwidth usage of the graphics processor may refer to the number of actually received access request bytes in the current clock cycle, or the number of access request bytes received in each clock cycle in a plurality of clock cycles before the current clock cycle, which is not limited by the embodiment of the present disclosure.
In some embodiments, a ratio between a current memory access bandwidth usage of the graphics processor and a maximum memory access bandwidth supported by the graphics processor may be determined as a utilization of a current memory access resource of the graphics processor.
Thus, the current utilization rate of the memory access resource of the graphic processor can be rapidly and accurately determined.
In step S142, in the case that the plurality of hardware resources include an operation resource, the utilization rate of the operation resource is determined based on the maximum operation amount supported by the graphics processor in a single clock cycle and the actual operation amount of the graphics processor in the current clock cycle.
Here, the maximum amount of operation supported by the graphics processor in a single clock cycle refers to the maximum amount of operation (i.e., the number of operations) that an operation unit in the graphics processor can provide in a single clock cycle.
In some embodiments, the maximum operation amount may be a hardware design parameter of the graphics processor.
The actual operation amount of the graphics processor in the current clock cycle may refer to the number of operations actually performed by the operation unit in the graphics processor in the current clock cycle, or may be the number of operations performed by the operation unit in the graphics processor in each clock cycle in a plurality of clock cycles before the current clock cycle, which is not limited by the embodiment of the present disclosure.
In some embodiments, the ratio between the actual operation amount of the graphics processor in the current clock cycle and the maximum operation amount supported by the graphics processor in a single clock cycle may be determined as the utilization of the graphics processor's current computing resources.
Thus, the current utilization rate of the computing resources of the graphics processor can be rapidly and accurately determined.
Step S143, in the case that the plurality of hardware resources include communication resources, determining a utilization rate of the communication resources based on a maximum communication bandwidth supported by the graphics processor and a current communication bandwidth usage amount of the graphics processor.
Here, the maximum communication bandwidth refers to the maximum communication bandwidth supported by a communication unit in the graphics processor, where the communication unit may include, but is not limited to, at least one of a Peripheral Component Interconnect Express (PCIe) unit, an NVLink (a bus and communication protocol) based communication unit, and the like, and the embodiments of the present disclosure are not limited thereto.
The maximum communication bandwidth characterizes a maximum number of communication request bytes that a communication unit in the graphics processor can receive in a single clock cycle. For example, the maximum communication bandwidth may be a theoretical bandwidth of the communication unit, which may be a hardware design parameter of the communication unit in the graphics processor.
The current communication bandwidth usage of the graphics processor may refer to the number of communication request bytes actually received by the communication unit in the current clock cycle, or may be the number of communication request bytes received by the communication unit in each clock cycle in a plurality of clock cycles before the current clock cycle, which is not limited by the embodiment of the present disclosure.
In some implementations, communication resources in the graphics processor are used to support communication between the graphics processor and other processors. The other processors may include, but are not limited to, at least one of other graphics processors, central processing units (CPUs), and the like.
In some embodiments, the ratio between the current communication bandwidth usage of the graphics processor and the maximum communication bandwidth supported by the graphics processor may be determined as the utilization of the current communication resources of the graphics processor.
Thus, the current utilization rate of the communication resources of the graphics processor can be rapidly and accurately determined.
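Steps S141 to S143 all compute the same kind of ratio: current usage divided by the per-cycle hardware maximum. A minimal sketch with assumed example figures (the numbers are illustrative, not real hardware parameters):

```python
# Hypothetical sketch of steps S141-S143: utilization of each resource is
# the ratio of current usage to the maximum supported per clock cycle.

def utilization(current_usage, maximum_per_cycle):
    return current_usage / maximum_per_cycle

mem_util = utilization(512, 1024)   # bytes of memory requests per cycle
comp_util = utilization(96, 128)    # operations per cycle
comm_util = utilization(16, 64)     # bytes of communication requests per cycle
```

With these assumed figures the communication resource is the idlest, so by steps S121 and S122 the communication type would be preferentially scheduled.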
In some embodiments, the above method may further include the following step S151 and step S152:
Step S151, receiving a task execution instruction, where the task execution instruction includes task information of a task to be executed, and the task information includes task execution parameters and a preset task type.
Here, the task execution instruction may be generated by a host processor (e.g., a CPU or another GPU) by calling an interface or function in the program software during execution of the task processing program, and transmitted to the graphics processor.
In practice, the host processor may generate a task execution instruction by invoking any suitable interface or function and send the task execution instruction to the graphics processor, and the embodiments of the present disclosure are not limited in this regard. For example, a task execution instruction may be generated and sent to a graphics processor by invoking a kernel function (GPU kernel) of the graphics processor under the Compute Unified Device Architecture (CUDA).
The task execution parameters may include any suitable parameters required for executing the task, and those skilled in the art may set them reasonably according to the task actually executed.
The preset task type is used for representing the task type of the task to be executed, and corresponds to the hardware resource requirement of the task to be executed.
In some embodiments, the preset task type of the task to be performed may be preset manually by a developer according to the hardware resource requirement of the task to be performed. For example, when a software interface or a function of a task execution instruction corresponding to a task to be executed is written, a task type of the task to be executed may be set in the software interface or the function.
In some implementations, the task type of task to be performed may be determined by analyzing the hardware resource requirements of each thread bundle.
In some embodiments, the preset task type may be represented by a preset type value. Wherein, corresponding type values can be set for various task types in advance. For example, the computation intensive may be denoted by "0", the memory intensive by "1", and the communication type by "2".
Step S152, based on the task execution parameters, creating at least one new thread bundle, and adding the at least one new thread bundle into the thread bundle queue, wherein the task type of each new thread bundle is the preset task type.
Here, at least one new thread bundle for executing the task to be executed may be created based on task execution parameters of the task to be executed, so that task contents of the task to be executed are divided into each new thread bundle for execution. And, the preset task type of the task to be executed can be used as the task type of each new thread bundle.
In practice, one skilled in the art may create at least one newly created thread bundle based on task execution parameters in any suitable manner, as the case may be, and the embodiments of the present disclosure are not limited in this regard.
In the above embodiment, a task execution instruction is received, where the task execution instruction includes task information of a task to be executed, and the task information includes task execution parameters and a preset task type; at least one new thread bundle is created based on the task execution parameters, and the at least one new thread bundle is added into the thread bundle queue, where the task type of each new thread bundle is the preset task type. In this way, the task type corresponding to the at least one new thread bundle for the task to be executed can be designated by carrying the preset task type of the task to be executed in the task execution instruction, so that the task type of each thread bundle in the thread bundle queue can be set more accurately and flexibly.
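Steps S151 and S152 can be sketched as follows. The instruction format (a plain dictionary), the helper names, and the number of thread bundles per task are all assumptions; the disclosure only requires that the instruction carry task execution parameters and a preset task type.

```python
# Hypothetical sketch of steps S151-S152: a host-side instruction carries the
# parameters and a preset task type value; the receiver creates new thread
# bundles, each tagged with that preset task type.

TASK_TYPE_VALUES = {"compute_intensive": 0,
                    "memory_access_intensive": 1,
                    "communication": 2}

def make_task_instruction(params, task_type):
    """Host side: bundle the execution parameters with the type value."""
    return {"params": params, "type_value": TASK_TYPE_VALUES[task_type]}

def create_warps(instruction, warp_queue, warps_per_task=4):
    """Step S152: create new thread bundles tagged with the preset type."""
    for i in range(warps_per_task):
        warp_queue.append({"warp_id": i,
                           "type_value": instruction["type_value"],
                           "params": instruction["params"]})

queue = []
instr = make_task_instruction({"n": 1024}, "memory_access_intensive")
create_warps(instr, queue)
```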
In some embodiments, the step S151 may include the following step S161:
Step S161, receiving the task execution instruction sent by the main processor, wherein the task execution instruction is generated by the main processor based on the task execution parameter and the preset task type, and the preset task type is determined by the main processor based on the hardware resource requirement of the task to be executed.
Here, the main processor may determine a preset task type based on a hardware resource requirement of a task to be executed, generate a task execution instruction based on a task execution parameter of the task to be executed and the preset task type, and then issue the task execution instruction to the graphic processor.
In some embodiments, the main processor may generate a task execution instruction based on the task execution parameter of the task to be executed and the preset task type by calling a preset software interface, and send the task execution instruction to the graphics processor. The graphic processor can receive a task execution instruction sent by the main processor through the preset software interface. For example, the preset software interface may include, but is not limited to, a kernel function of the graphics processor.
In this way, the main processor can flexibly set a proper task type for the task to be executed according to the hardware resource requirement of the task to be executed, and generate a corresponding task execution instruction, so that the corresponding task type can be conveniently and quickly designated for at least one newly-built thread bundle corresponding to the task to be executed in a software mode.
Creating at least one new thread bundle based on the task execution parameters in the above step S152 may include the following steps S162 and S163:
Step S162, determining task type codes based on the preset task types.
In some embodiments, after receiving the task execution instruction, the graphics processor may parse the task execution instruction to obtain a task execution parameter and a preset task type of a task to be executed included in the task execution instruction.
After the preset task type is obtained, the graphic processor can perform coding processing on the preset task type to obtain task type codes. The graphics processor may perform the encoding processing on the preset task type by using any suitable encoding manner, which is not limited in the embodiments of the present disclosure.
In some embodiments, different codes may be preset for each candidate task type in the set of candidate task types, and after obtaining the preset task type, the graphics processor may use the code corresponding to the preset task type as the task type code. For example, in the case where the candidate task types are the computation intensive, memory access intensive, and communication types, the code "00" may be preset for the computation intensive type, "01" for the memory access intensive type, and "10" for the communication type.
In some embodiments, corresponding type values may be set for multiple task types in advance, and the type value corresponding to the task type may be converted into a binary code, so as to obtain a task type code corresponding to the task type. For example, the type value 0 may be used to represent the computation-intensive type, the type value 1 may be used to represent the memory-intensive type, and the type value 2 may be used to represent the communication type, so that the task type corresponding to the computation-intensive type is encoded as "00", the task type corresponding to the memory-intensive type is encoded as "01", and the task type corresponding to the communication type is encoded as "10".
Step S163, creating at least one new thread bundle based on the task execution parameters, and associating the at least one new thread bundle with the task type codes, respectively, where the task type codes are used to characterize that the task type of the new thread bundle is the preset task type.
Here, after the graphics processor creates at least one new thread bundle for executing the task to be executed based on the task execution parameters, each new thread bundle may be associated with the task type code. In this way, the task type code associated with each thread bundle can be referred to in the process of determining and scheduling the target thread bundle from the thread bundle queue, and the task type of each thread bundle can be quickly and intuitively determined according to the task type code.
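Steps S162 and S163 can be sketched as a two-bit encoding of the preset type value (0/1/2, as in the examples above) attached to each new thread bundle. The function names and record fields are illustrative assumptions.

```python
# Hypothetical sketch of steps S162-S163: convert the preset type value to
# its binary task type code, then associate that code with each new warp.

def task_type_code(type_value, width=2):
    """Step S162: e.g. 0 -> "00", 1 -> "01", 2 -> "10"."""
    return format(type_value, f"0{width}b")

def tag_warps(warps, type_value):
    """Step S163: associate each new thread bundle with the type code."""
    code = task_type_code(type_value)
    for w in warps:
        w["task_type_code"] = code
    return warps

warps = tag_warps([{"warp_id": 0}, {"warp_id": 1}], type_value=2)
```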
Embodiments of the present disclosure provide a thread bundle scheduling apparatus that may be applied in a graphics processor. Fig. 2 is a schematic structural diagram of a thread bundle scheduling apparatus according to an embodiment of the present disclosure. As shown in Fig. 2, the thread bundle scheduling apparatus 200 includes a resource monitoring module 210 and a scheduling execution module 220, where:
a resource monitoring module 210, configured to obtain current hardware resource usage information of the graphics processor;
The scheduling execution module 220 is configured to determine a target thread bundle from the thread bundle queue based on the hardware resource usage information and a task type of each thread bundle in the thread bundle queue to be scheduled, where the task type of each thread bundle corresponds to a hardware resource requirement of the thread bundle, and schedule the execution of the target thread bundle.
In the embodiment of the disclosure, by comprehensively considering the hardware resource requirements of the thread bundles and the current hardware resource usage of the graphics processor when scheduling the thread bundles, the hardware resources in the graphics processor can be utilized more fully and reasonably, the situation that the scheduled target thread bundle delays execution while waiting for resources to be released is reduced, and the processing performance of the graphics processor and the overall execution efficiency of each thread bundle can be further improved.
In some embodiments, the scheduling execution module 220 is further configured to determine a target task type to be preferentially scheduled from a plurality of task types based on the hardware resource usage information, and determine a target thread bundle from at least one candidate thread bundle in the thread bundle queue, where the task type of the candidate thread bundle is the target task type.
In some embodiments, the hardware resource usage information includes utilization rates of multiple hardware resources, the scheduling execution module 220 is further configured to determine a target hardware resource with a lowest utilization rate of the multiple hardware resources, and determine a target task type to be scheduled preferentially from multiple task types based on the target hardware resource, where a hardware resource requirement corresponding to the target task type matches the target hardware resource.
In some embodiments, the plurality of hardware resources include memory access resources, operation resources and communication resources, and the plurality of task types include memory access intensive, computation intensive and communication types. The scheduling execution module 220 is further configured to determine that the target task type to be preferentially scheduled is memory access intensive if the target hardware resource is the memory access resource, determine that the target task type to be preferentially scheduled is computation intensive if the target hardware resource is the operation resource, and determine that the target task type to be preferentially scheduled is the communication type if the target hardware resource is the communication resource.
In some embodiments, the resource monitoring module 210 is further configured to: in the case where the plurality of hardware resources includes a memory access resource, determine the utilization of the memory access resource based on the maximum memory bandwidth supported by the graphics processor and the current memory bandwidth usage of the graphics processor; in the case where the plurality of hardware resources includes an operation resource, determine the utilization of the operation resource based on the maximum number of operations supported by the graphics processor in a single clock cycle and the actual number of operations performed by the graphics processor in the current clock cycle; and in the case where the plurality of hardware resources includes a communication resource, determine the utilization of the communication resource based on the maximum communication bandwidth supported by the graphics processor and the current communication bandwidth usage of the graphics processor.
In some embodiments, as shown in fig. 3, the thread bundle scheduling apparatus 200 further includes a thread bundle creation module 230 configured to receive a task execution instruction, where the task execution instruction includes task information of a task to be executed, the task information includes a task execution parameter and a preset task type, create at least one new thread bundle based on the task execution parameter, and add the at least one new thread bundle to the thread bundle queue, where a task type of each new thread bundle is the preset task type.
Embodiments of the present disclosure provide a graphics processor. As shown in fig. 4, the graphics processor 400 includes the thread bundle scheduler 200 described above.
Embodiments of the present disclosure provide a computer device. As shown in fig. 5, the computer device 500 includes the graphics processor 400 described above.
The application of the thread bundle scheduling method provided by the embodiment of the disclosure in an actual scene is described below.
GPUs are widely used for processing multithreaded tasks and execute tasks at the granularity of thread bundles, each thread bundle typically containing several threads. When many thread bundles run in parallel, the GPU needs to decide, according to some scheduling policy, which thread bundles can be executed. In the related art, the scheduling method for thread bundles in a GPU is round-robin scheduling.
However, the workloads (i.e., the hardware resources required for execution) of different thread bundles may differ, and the hardware resources in the GPU (such as arithmetic units or memory access bandwidth) are limited, so the round-robin scheduling manner may cause some hardware resources to be over-occupied, such that the scheduled thread bundles cannot be executed effectively because they must wait for hardware resources to be released.
In the thread bundle scheduling method provided in the embodiments of the present disclosure, the workload type of each thread bundle in the GPU (corresponding to the task type in the foregoing embodiments) may be provided through a software interface, such as compute-intensive, memory-access-intensive, or inter-card communication type (corresponding to the communication type in the foregoing embodiments). When the hardware schedules the thread bundles, it combines the workload type of each thread bundle in the thread bundle queue to be scheduled with the current hardware resource occupation in the GPU (such as the main memory bandwidth utilization, the operation resource utilization, and the communication resource utilization) to perform more effective scheduling, so that the hardware resources are better utilized.
Fig. 6 is a schematic diagram of an implementation architecture of a thread bundle scheduling method according to an embodiment of the present disclosure. As shown in fig. 6, the method may be implemented by a software layer 61 in conjunction with hardware 62. In the software layer 61, based on knowledge of the task to be executed, when the GPU kernel is launched to execute the task to be executed, a preset task type KERNEL_TYPE of the task to be executed is specified through a software interface to indicate the task's requirements for different hardware resources in the GPU. The preset task type KERNEL_TYPE may be, for example, compute-intensive, memory-access-intensive, or inter-card communication type. Taking a CUDA call of a GPU kernel (kernel function) as an example to generate a task execution instruction, the called software interface may be:
gpu_kernel<<<BLOCK_NUMBER,BLOCK_SIZE,KERNEL_TYPE>>>(ARG_A,ARG_B,ARG_C);
Here, BLOCK_NUMBER represents the number of thread blocks executing the corresponding task to be executed, BLOCK_SIZE represents the size of each thread block, KERNEL_TYPE represents the preset task type, and ARG_A, ARG_B, and ARG_C are task execution parameters of the task to be executed. When KERNEL_TYPE is 0, the task to be executed is defined as a compute-intensive task; when KERNEL_TYPE is 1, the task to be executed is defined as a memory-access-intensive task; and when KERNEL_TYPE is 2, the task to be executed is defined as an inter-card communication task used for inter-card communication.
With continued reference to fig. 6, in the hardware 62 of the GPU, according to the preset task type KERNEL_TYPE passed in by the software, the hardware 62 may encode the preset task type KERNEL_TYPE to obtain the task type of each thread bundle corresponding to the task to be executed. Taking the three currently supported task types (i.e., compute-intensive, memory-access-intensive, and inter-card communication type) as an example, the hardware 62 only needs a 2-bit encoding, e.g., encoding "00" for compute-intensive, "01" for memory-access-intensive, and "10" for inter-card communication type. After the task to be executed corresponding to the GPU kernel is divided into a plurality of thread bundles (corresponding to the newly created thread bundles in the foregoing embodiment), each divided thread bundle carries the 2-bit code as its task type (such as task types Type0, Type1, ..., TypeN) for the scheduler to recognize, and each divided thread bundle is added to the thread bundle queue Warps_array to be scheduled, which records the task type of each thread bundle (such as thread bundles Warp_0, Warp_1, ..., Warp_N).
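The encoding and queueing steps above can be modeled as follows. This is a hypothetical software sketch of logic that, per the disclosure, is performed in hardware; the names and the list-of-dicts representation of Warps_array are illustrative only.

```python
# 2-bit encoding of the preset task type KERNEL_TYPE, as described above.
TYPE_ENCODING = {
    0: 0b00,  # KERNEL_TYPE 0: compute-intensive
    1: 0b01,  # KERNEL_TYPE 1: memory-access-intensive
    2: 0b10,  # KERNEL_TYPE 2: inter-card communication type
}


def create_warps(kernel_type, num_warps):
    """Divide a task into thread bundles, attach the 2-bit task-type code to
    each bundle so the scheduler can recognize it, and return the entries to
    append to the thread bundle queue (Warps_array)."""
    code = TYPE_ENCODING[kernel_type]
    return [{"warp_id": i, "task_type": code} for i in range(num_warps)]
```

For instance, launching a memory-access-intensive kernel (KERNEL_TYPE = 1) that is divided into three thread bundles yields three queue entries, each tagged with the code "01".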
The hardware 62 of the GPU may include a scheduler 621, a selector 622, a hardware monitor 623, and hardware resources for performing tasks, including an arithmetic unit 624, a memory unit 625, and a communication unit 626.
The hardware monitor 623 may be used to monitor how busy the different hardware resources in the GPU are. For memory access resources, the main memory bandwidth may be monitored, for example, the number of request bytes received within a fixed number of clock cycles (cycles), which is then divided by the theoretical maximum number of request bytes that could be received to obtain the memory access bandwidth utilization. For example, if the theoretical bandwidth of the main memory is 10 bytes/cycle (an inherent hardware attribute, and therefore a constant for a given piece of hardware) and a total of 500 bytes of request data are received in 1000 cycles, the hardware can calculate the memory access bandwidth utilization as 500/(10×1000)=5%. Similarly, the corresponding utilization can be calculated for the arithmetic unit and the communication unit. If a PCIE unit serves as the communication unit, the utilization of the communication module is the bandwidth utilization of PCIE (the communication unit is not limited to PCIE here). An arithmetic unit, such as an ALU, likewise has an index of the number of operations it can provide per unit time. In general, the hardware monitor 623 may count the actual workload of a certain type of hardware resource within a certain period of time and divide it by the theoretical maximum workload of that type of hardware resource to obtain the utilization of that type of hardware resource.
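The general calculation performed by the hardware monitor 623 reduces to one formula, sketched here as a software model (the hardware computes this with counters; the function name is hypothetical):

```python
def utilization(actual_work, peak_per_cycle, cycles):
    """Utilization of a hardware resource over a monitoring window:
    actual work done divided by the theoretical maximum for the window.
    E.g. memory access bandwidth utilization = bytes received /
    (theoretical bytes-per-cycle * number of cycles in the window)."""
    return actual_work / (peak_per_cycle * cycles)
```

Plugging in the numbers from the example above, 500 bytes received over 1000 cycles at a theoretical 10 bytes/cycle gives 500/(10×1000) = 0.05, i.e., 5%.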
The scheduler 621 performs weighted round-robin scheduling of the thread bundles to be executed in conjunction with the utilization of each hardware resource monitored by the hardware monitor 623. For example, suppose the current thread bundle queue has 10 thread bundles, of which the last two are compute-intensive (which the scheduler 621 can recognize from the code characterizing the task type), and the hardware monitor 623 reports that the utilization of the operation unit 624 is 50% and that of the memory unit 625 is 90%. The scheduler 621 may then consider the operation unit 624 relatively idle, so the last two compute-intensive thread bundles may be preferentially scheduled by the selector 622.
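One simple way to model this resource-aware prioritization is to order the queue by the utilization of the resource each thread bundle targets, so bundles aimed at the least-busy resource come first. This is an illustrative sketch, not the scheduler 621's actual circuit; the names and mappings are assumptions.

```python
def schedule_order(warp_queue, utilization, type_to_resource):
    """Return the thread bundles ordered for scheduling: bundles whose task
    type targets a less-utilized hardware resource are scheduled earlier.
    Python's sort is stable, so bundles targeting equally busy resources
    keep their original round-robin order."""
    return sorted(
        warp_queue,
        key=lambda w: utilization[type_to_resource[w["task_type"]]],
    )
```

With the figures from the example (operation unit at 50%, memory unit at 90%), a compute-intensive bundle at the back of the queue moves ahead of memory-access-intensive bundles.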
The thread bundle scheduling method provided by the embodiments of the present disclosure can utilize the hardware resources in the GPU more reasonably; otherwise, when the utilization of the memory access unit is very high, a memory-access-heavy thread bundle (i.e., one whose task type is memory-access-intensive) might still be scheduled, and its execution may stall, affecting the overall processing performance of the GPU. Fig. 7 is a schematic diagram comparing task execution timing with the polling scheduling scheme in the related art. As shown in fig. 7, in the polling scheduling scheme 71, the thread bundle corresponding to memory access task 1, the thread bundle corresponding to memory access task 2, and the thread bundle corresponding to computing task 3 are scheduled in sequence; due to the memory bandwidth limitation, memory access task 2 can only be executed after memory access task 1 is completed, so executing memory access task 1, memory access task 2, and computing task 3 may take 12 ms. In the thread bundle scheduling method 72 provided in the embodiments of the present disclosure, however, the thread bundle corresponding to memory access task 1 is scheduled preferentially, the thread bundle corresponding to computing task 3 is scheduled without waiting for memory access task 1 to end, and the thread bundle corresponding to memory access task 2 is scheduled after memory access task 1 is completed, so that memory access task 1, memory access task 2, and computing task 3 can be completed in less total time. Therefore, the thread bundle scheduling method provided by the embodiments of the present disclosure can utilize the hardware resources in the GPU more reasonably and improve the overall performance of task processing.
It should be noted that, the resource monitoring module in the foregoing embodiment may be implemented by the above-mentioned hardware monitor, and the scheduling execution module may be implemented by the above-mentioned scheduler and selector.
It should be noted herein that the above description of various embodiments is intended to emphasize the differences between the various embodiments, and that the same or similar features may be referred to each other. The above description of apparatus, graphics processor and computer device embodiments is similar to that of method embodiments described above, with similar benefits as the method embodiments. In some embodiments, the apparatus, the graphics processor, and the computer device provided by the embodiments of the present disclosure may have functions or included modules that may be used to perform the methods described in the method embodiments above, and for technical details not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the description of the embodiments of the method of the present disclosure for understanding.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in various embodiments of the present disclosure, the size of the sequence numbers of the steps/processes described above does not mean the order of execution, and the order of execution of the steps/processes should be determined by their functions and inherent logic, and should not constitute any limitation on the implementation of the embodiments of the present disclosure. The foregoing embodiment numbers of the present disclosure are merely for description and do not represent advantages or disadvantages of the embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises that element.
In several embodiments provided in the present disclosure, it should be understood that the disclosed graphics processor and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative; for example, the division of the units is merely a logical function division, and there may be other divisions in actual implementation, such as multiple units or components being combined or integrated into another system, or some features being omitted or not performed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units, may be located in one place or distributed on a plurality of network units, and may select some or all of the units according to actual needs to achieve the purpose of the embodiment. In addition, each functional unit in each embodiment of the disclosure may be integrated in one processing unit, or each unit may be separately used as a unit, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of hardware plus a form of software functional unit.
It will be appreciated by those of ordinary skill in the art that all or part of the steps of the above method embodiments may be implemented by program instructions executed on relevant hardware. The above program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The above storage medium includes various media that can store program code, such as a removable storage device, a Read Only Memory (ROM), a magnetic disk, or an optical disk.
The foregoing is merely an embodiment of the present disclosure, but the protection scope of the present disclosure is not limited thereto, and any changes or substitutions easily conceivable by those skilled in the art within the technical scope of the present disclosure should be included in the protection scope of the present disclosure.