CN117827376B - GPU test task scheduling method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN117827376B
Authority
CN
China
Prior art keywords
test task
gpu test
task
containers
real
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311867994.2A
Other languages
Chinese (zh)
Other versions
CN117827376A (en)
Inventor
Name not disclosed at the request of the inventor
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Moore Thread Intelligent Technology Chengdu Co ltd
Original Assignee
Moore Thread Intelligent Technology Chengdu Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Moore Thread Intelligent Technology Chengdu Co ltd
Priority to CN202311867994.2A
Publication of CN117827376A
Application granted
Publication of CN117827376B
Legal status: Active


Abstract

The disclosure relates to a GPU test task scheduling method and device, an electronic device and a storage medium. The method includes: receiving a plurality of GPU test tasks submitted by a GPU test system, and creating a custom resource instance corresponding to each GPU test task; determining a target GPU test task that currently needs to be executed among the plurality of GPU test tasks, and creating containers for the target GPU test task using real-time schedulable resources according to the custom resource instance corresponding to the target GPU test task; starting execution of the target GPU test task with the created containers, and determining whether residual test cases which correspond to the target GPU test task and are not executed exist in the GPU test system; and when the residual test cases exist, performing container expansion for the target GPU test task using the real-time schedulable resources according to the custom resource instance corresponding to the target GPU test task.

Description

GPU test task scheduling method and device, electronic equipment and storage medium
Technical Field
The disclosure relates to the field of computer technology, and in particular to a GPU test task scheduling method and device, an electronic device and a storage medium.
Background
During development of a graphics processor (Graphics Processing Unit, GPU), each change to the chip or driver requires millions of test cases to verify that its functionality and performance meet expectations. Given such a huge number of test cases, the conventional approach of verifying with a single test task on a single server is clearly very time-consuming and seriously reduces research and development efficiency. By contrast, constructing multiple GPU test tasks through containerization and scheduling them onto multiple machines, with each GPU test task executing a portion of the test cases, can accelerate GPU verification. Effective GPU test task scheduling has an important impact on computing resource utilization and test task execution efficiency; therefore, a GPU test task scheduling method is highly desirable.
Disclosure of Invention
The disclosure provides a technical solution for a GPU test task scheduling method and device, an electronic device and a storage medium.
According to an aspect of the present disclosure, there is provided a GPU test task scheduling method, including: receiving a plurality of GPU test tasks submitted by a GPU test system, and creating a custom resource instance corresponding to each GPU test task, wherein the custom resource instance corresponding to each GPU test task is used for describing the GPU test task; determining a target GPU test task which is required to be executed currently in the plurality of GPU test tasks, and creating a container for the target GPU test task by utilizing real-time schedulable resources according to the custom resource instance corresponding to the target GPU test task; starting to execute the target GPU test task by using the created container, and determining whether residual test cases which correspond to the target GPU test task and are not executed exist in the GPU test system; and under the condition that the residual test cases exist, performing container expansion for the target GPU test task by utilizing real-time schedulable resources according to the custom resource instance corresponding to the target GPU test task.
In one possible implementation manner, the custom resource instance corresponding to each GPU test task includes a task priority and a creation time corresponding to the GPU test task; the determining a target GPU test task that is currently required to be executed in the multiple GPU test tasks includes: constructing a task queue according to the task priority and the creation time corresponding to each GPU test task, wherein the task queue is used for indicating the execution sequence of the plurality of GPU test tasks; and determining the GPU test task at the head of the queue in the task queue as the target GPU test task.
In one possible implementation, the real-time schedulable resources include: kubernetes cluster real-time schedulable resources and public server cluster real-time schedulable resources; the method further comprises the steps of: monitoring resource information of each node in the Kubernetes cluster, and determining the resource quantity of real-time schedulable resources of the Kubernetes cluster; and monitoring the resource information of each node in the public server cluster, and determining the resource quantity of the real-time schedulable resources of the public server cluster.
In one possible implementation manner, the custom resource instance corresponding to each GPU test task includes the minimum number of containers required by the GPU test task and the amount of resources required by each container; the creating a container for the target GPU testing task by using real-time schedulable resources according to the custom resource instance corresponding to the target GPU testing task includes: determining the first container number created by the real-time schedulable resource support of the Kubernetes cluster and the second container number created by the real-time schedulable resource support of the public server according to the resource amount required by each container corresponding to the target GPU test task, the resource amount of the real-time schedulable resource of the Kubernetes cluster and the resource amount of the real-time schedulable resource of the public server cluster; and under the condition that the sum of the first container number and the second container number is larger than or equal to the minimum container number required by the target GPU test task, creating the container with the minimum container number for the target GPU test task by utilizing real-time schedulable resources.
In one possible implementation manner, the creating, for the target GPU test task, the container of the minimum container number using real-time schedulable resources if the sum of the first container number and the second container number is greater than or equal to the minimum container number required by the target GPU test task includes: and under the condition that the first container number is larger than or equal to the minimum container number, utilizing the Kubernetes cluster real-time schedulable resources to create the container with the minimum container number for the target GPU test task.
In one possible implementation manner, the creating, for the target GPU test task, the container of the minimum container number using real-time schedulable resources if the sum of the first container number and the second container number is greater than or equal to the minimum container number required by the target GPU test task includes: creating a container of the first container number for the target GPU test task using Kubernetes cluster real-time schedulable resources if the first container number is less than the minimum container number; and creating containers of a third container number for the target GPU testing task by utilizing a public server real-time schedulable resource, wherein the third container number is the difference value between the minimum container number and the first container number.
In one possible implementation manner, the custom resource instance corresponding to each GPU test task includes the number of single expansion containers corresponding to the GPU test task; and when the residual test cases exist, performing container expansion for the target GPU test task by utilizing real-time schedulable resources according to the custom resource instance corresponding to the target GPU test task includes: determining the fourth container number created by the real-time schedulable resource support of the Kubernetes cluster and the fifth container number created by the real-time schedulable resource support of the public server according to the resource amount required by each container corresponding to the target GPU test task, the resource amount of the real-time schedulable resource of the Kubernetes cluster and the resource amount of the real-time schedulable resource of the public server cluster; and under the condition that the fourth container number is greater than 0 and/or the fifth container number is greater than 0, utilizing the Kubernetes cluster real-time schedulable resources and/or the public server cluster real-time schedulable resources to expand containers of a sixth container number for the target GPU test task, wherein the sixth container number is smaller than or equal to the number of single expansion containers corresponding to the target GPU test task.
In one possible implementation manner, the constructing a task queue according to the task priority and the creation time corresponding to each GPU test task includes: and sequencing the plurality of GPU test tasks according to the task priority from high to low and the creation time from first to last under the same task priority, so as to obtain the task queue.
In one possible implementation, the method further includes: after initiating execution of the target GPU test task, the target GPU test task is removed from the task queue.
In one possible implementation, the method further includes: and verifying the custom resource instance corresponding to each GPU test task based on a preset rule.
According to an aspect of the present disclosure, there is provided a GPU test task scheduling device, including: a custom resource module, configured to receive a plurality of GPU test tasks submitted by a GPU test system and create a custom resource instance corresponding to each GPU test task, wherein the custom resource instance corresponding to each GPU test task is used for describing the GPU test task; a scheduling module, configured to determine a target GPU test task which is required to be executed currently in the plurality of GPU test tasks, and create a container for the target GPU test task by utilizing real-time schedulable resources according to the custom resource instance corresponding to the target GPU test task; the scheduling module is further configured to control the created container to start to execute the target GPU test task, and determine whether residual test cases which correspond to the target GPU test task and are not executed exist in the GPU test system; and the scheduling module is further configured to perform container expansion for the target GPU test task by utilizing real-time schedulable resources according to the custom resource instance corresponding to the target GPU test task under the condition that the residual test cases exist.
According to an aspect of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the instructions stored in the memory to perform the above method.
According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.
In the embodiment of the disclosure, a plurality of GPU test tasks submitted by the GPU test system are received, and a custom resource instance corresponding to each GPU test task is created, so that each GPU test task is effectively described based on its custom resource instance. As long as schedulable resources exist, a target GPU test task which is required to be executed currently can be determined among the plurality of GPU test tasks, and containers can be created for the target GPU test task by utilizing real-time schedulable resources according to the custom resource instance corresponding to the target GPU test task, so that the created containers can be used to effectively start and execute the target GPU test task. Further, in the test process after the target GPU test task is started, it is determined whether residual test cases which correspond to the target GPU test task and are not executed exist in the GPU test system, and when the residual test cases exist, container expansion is performed for the target GPU test task by utilizing the real-time schedulable resources according to the custom resource instance corresponding to the target GPU test task, so that dynamic capacity expansion of the target GPU test task is realized, and the resource utilization rate and the test efficiency are effectively improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the technical aspects of the disclosure.
Fig. 1 illustrates a flowchart of a GPU test task scheduling method, according to an embodiment of the present disclosure.
Fig. 2 illustrates a block diagram of a GPU test task scheduling device, according to an embodiment of the present disclosure.
Fig. 3 illustrates a block diagram of an electronic device, according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" is herein merely an association relationship describing an associated object, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, may mean including any one or more elements selected from the group consisting of A, B and C.
Furthermore, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art have not been described in detail in order not to obscure the present disclosure.
A container is a form of operating-system virtualization: a single container can run anything from a small microservice or software process to a large application, and it contains all necessary executable files, binaries, libraries, and configuration files. Kubernetes is an open-source container orchestration platform that can automatically create, run and destroy containers in a cluster formed of multiple server nodes.
In the GPU testing process, an effective scheduling method has an important influence on computing resource utilization and test task execution efficiency. When using Kubernetes for container scheduling, the Kubernetes native scheduler is typically used for test task scheduling, as follows: 1) submit a test task resource describing the number of containers to be created and the central processing unit (CPU) resources and memory resources required by each container; 2) Kubernetes creates a fixed number of containers according to the configuration to execute the test task; 3) after the test task is completed, the containers are destroyed.
The scheduling method based on native Kubernetes has the following disadvantages: 1. only a fixed number of containers can be created, and the number of containers cannot be dynamically adjusted according to the number of residual test cases and the resource utilization, so computing resource utilization and test task execution efficiency are low; 2. Kubernetes can only schedule resources that have joined the Kubernetes cluster; the resources of some public servers cannot join the Kubernetes cluster even though they are idle in certain periods and cannot be fully utilized, so that while the resources required for GPU testing are in short supply, part of the resources are wasted; 3. Kubernetes requires a tester to manually check whether a test task is running and whether idle resources currently exist before deciding whether to submit a task; this approach is inefficient, and when one test task finishes, the next test task cannot be triggered in time, leaving resources idle and wasted.
In order to solve the technical problems, the embodiment of the disclosure provides a method for scheduling GPU test tasks, which can effectively realize dynamic expansion of GPU test tasks and improve resource utilization rate and test efficiency. The GPU test task scheduling method provided by the embodiments of the present disclosure is described in detail below.
Fig. 1 illustrates a flowchart of a GPU test task scheduling method according to an embodiment of the present disclosure. The method may be performed by an electronic device such as a terminal device or a server, where the terminal device may be a user equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, etc., and the method may be implemented by a processor invoking computer readable instructions stored in a memory. Alternatively, the method may be performed by a server. As shown in fig. 1, the method includes:
In step S11, a plurality of GPU test tasks submitted by the GPU test system are received, and a custom resource instance corresponding to each GPU test task is created, where the custom resource instance corresponding to each GPU test task is used to describe the GPU test task.
The GPU test system is a system for performing GPU testing, and includes a plurality of GPU test tasks and test cases corresponding to each GPU test task. The specific forms of the GPU test system, the GPU test tasks and the test cases can be flexibly set according to actual conditions and are not specifically limited in the present disclosure.
Aiming at a plurality of GPU test tasks which are submitted by a GPU test system and need to be executed, a corresponding custom resource instance is created for each GPU test task, so that each GPU test task is effectively described.
For any one GPU test task, the custom resource instance corresponding to the GPU test task may be used to describe the GPU test task, for example, describe the task name, creation time, and required resources of the GPU test task. The details of the custom resource instance will be described in detail below in connection with possible implementation manners of the embodiments of the present disclosure, which are not described herein.
In step S12, determining a target GPU test task that needs to be executed currently from the multiple GPU test tasks, and creating a container for the target GPU test task by using the real-time schedulable resource according to the custom resource instance corresponding to the target GPU test task.
As long as schedulable resources currently exist, a target GPU test task which is required to be executed currently in the plurality of GPU test tasks can be determined, and a container is created for the target GPU test task by utilizing the real-time schedulable resources according to the custom resource instance corresponding to the target GPU test task, so that the created container can be utilized to effectively start and execute the target GPU test task.
In an example, efficiently initiating execution of a target GPU test task with a created container includes: based on the created container, acquiring a test case corresponding to the target GPU test task from the GPU test system through an HTTP request.
In one example, the GPU testing system records test cases corresponding to the target GPU testing tasks that have been pulled by the created container. For example, the state of the test case corresponding to the target GPU test task that has been pulled by the created container is set to be executing, and the state of the test case corresponding to the target GPU test task that has not been pulled by the created container is set to be unexecuted.
In an example, after the created container has executed one test case corresponding to the target GPU test task, the test result corresponding to the test case is returned to the GPU test system, and then the GPU test system may set the state of the test case to be completed.
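As an illustration only, the following Go sketch shows how a container worker might pull test cases from the GPU test system over HTTP and report results back; the endpoint paths, payload fields and the worker package name are assumptions, not part of the patent.

```go
package worker

import (
	"bytes"
	"encoding/json"
	"net/http"
)

// TestCase is a minimal view of a test case handed out by the GPU test system.
type TestCase struct {
	ID   string `json:"id"`
	Name string `json:"name"`
}

// PullCase fetches the next unexecuted test case for a task over HTTP; on the
// test system side the pulled case would be marked as "executing". The
// endpoint path and response schema are assumptions made for illustration.
func PullCase(baseURL, task string) (*TestCase, error) {
	resp, err := http.Get(baseURL + "/api/tasks/" + task + "/next-case") // hypothetical endpoint
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusNoContent {
		return nil, nil // no residual test cases remain
	}
	var tc TestCase
	if err := json.NewDecoder(resp.Body).Decode(&tc); err != nil {
		return nil, err
	}
	return &tc, nil
}

// ReportResult posts the result of a finished case back to the GPU test
// system, which can then mark the case as completed.
func ReportResult(baseURL, caseID string, passed bool) error {
	payload := map[string]interface{}{"case_id": caseID, "passed": passed} // assumed payload shape
	body, _ := json.Marshal(payload)
	resp, err := http.Post(baseURL+"/api/results", "application/json", bytes.NewReader(body)) // hypothetical endpoint
	if err != nil {
		return err
	}
	return resp.Body.Close()
}
```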
The specific process of how to determine the target GPU test task that needs to be executed currently in the multiple GPU test tasks and how to create a container for the target GPU test task by using the real-time schedulable resources according to the custom resource instance corresponding to the target GPU test task will be described in detail later in connection with the possible implementation manner of the present disclosure, and will not be described in detail here.
In step S13, the execution of the target GPU test task is started by using the created container, and it is determined whether there are remaining test cases corresponding to the target GPU test task that are not executed in the GPU test system.
In the testing process after the built container is utilized to start executing the target GPU testing task, whether residual testing cases corresponding to the target GPU testing task exist in the GPU testing system or not can be determined, and whether capacity expansion is needed for the target GPU testing task or not is judged.
In an example, whether residual test cases corresponding to the target GPU test task which are not executed exist in the GPU test system may be determined by checking whether any test case corresponding to the target GPU test task in the GPU test system is still in the unexecuted state.
In step S14, when there are remaining test cases, according to the custom resource instance corresponding to the target GPU test task, the real-time schedulable resource is utilized to perform container extension for the target GPU test task.
When the residual test cases exist, the real-time schedulable resources can be utilized to perform container expansion for the target GPU test tasks according to the custom resource examples corresponding to the target GPU test tasks, so that dynamic expansion of the target GPU test tasks is realized, and the resource utilization rate and the test efficiency are effectively improved.
The specific process of how to utilize real-time schedulable resources for container expansion for the target GPU test task will be described in detail below in connection with possible implementations of the present disclosure, and will not be described in detail here.
In the embodiment of the disclosure, a plurality of GPU test tasks submitted by the GPU test system are received, and a custom resource instance corresponding to each GPU test task is created, so that each GPU test task is effectively described based on its custom resource instance. As long as schedulable resources exist, a target GPU test task which is required to be executed currently can be determined among the plurality of GPU test tasks, and containers can be created for the target GPU test task by utilizing real-time schedulable resources according to the custom resource instance corresponding to the target GPU test task, so that the created containers can be used to effectively start and execute the target GPU test task. Further, in the test process after the target GPU test task is started, it is determined whether residual test cases which correspond to the target GPU test task and are not executed exist in the GPU test system, and when the residual test cases exist, container expansion is performed for the target GPU test task by utilizing the real-time schedulable resources according to the custom resource instance corresponding to the target GPU test task, so that dynamic capacity expansion of the target GPU test task is realized, and the resource utilization rate and the test efficiency are effectively improved.
In an example, a corresponding custom resource instance is created for each GPU test task based on the custom resource module. The custom resource module includes a data structure ATFJobScaler for describing GPU test tasks, which is registered in Kubernetes as an extended custom resource; ATFJobScaler creates a custom resource instance for each GPU test task, and the custom resource instance corresponding to each GPU test task is stored in the etcd database of Kubernetes, which is used for storing and managing cluster data.
In an example, for any one GPU test task, the custom resource instance corresponding to the GPU test task may include a plurality of fields: a. task name (Name); b. task priority (Priority); c. creation time (Creation Time); d. minimum number of containers (Min Replicas); e. maximum number of containers (Max Replicas); f. number of containers created per single expansion (Step Replicas); g. environment variables (Env); h. number of CPUs required per container (CPU Request); i. number of GPUs required per container (GPU Request); j. memory size required per container (Memory Request); k. current task status (Status); l. number of containers currently running (Running Replicas). The custom resource instance corresponding to the GPU test task may include the above information, and other information may be set according to actual situations, which is not specifically limited in this disclosure.
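For illustration, the fields above could be expressed as a KubeBuilder-style Go type; the field names, types and package layout below are assumptions and may not match the patent's actual ATFJobScaler schema.

```go
package v1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// ATFJobScalerSpec mirrors the fields a-j listed above; names and types are illustrative.
type ATFJobScalerSpec struct {
	Name          string            `json:"name"`          // a. task name
	Priority      int32             `json:"priority"`      // b. task priority
	MinReplicas   int32             `json:"minReplicas"`   // d. minimum number of containers
	MaxReplicas   int32             `json:"maxReplicas"`   // e. maximum number of containers
	StepReplicas  int32             `json:"stepReplicas"`  // f. containers created per single expansion
	Env           map[string]string `json:"env,omitempty"` // g. environment variables
	CPURequest    int32             `json:"cpuRequest"`    // h. CPUs required per container
	GPURequest    int32             `json:"gpuRequest"`    // i. GPUs required per container
	MemoryRequest int64             `json:"memoryRequest"` // j. memory (bytes) required per container
}

// ATFJobScalerStatus mirrors fields c, k and l above.
type ATFJobScalerStatus struct {
	CreationTime    metav1.Time `json:"creationTime"`    // c. creation time
	Phase           string      `json:"phase"`           // k. current task status: Waiting / Running / Completed
	RunningReplicas int32       `json:"runningReplicas"` // l. number of containers currently running
}
```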
In one possible implementation, the method further includes: and verifying the custom resource instance corresponding to each GPU test task based on a preset rule.
And checking a custom resource instance corresponding to any GPU test task based on a preset rule so as to ensure that the custom resource instance can correctly describe the GPU test task.
In an example, the custom resource module further includes a custom resource verification program, based on the custom resource verification program, and each field in the custom resource instance corresponding to the GPU test task is verified based on a preset rule.
In one example, a Kubernetes Controller is implemented based on Kube Builder to monitor the creation, update and deletion of custom resource instances corresponding to GPU test tasks. When the Kubernetes Controller detects that a custom resource instance corresponding to a GPU test task has been created, the custom resource verification program is triggered to perform verification.
In an example, the preset rules may include:
1) The minimum number of containers is greater than 0 and less than the maximum number of containers;
2) The number of the single expansion containers created by each expansion is more than 0 and less than the difference value between the maximum container number and the minimum container number;
3) Resetting the maximum container number to be equal to the total number of the test cases corresponding to the GPU test tasks under the condition that the maximum container number is larger than the total number of the test cases corresponding to the GPU test tasks;
4) The number of CPUs required per container is greater than 0;
5) The number of GPUs required per container is greater than 0;
6) The memory size required for each container is greater than 0.
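A minimal sketch of how these preset rules might be checked against the hypothetical spec type sketched earlier; totalCases stands for the total number of test cases of the task and is an assumed parameter, not a field named in the patent.

```go
package v1

import "fmt"

// Validate checks an ATFJobScalerSpec against the preset rules 1)-6) above.
// This is an illustrative sketch, not the patent's actual verification program.
func (s *ATFJobScalerSpec) Validate(totalCases int32) error {
	// rules 1: 0 < minReplicas < maxReplicas
	if s.MinReplicas <= 0 || s.MinReplicas >= s.MaxReplicas {
		return fmt.Errorf("minReplicas must be > 0 and < maxReplicas")
	}
	// rule 2: 0 < stepReplicas < maxReplicas - minReplicas
	if s.StepReplicas <= 0 || s.StepReplicas >= s.MaxReplicas-s.MinReplicas {
		return fmt.Errorf("stepReplicas must be > 0 and < maxReplicas - minReplicas")
	}
	// rule 3: clamp maxReplicas to the total number of test cases
	if s.MaxReplicas > totalCases {
		s.MaxReplicas = totalCases
	}
	// rules 4-6: per-container resource requests must all be positive
	if s.CPURequest <= 0 || s.GPURequest <= 0 || s.MemoryRequest <= 0 {
		return fmt.Errorf("cpuRequest, gpuRequest and memoryRequest must all be > 0")
	}
	return nil
}
```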
For any one GPU test task, after the custom resource instance corresponding to the GPU test task passes the verification, the current task state included in the custom resource instance corresponding to the GPU test task can be set to be waiting.
In one possible implementation manner, the custom resource instance corresponding to each GPU test task includes a task priority and a creation time corresponding to the GPU test task; determining a target GPU test task which is required to be executed currently in a plurality of GPU test tasks comprises the following steps: constructing a task queue according to the task priority and the creation time corresponding to each GPU test task, wherein the task queue is used for indicating the execution sequence of a plurality of GPU test tasks; and determining the GPU test task at the head of the queue in the task queue as a target GPU test task.
For a plurality of GPU test tasks in waiting state after verification, the execution sequence of the plurality of GPU test tasks can be determined according to the task priority and the creation time corresponding to each GPU test task, and then a task queue is constructed according to the execution sequence of the plurality of GPU test tasks, so that each GPU test task in the task queue can be effectively and automatically scheduled in the follow-up state, and the automatic test efficiency is improved.
In one possible implementation, constructing a task queue according to a task priority and a creation time corresponding to each GPU test task includes: and sequencing the plurality of GPU test tasks according to the task priority from high to low and the creation time from first to last under the same task priority, so as to obtain a task queue.
For a plurality of GPU test tasks with current task states in waiting after verification, according to task priorities and creation time corresponding to each GPU test task, firstly sequencing the tasks according to the priorities of the tasks from high to low, and further sequencing the GPU test tasks with the same task priorities according to the creation time from first to last, so as to finally obtain a task queue.
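The ordering described above could be implemented, for example, with a stable sort; the Task type and the convention that a larger Priority value means higher priority are assumptions for this sketch.

```go
package scheduler

import "sort"

// Task is a simplified view of a waiting GPU test task (illustrative only).
type Task struct {
	Name         string
	Priority     int   // higher value = higher priority (assumption)
	CreationTime int64 // Unix timestamp of custom resource creation
}

// BuildQueue orders waiting tasks by priority (high to low) and, within the
// same priority, by creation time (earliest first), as described above.
func BuildQueue(tasks []Task) []Task {
	sort.SliceStable(tasks, func(i, j int) bool {
		if tasks[i].Priority != tasks[j].Priority {
			return tasks[i].Priority > tasks[j].Priority
		}
		return tasks[i].CreationTime < tasks[j].CreationTime
	})
	return tasks
}
```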
Because the plurality of GPU test tasks in the task queue are sequentially arranged according to the execution sequence, the GPU test task positioned at the head of the queue in the task queue is always determined to be the target GPU test task to be executed currently in the scheduling process.
In one possible implementation, the real-time schedulable resources include: kubernetes cluster real-time schedulable resources and public server cluster real-time schedulable resources; the method further comprises the steps of: monitoring resource information of each node in the Kubernetes cluster, and determining the resource quantity of real-time schedulable resources of the Kubernetes cluster; and monitoring the resource information of each node in the public server cluster, and determining the resource quantity of the real-time schedulable resources of the public server cluster.
The real-time schedulable resources for GPU testing can include not only the Kubernetes cluster real-time schedulable resources but also the public server cluster real-time schedulable resources, so that more schedulable resources are utilized for GPU testing, which can effectively improve GPU test efficiency and reduce cost.
In an example, a go-client of Kubernetes may be connected to a Kubernetes cluster to obtain resource information of each node in the Kubernetes cluster, so as to obtain a resource amount of real-time schedulable resources of the Kubernetes cluster.
In an example, the resource amount of the Kubernetes cluster real-time schedulable resource may include: the number of the Kubernetes cluster real-time schedulable CPUs, the number of the Kubernetes cluster real-time schedulable GPUs and the size of the Kubernetes cluster real-time schedulable memory.
In an example, for any node i in the Kubernetes cluster, the following resource information may be obtained: a) Total CPU number of node i (Available CPU); b) The number of used CPUs for node i (Requested CPU); c) Total GPU number of node i (Available GPU); d) Number of used GPUs for node i (Requested GPUs); e) The total Memory size (Available Memory) of node i; f) The used Memory size (Requested Memory) of node i. Further, determining a resource amount of the real-time schedulable resource of the node i: a) Real-time schedulable CPU number of node i = total CPU number of node i-used CPU number of node i; b) Real-time schedulable GPU number of node i = total GPU number of node i-used GPU number of node i; c) Real-time schedulable memory size of node i = total memory size of node i-used memory size of node i. Finally, determining the resource quantity of the real-time schedulable resources of the Kubernetes cluster: a) The number of real-time schedulable CPUs of Kubernetes cluster (Idle CPU) =sum of the number of real-time schedulable CPUs of each node in Kubernetes cluster; b) Real-time schedulable GPU number of Kubernetes cluster (Idle GPU) =sum of real-time schedulable GPU numbers of each node in Kubernetes cluster; c) Real-time schedulable Memory size of Kubernetes cluster (Idle Memory) =sum of real-time schedulable Memory sizes of each node in Kubernetes cluster.
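As an illustrative sketch, the per-node and cluster-level calculations above can be expressed as follows; in practice the per-node quantities would be read through the Kubernetes go-client, which is omitted here as an assumption.

```go
package scheduler

// NodeResources holds the per-node quantities a)-f) described above.
type NodeResources struct {
	AvailableCPU, RequestedCPU       int
	AvailableGPU, RequestedGPU       int
	AvailableMemory, RequestedMemory int64
}

// ClusterIdle sums the real-time schedulable resources of every node:
// idle = available - requested per node, then summed over the cluster.
func ClusterIdle(nodes []NodeResources) (idleCPU, idleGPU int, idleMem int64) {
	for _, n := range nodes {
		idleCPU += n.AvailableCPU - n.RequestedCPU
		idleGPU += n.AvailableGPU - n.RequestedGPU
		idleMem += n.AvailableMemory - n.RequestedMemory
	}
	return idleCPU, idleGPU, idleMem
}
```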
In an example, a common server cluster may include resources that are in a non-idle state during working periods and in an idle state during non-working periods. The common server cluster resources may be centrally managed by a resource management system (DUT); furthermore, the DUT may provide an API for acquiring the common server cluster real-time schedulable resources and for creating containers on those resources to run GPU test tasks.
In an example, an HTTP request may be sent to the API provided by the DUT to obtain the real-time schedulable resources of the common server cluster managed by the DUT.
In an example, the common server cluster real-time schedulable resources may include: the number of real-time schedulable CPUs of the common server cluster (Idle CPU '), the number of real-time schedulable GPUs of the common server cluster (Idle GPU '), and the size of real-time schedulable Memory of the common server cluster (Idle Memory ').
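A hedged sketch of how the DUT's HTTP API might be queried for these quantities; the URL path and JSON field names are assumptions, since the patent only states that the DUT exposes an API.

```go
package scheduler

import (
	"encoding/json"
	"net/http"
)

// DUTIdleResources mirrors Idle CPU', Idle GPU' and Idle Memory' above.
type DUTIdleResources struct {
	IdleCPU    int   `json:"idle_cpu"`
	IdleGPU    int   `json:"idle_gpu"`
	IdleMemory int64 `json:"idle_memory"`
}

// QueryDUTIdle asks the resource management system for the common server
// cluster's real-time schedulable resources over HTTP.
func QueryDUTIdle(baseURL string) (DUTIdleResources, error) {
	var out DUTIdleResources
	resp, err := http.Get(baseURL + "/api/v1/idle-resources") // hypothetical endpoint
	if err != nil {
		return out, err
	}
	defer resp.Body.Close()
	err = json.NewDecoder(resp.Body).Decode(&out)
	return out, err
}
```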
In one possible implementation manner, the custom resource instance corresponding to each GPU test task includes the minimum number of containers required by the GPU test task and the amount of resources required by each container; creating a container for the target GPU test task by utilizing real-time schedulable resources according to the custom resource instance corresponding to the target GPU test task, including: determining the number of first containers created by the real-time schedulable resource support of the Kubernetes cluster and the number of second containers created by the real-time schedulable resource support of the public server according to the amount of resources required by each container corresponding to the target GPU test task, the amount of resources of the real-time schedulable resource of the Kubernetes cluster and the amount of resources of the real-time schedulable resource of the public server cluster; and under the condition that the sum of the first container number and the second container number is larger than or equal to the minimum container number required by the target GPU test task, creating the container with the minimum container number for the target GPU test task by utilizing the real-time schedulable resource.
After determining the target GPU test task that is currently needed to be executed and determining the current real-time schedulable resource, it may be determined whether the current real-time schedulable resource is capable of supporting the execution of the target GPU test task and, when capable of supporting, effectively creating a container to start the execution of the target GPU test task.
First, determining the number of first containers created by the real-time schedulable resource support of the Kubernetes cluster according to the resource amount required by each container corresponding to the target GPU test task and the resource amount of the real-time schedulable resource of the Kubernetes cluster.
For example, the first container number N1 whose creation can be supported by the Kubernetes cluster real-time schedulable resources is determined according to the Kubernetes cluster real-time schedulable CPU number (Idle CPU), the Kubernetes cluster real-time schedulable GPU number (Idle GPU), the Kubernetes cluster real-time schedulable memory size (Idle Memory), and the CPU number (CPU Request), GPU number (GPU Request) and memory size (Memory Request) required by each container corresponding to the target GPU test task, by the following formula (1):
N1 = min(floor(Idle CPU / CPU Request), floor(Idle GPU / GPU Request), floor(Idle Memory / Memory Request))   (1)
Secondly, the second container number whose creation can be supported by the public server cluster real-time schedulable resources is determined according to the resource amount required by each container corresponding to the target GPU test task and the resource amount of the public server cluster real-time schedulable resources.
For example, the second container number N2 is determined according to the public server cluster real-time schedulable CPU number (Idle CPU'), the public server cluster real-time schedulable GPU number (Idle GPU'), the public server cluster real-time schedulable memory size (Idle Memory'), and the CPU number (CPU Request), GPU number (GPU Request) and memory size (Memory Request) required by each container corresponding to the target GPU test task, by the following formula (2):
N2 = min(floor(Idle CPU' / CPU Request), floor(Idle GPU' / GPU Request), floor(Idle Memory' / Memory Request))   (2)
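Both formulas reduce to the same computation: the number of containers a resource pool can support is the minimum of the three floor divisions. A sketch in Go, equally applicable to N1, N2 and the later fourth and fifth container numbers:

```go
package scheduler

// SupportedContainers computes how many containers of the given per-container
// request a pool of idle resources can support: the minimum of the three
// floor divisions, matching formulas (1) and (2) above.
func SupportedContainers(idleCPU, idleGPU int, idleMem int64,
	cpuReq, gpuReq int, memReq int64) int {
	if cpuReq <= 0 || gpuReq <= 0 || memReq <= 0 {
		return 0
	}
	n := idleCPU / cpuReq
	if g := idleGPU / gpuReq; g < n {
		n = g
	}
	if m := int(idleMem / memReq); m < n {
		n = m
	}
	if n < 0 {
		return 0
	}
	return n
}
```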
And then, judging whether the sum of the first container number and the second container number is larger than or equal to the minimum container number required by the target GPU test task.
If the sum of the first container number and the second container number is smaller than the minimum container number required by the target GPU test task, the current real-time schedulable resource cannot support to start and execute the target GPU test task. At this point, monitoring of the real-time schedulable resources continues.
If the sum of the first container number and the second container number is greater than or equal to the minimum container number required by the target GPU test task, the current real-time schedulable resource can support to start and execute the target GPU test task. At this point, a minimum number of containers are created for the target GPU test task based on the real-time schedulable resources.
In one possible implementation, in a case where the sum of the first container number and the second container number is greater than or equal to a minimum container number required by the target GPU test task, creating a container of the minimum container number for the target GPU test task using the real-time schedulable resource includes: and under the condition that the first container number is larger than or equal to the minimum container number, utilizing the Kubernetes cluster real-time schedulable resources to create the container with the minimum container number for the target GPU test task.
When the first container number is greater than or equal to the minimum container number, the real-time schedulable resources of the Kubernetes cluster alone can support the start-up and execution of the target GPU test task, and at this time, containers with the minimum container number are created for the target GPU test task by preferentially utilizing the real-time schedulable resources of the Kubernetes cluster specially used for executing the GPU test task.
For example, a Kubernetes-based go-client creates the minimum container number (Min Replicas) of containers using the Kubernetes cluster real-time schedulable resources.
In one possible implementation, in a case where the sum of the first container number and the second container number is greater than or equal to a minimum container number required by the target GPU test task, creating a container of the minimum container number for the target GPU test task using the real-time schedulable resource includes: under the condition that the first container number is smaller than the minimum container number, utilizing the Kubernetes cluster real-time schedulable resources to create containers of the first container number for the target GPU test task; and creating containers of a third container number for the target GPU testing task by utilizing the real-time schedulable resources of the common server, wherein the third container number is the difference value between the minimum container number and the first container number.
And under the condition that the first container number is smaller than the minimum container number, the independent Kubernetes cluster real-time schedulable resources cannot support the start-up execution of the target GPU test task, and at the moment, the Kubernetes cluster real-time schedulable resources and the common server real-time schedulable resources are comprehensively utilized to create containers with the minimum container number for the target GPU test task.
For example, a Kubernetes-based go-client creates containers of the first container number (N1) using the Kubernetes cluster real-time schedulable resources; containers of a third container number (N3) are created using the common server real-time schedulable resources through the API provided by the DUT, where N3 = Min Replicas - N1.
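The two cases above (Kubernetes cluster alone, or Kubernetes cluster plus common servers) can be summarized in one decision function; createOnK8s and createOnDUT are placeholders for the go-client and DUT API calls, which are assumptions of this sketch.

```go
package scheduler

// CreateInitialContainers decides where the minimum number of containers is
// created: if the Kubernetes cluster alone can host minReplicas containers it
// is used exclusively; otherwise the shortfall (the third container number)
// is created on the common server cluster through the DUT API.
func CreateInitialContainers(n1, minReplicas int,
	createOnK8s func(count int) error,
	createOnDUT func(count int) error) error {
	if n1 >= minReplicas {
		return createOnK8s(minReplicas)
	}
	if err := createOnK8s(n1); err != nil {
		return err
	}
	return createOnDUT(minReplicas - n1) // N3 = Min Replicas - N1
}
```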
After creating a minimum number of containers for the target GPU test task, the created containers are used to initiate execution of the target GPU test task.
In one possible implementation, the method further includes: after initiating execution of the target GPU test task, the target GPU test task is removed from the task queue.
After the execution of the target GPU test task is started, the current task state included in the custom resource instance corresponding to the target GPU test task is set to executing, and the target GPU test task is removed from the task queue, so that the GPU test task at the head of the task queue is always the waiting task that is next in the execution order.
And starting to execute the target GPU test task by using the created container aiming at the target GPU test task, pulling the test case corresponding to the target GPU test task from the GPU test system, executing the pulled test case in the created container, and returning the test result to the GPU test system.
In the process of executing the target GPU test task, judging whether residual test cases which correspond to the target GPU test task and are not executed exist in the GPU test system.
If no residual test cases corresponding to the target GPU test task exist, it indicates that all test cases corresponding to the target GPU test task have been taken up for execution in the created containers. At this time, it is further monitored whether the GPU test system has obtained the test result of each test case corresponding to the target GPU test task. If yes, the target GPU test task has finished executing, and all containers created for the target GPU test task are stopped and destroyed to release resources.
If residual test cases corresponding to the target GPU test task exist, it indicates that the target GPU test task has not finished executing and that relatively few containers are currently created for it. At this time, according to the current real-time schedulable resources, it is determined whether container expansion for the target GPU test task can be supported, so as to improve the execution efficiency of the target GPU test task.
In one possible implementation manner, the custom resource instance corresponding to each GPU test task includes the number of single expansion containers corresponding to the GPU test task; when the residual test cases exist, performing container expansion for the target GPU test task by using real-time schedulable resources according to the custom resource instance corresponding to the target GPU test task includes: determining the fourth container number created by the real-time schedulable resource support of the Kubernetes cluster and the fifth container number created by the real-time schedulable resource support of the public server according to the resource amount required by each container corresponding to the target GPU test task, the resource amount of the real-time schedulable resource of the Kubernetes cluster and the resource amount of the real-time schedulable resource of the public server cluster; and under the condition that the fourth container number is greater than 0 and/or the fifth container number is greater than 0, utilizing the Kubernetes cluster real-time schedulable resources and/or the common server cluster real-time schedulable resources to expand containers of a sixth container number for the target GPU test task, wherein the sixth container number is smaller than or equal to the number of single expansion containers corresponding to the target GPU test task.
When the residual test cases exist in the target GPU test task, judging whether the current real-time schedulable resource can support starting to perform container expansion for the target GPU test task, and effectively performing container expansion for the target GPU test task when the current real-time schedulable resource can support the starting to perform container expansion for the target GPU test task so as to improve execution efficiency.
Firstly, determining the number of fourth containers created by the real-time schedulable resource support of the Kubernetes cluster according to the resource amount required by each container corresponding to the target GPU test task and the resource amount of the real-time schedulable resource of the Kubernetes cluster. For the specific process, reference may be made to the above process of determining the number of first containers, which is not described herein.
And secondly, determining the number of fifth containers created by the real-time schedulable resource support of the public server cluster according to the resource amount required by each container corresponding to the target GPU test task and the resource amount of the real-time schedulable resource of the public server cluster. For the specific process, reference may be made to the above process of determining the number of second containers, which is not described herein.
Then, it is determined whether the fourth container number and/or the fifth container number is greater than 0.
If the number of the fourth containers and the number of the fifth containers are not greater than 0, the current real-time schedulable resource cannot support container expansion for the target GPU test task. At this point, monitoring of the real-time schedulable resources continues.
If the fourth container number and/or the fifth container number is greater than 0, it indicates that the current real-time schedulable resource can support container expansion for the target GPU test task. At this time, the sixth number of containers is expanded for the target GPU test task, where the sixth number of containers is less than or equal to the number of single expanded containers corresponding to the target GPU test task.
In an example, multiple container expansions may be performed for a target GPU test task, but it must be ensured that the number of containers added in each expansion is less than or equal to the number of single expansion containers corresponding to the target GPU test task (Step Replicas), and that the total number of expanded containers corresponding to the target GPU test task is less than or equal to the maximum container number corresponding to the target GPU test task (Max Replicas). The specific number of expansions can be flexibly set according to actual conditions, which the present disclosure does not specifically limit.
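A sketch of one expansion step under the constraints just described; preferring the Kubernetes cluster over the common servers is an assumption carried over from the initial container creation, not something the description states for expansion.

```go
package scheduler

// ExpandOnce performs one container expansion for a running task: the number
// added is capped by the single-expansion step and by the remaining headroom
// below maxReplicas, and cannot exceed what the idle resources (n4 on the
// Kubernetes cluster, n5 on the common servers) can support.
func ExpandOnce(n4, n5, running, stepReplicas, maxReplicas int) (onK8s, onDUT int) {
	if n4 < 0 {
		n4 = 0
	}
	if n5 < 0 {
		n5 = 0
	}
	budget := stepReplicas
	if headroom := maxReplicas - running; headroom < budget {
		budget = headroom
	}
	if budget <= 0 || (n4 == 0 && n5 == 0) {
		return 0, 0
	}
	// prefer the Kubernetes cluster, then fall back to the common servers (assumption)
	onK8s = min(n4, budget)
	onDUT = min(n5, budget-onK8s)
	return onK8s, onDUT
}

func min(a, b int) int {
	if a < b {
		return a
	}
	return b
}
```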
It will be appreciated that the above-mentioned method embodiments of the present disclosure may be combined with each other to form combined embodiments without departing from the principles and logic, which are not described in detail here due to space limitations. It will be appreciated by those skilled in the art that, in the above methods of the embodiments, the specific order of execution of the steps should be determined by their functions and possible inherent logic.
In addition, the disclosure further provides a GPU test task scheduling device, an electronic device, a computer readable storage medium, and a program, all of which can be used to implement any of the GPU test task scheduling methods provided in the disclosure; for the corresponding technical solutions and descriptions, refer to the corresponding descriptions in the method section, which are not repeated here.
Fig. 2 illustrates a block diagram of a GPU test task scheduling device, according to an embodiment of the present disclosure. As shown in fig. 2, the apparatus 20 includes:
The custom resource module 21 is configured to receive a plurality of GPU test tasks submitted by the GPU test system, and create custom resource instances corresponding to each GPU test task, where the custom resource instances corresponding to each GPU test task are used to describe the GPU test task;
the scheduling module 22 is configured to determine a target GPU test task that needs to be executed currently from the multiple GPU test tasks, and create a container for the target GPU test task by using real-time schedulable resources according to a custom resource instance corresponding to the target GPU test task;
The scheduling module 22 is further configured to control the created container to start and execute the target GPU test task, and determine whether there are remaining test cases corresponding to the target GPU test task that are not executed in the GPU test system;
The scheduling module 22 is further configured to perform container expansion for the target GPU test task by using the real-time schedulable resources according to the custom resource instance corresponding to the target GPU test task when the residual test cases exist.
In one possible implementation manner, the custom resource instance corresponding to each GPU test task includes a task priority and a creation time corresponding to the GPU test task;
the scheduling module 22 is specifically configured to:
Constructing a task queue according to the task priority and the creation time corresponding to each GPU test task, wherein the task queue is used for indicating the execution sequence of a plurality of GPU test tasks;
And determining the GPU test task at the head of the queue in the task queue as a target GPU test task.
In one possible implementation, the real-time schedulable resources include: kubernetes cluster real-time schedulable resources and public server cluster real-time schedulable resources;
the scheduling module 22 is specifically configured to:
Monitoring resource information of each node in the Kubernetes cluster, and determining the resource quantity of real-time schedulable resources of the Kubernetes cluster;
and monitoring the resource information of each node in the public server cluster, and determining the resource quantity of the real-time schedulable resources of the public server cluster.
In one possible implementation manner, the custom resource instance corresponding to each GPU test task includes the minimum number of containers required by the GPU test task and the amount of resources required by each container;
the scheduling module 22 is specifically configured to:
Determining the number of first containers created by the real-time schedulable resource support of the Kubernetes cluster and the number of second containers created by the real-time schedulable resource support of the public server according to the amount of resources required by each container corresponding to the target GPU test task, the amount of resources of the real-time schedulable resource of the Kubernetes cluster and the amount of resources of the real-time schedulable resource of the public server cluster;
And under the condition that the sum of the first container number and the second container number is larger than or equal to the minimum container number required by the target GPU test task, creating the container with the minimum container number for the target GPU test task by utilizing the real-time schedulable resource.
In one possible implementation, the scheduling module 22 is specifically configured to:
And under the condition that the first container number is larger than or equal to the minimum container number, utilizing the Kubernetes cluster real-time schedulable resources to create the container with the minimum container number for the target GPU test task.
In one possible implementation, the scheduling module 22 is specifically configured to:
Under the condition that the first container number is smaller than the minimum container number, utilizing the Kubernetes cluster real-time schedulable resources to create containers of the first container number for the target GPU test task;
And creating containers of a third container number for the target GPU testing task by utilizing the real-time schedulable resources of the common server, wherein the third container number is the difference value between the minimum container number and the first container number.
In one possible implementation manner, the custom resource instance corresponding to each GPU test task includes the number of single extension containers corresponding to the GPU test task;
the scheduling module 22 is specifically configured to:
determine a fourth container number that the Kubernetes cluster real-time schedulable resources support creating and a fifth container number that the public server real-time schedulable resources support creating, according to the amount of resources required by each container corresponding to the target GPU test task, the amount of the Kubernetes cluster real-time schedulable resources, and the amount of the public server cluster real-time schedulable resources;
and when the fourth container number is greater than 0 and/or the fifth container number is greater than 0, expand containers of a sixth container number for the target GPU test task by using the Kubernetes cluster real-time schedulable resources and/or the public server cluster real-time schedulable resources, wherein the sixth container number is less than or equal to the number of single-expansion containers corresponding to the target GPU test task.
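A minimal sketch, with illustrative names that are not part of the original disclosure, of one expansion round as described above: the number of containers added is capped both by what the two resource pools can currently host and by the per-round limit recorded in the custom resource instance.

def expansion_count(fourth_container_number: int, fifth_container_number: int,
                    single_expansion_limit: int) -> int:
    """Number of containers to add in this round (the sixth container number)."""
    available = fourth_container_number + fifth_container_number
    if available <= 0:
        return 0
    return min(available, single_expansion_limit)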
In one possible implementation, the scheduling module 22 is specifically configured to:
sort the plurality of GPU test tasks by task priority from high to low and, within the same task priority, by creation time from earliest to latest, to obtain the task queue.
In one possible implementation, the scheduling module 22 is specifically configured to:
remove the target GPU test task from the task queue after execution of the target GPU test task has been started.
In one possible implementation, the scheduling module 22 is specifically configured to:
verify the custom resource instance corresponding to each GPU test task based on a preset rule.
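A hedged sketch of rule-based validation of a custom resource instance before it is accepted for scheduling; the rule set and the field names (minContainers, singleExpansionContainers, and so on) are assumptions introduced only for illustration and are not taken from the original disclosure.

def validate_instance(spec: dict) -> list:
    """Return a list of validation errors; an empty list means the instance passes the preset rules."""
    errors = []
    if spec.get("minContainers", 0) < 1:
        errors.append("minContainers must be at least 1")
    if spec.get("singleExpansionContainers", 0) < 1:
        errors.append("singleExpansionContainers must be at least 1")
    if spec.get("cpuPerContainer", 0) <= 0 or spec.get("memoryPerContainer", 0) <= 0:
        errors.append("per-container resource amounts must be positive")
    if spec.get("priority") is None:
        errors.append("task priority is required")
    return errors

A task whose instance fails validation could, for example, be rejected before it is ever placed in the task queue.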
The method has a specific technical association with the internal structure of the computer system and can solve technical problems of improving hardware operation efficiency or execution effect (including reducing the amount of stored data, reducing the amount of transmitted data, and increasing hardware processing speed), thereby achieving the technical effect of improving the internal performance of the computer system in accordance with the laws of nature.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
The disclosed embodiments also provide a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method. The computer readable storage medium may be a volatile or nonvolatile computer readable storage medium.
The embodiment of the disclosure also provides an electronic device, which comprises: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the instructions stored in the memory to perform the above method.
Embodiments of the present disclosure also provide a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when run in a processor of an electronic device, performs the above method.
The electronic device may be provided as a terminal, server or other form of device.
Fig. 3 illustrates a block diagram of an electronic device according to an embodiment of the present disclosure. Referring to Fig. 3, an electronic device 1900 may be provided as a server or a terminal device. The electronic device 1900 includes a processing component 1922, which further includes one or more processors, and memory resources represented by a memory 1932 for storing instructions, such as application programs, executable by the processing component 1922. The application programs stored in the memory 1932 may include one or more modules, each corresponding to a set of instructions. Further, the processing component 1922 is configured to execute the instructions to perform the methods described above.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as the Microsoft server operating system (Windows Server™), the graphical user interface-based operating system promoted by Apple Inc. (Mac OS X™), the multi-user multi-process computer operating system (Unix™), the free and open-source Unix-like operating system (Linux™), the open-source Unix-like operating system (FreeBSD™), or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 1932, including computer program instructions executable by processing component 1922 of electronic device 1900 to perform the methods described above.
The present disclosure may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punch card or a raised structure in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (for example, a light pulse through a fiber optic cable), or an electrical signal transmitted through a wire.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
The computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk or C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), with state information of the computer readable program instructions, and the electronic circuitry can execute the computer readable program instructions.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The computer program product may be realized in particular by means of hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium, and in another alternative embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK), or the like.
The foregoing description of the embodiments emphasizes the differences between the embodiments; for the parts that are the same or similar, reference may be made to one another, and the details are not repeated herein for brevity.
It will be appreciated by those skilled in the art that in the above-described method of the specific embodiments, the written order of steps is not meant to imply a strict order of execution but rather should be construed according to the function and possibly inherent logic of the steps.
If the technical solution of the present application involves personal information, a product applying the technical solution clearly informs the user of the personal information processing rules and obtains the individual's voluntary consent before processing the personal information. If the technical solution involves sensitive personal information, a product applying the technical solution obtains the individual's separate consent before processing the sensitive personal information, and also satisfies the requirement of "explicit consent". For example, a clear and conspicuous sign is placed at a personal information collection device such as a camera to inform that the personal information collection range has been entered and that personal information will be collected; if an individual voluntarily enters the collection range, it is deemed that the individual consents to the collection of his or her personal information. Alternatively, on a device that processes personal information, personal authorization is obtained by means such as pop-up messages or by asking the individual to upload personal information, while the personal information processing rules are communicated with conspicuous signs or messages. The personal information processing rules may include information such as the personal information processor, the purpose of processing, the processing manner, and the types of personal information to be processed.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (13)

Translated from Chinese

1. A GPU test task scheduling method, characterized by comprising:
receiving multiple GPU test tasks submitted by a GPU test system, and creating a custom resource instance corresponding to each GPU test task, wherein the custom resource instance corresponding to each GPU test task is used to describe the GPU test task and includes the number of single-expansion containers corresponding to the GPU test task;
determining a target GPU test task that currently needs to be executed among the multiple GPU test tasks, and creating containers for the target GPU test task by using real-time schedulable resources according to the custom resource instance corresponding to the target GPU test task;
starting execution of the target GPU test task by using the created containers, and determining whether unexecuted remaining test cases corresponding to the target GPU test task exist in the GPU test system; and
when the remaining test cases exist, performing container expansion for the target GPU test task by using real-time schedulable resources according to the custom resource instance corresponding to the target GPU test task, wherein the number of containers expanded for the target GPU test task each time is less than or equal to the number of single-expansion containers corresponding to the target GPU test task.

2. The method according to claim 1, characterized in that the custom resource instance corresponding to each GPU test task includes a task priority and a creation time corresponding to the GPU test task; and
the determining of the target GPU test task that currently needs to be executed among the multiple GPU test tasks comprises:
constructing a task queue according to the task priority and the creation time corresponding to each GPU test task, wherein the task queue is used to indicate the execution order of the multiple GPU test tasks; and
determining the GPU test task at the head of the task queue as the target GPU test task.

3. The method according to claim 1, characterized in that the real-time schedulable resources include Kubernetes cluster real-time schedulable resources and public server cluster real-time schedulable resources; and
the method further comprises:
monitoring resource information of each node in the Kubernetes cluster, and determining the amount of the Kubernetes cluster real-time schedulable resources; and
monitoring resource information of each node in the public server cluster, and determining the amount of the public server cluster real-time schedulable resources.

4. The method according to claim 3, characterized in that the custom resource instance corresponding to each GPU test task includes a minimum container number required by the GPU test task and an amount of resources required by each container; and
the creating of containers for the target GPU test task by using real-time schedulable resources according to the custom resource instance corresponding to the target GPU test task comprises:
determining a first container number that the Kubernetes cluster real-time schedulable resources support creating and a second container number that the public server real-time schedulable resources support creating, according to the amount of resources required by each container corresponding to the target GPU test task, the amount of the Kubernetes cluster real-time schedulable resources, and the amount of the public server cluster real-time schedulable resources; and
when the sum of the first container number and the second container number is greater than or equal to the minimum container number required by the target GPU test task, creating containers of the minimum container number for the target GPU test task by using the real-time schedulable resources.

5. The method according to claim 4, characterized in that the creating of containers of the minimum container number for the target GPU test task by using the real-time schedulable resources when the sum of the first container number and the second container number is greater than or equal to the minimum container number required by the target GPU test task comprises:
when the first container number is greater than or equal to the minimum container number, creating containers of the minimum container number for the target GPU test task by using the Kubernetes cluster real-time schedulable resources.

6. The method according to claim 4, characterized in that the creating of containers of the minimum container number for the target GPU test task by using the real-time schedulable resources when the sum of the first container number and the second container number is greater than or equal to the minimum container number required by the target GPU test task comprises:
when the first container number is smaller than the minimum container number, creating containers of the first container number for the target GPU test task by using the Kubernetes cluster real-time schedulable resources; and
creating containers of a third container number for the target GPU test task by using the public server real-time schedulable resources, wherein the third container number is the difference between the minimum container number and the first container number.

7. The method according to claim 4, characterized in that the performing of container expansion for the target GPU test task by using real-time schedulable resources according to the custom resource instance corresponding to the target GPU test task when the remaining test cases exist comprises:
determining a fourth container number that the Kubernetes cluster real-time schedulable resources support creating and a fifth container number that the public server real-time schedulable resources support creating, according to the amount of resources required by each container corresponding to the target GPU test task, the amount of the Kubernetes cluster real-time schedulable resources, and the amount of the public server cluster real-time schedulable resources; and
when the fourth container number is greater than 0 and/or the fifth container number is greater than 0, expanding containers of a sixth container number for the target GPU test task by using the Kubernetes cluster real-time schedulable resources and/or the public server cluster real-time schedulable resources, wherein the sixth container number is less than or equal to the number of single-expansion containers corresponding to the target GPU test task.

8. The method according to claim 2, characterized in that the constructing of the task queue according to the task priority and the creation time corresponding to each GPU test task comprises:
sorting the multiple GPU test tasks by task priority from high to low and, within the same task priority, by creation time from earliest to latest, to obtain the task queue.

9. The method according to claim 2, characterized by further comprising:
removing the target GPU test task from the task queue after starting execution of the target GPU test task.

10. The method according to claim 1, characterized by further comprising:
verifying the custom resource instance corresponding to each GPU test task based on a preset rule.

11. A GPU test task scheduling apparatus, characterized by comprising:
a custom resource module configured to receive multiple GPU test tasks submitted by a GPU test system and create a custom resource instance corresponding to each GPU test task, wherein the custom resource instance corresponding to each GPU test task is used to describe the GPU test task and includes the number of single-expansion containers corresponding to the GPU test task;
a scheduling module configured to determine a target GPU test task that currently needs to be executed among the multiple GPU test tasks, and create containers for the target GPU test task by using real-time schedulable resources according to the custom resource instance corresponding to the target GPU test task;
the scheduling module being further configured to control the created containers to start executing the target GPU test task, and determine whether unexecuted remaining test cases corresponding to the target GPU test task exist in the GPU test system; and
the scheduling module being further configured to, when the remaining test cases exist, perform container expansion for the target GPU test task by using real-time schedulable resources according to the custom resource instance corresponding to the target GPU test task, wherein the number of containers expanded for the target GPU test task each time is less than or equal to the number of single-expansion containers corresponding to the target GPU test task.

12. An electronic device, characterized by comprising:
a processor; and
a memory for storing processor-executable instructions;
wherein the processor is configured to call the instructions stored in the memory to perform the method according to any one of claims 1 to 10.

13. A computer-readable storage medium having computer program instructions stored thereon, characterized in that the computer program instructions, when executed by a processor, implement the method according to any one of claims 1 to 10.
CN202311867994.2A | 2023-12-29 | 2023-12-29 | GPU test task scheduling method and device, electronic equipment and storage medium | Active | CN117827376B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202311867994.2A | CN117827376B (en) | 2023-12-29 | 2023-12-29 | GPU test task scheduling method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202311867994.2A | CN117827376B (en) | 2023-12-29 | 2023-12-29 | GPU test task scheduling method and device, electronic equipment and storage medium

Publications (2)

Publication Number | Publication Date
CN117827376A (en) | 2024-04-05
CN117827376B | 2024-11-15

Family

ID=90518570

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202311867994.2A | Active | CN117827376B (en) | 2023-12-29 | 2023-12-29 | GPU test task scheduling method and device, electronic equipment and storage medium

Country Status (1)

Country | Link
CN (1) | CN117827376B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN113835865A (en) * | 2021-09-30 | 2021-12-24 | 北京金山云网络技术有限公司 | Task deployment method and device, electronic equipment and storage medium
CN114968508A (en) * | 2021-05-06 | 2022-08-30 | 中移互联网有限公司 | Task processing method, device, equipment and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN104063240B (en) * | 2013-05-20 | 2015-09-09 | 腾讯科技(深圳)有限公司 | A kind of map-indication method and device
CN111459678A (en) * | 2020-04-02 | 2020-07-28 | 上海极链网络科技有限公司 | Resource scheduling method and device, storage medium and electronic equipment
CN111753443B (en) * | 2020-07-29 | 2024-10-08 | 哈尔滨工业大学 | A joint test design method for weapons and equipment based on capability accumulation
CN113596871A (en) * | 2021-07-05 | 2021-11-02 | 哲库科技(上海)有限公司 | Test method, server and computer storage medium
CN114116481A (en) * | 2021-11-26 | 2022-03-01 | 上海道客网络科技有限公司 | Kubernetes system-based artificial intelligence algorithm model testing method and system
CN114138439A (en) * | 2021-11-30 | 2022-03-04 | 上海商汤科技开发有限公司 | Task scheduling method and device, electronic equipment and storage medium
CN114924851B (en) * | 2022-05-14 | 2024-09-24 | 云知声智能科技股份有限公司 | Training task scheduling method and device, electronic equipment and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN114968508A (en) * | 2021-05-06 | 2022-08-30 | 中移互联网有限公司 | Task processing method, device, equipment and storage medium
CN113835865A (en) * | 2021-09-30 | 2021-12-24 | 北京金山云网络技术有限公司 | Task deployment method and device, electronic equipment and storage medium

Also Published As

Publication number | Publication date
CN117827376A (en) | 2024-04-05


Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
