CN116860391A - GPU computing power resource scheduling method, device, equipment and medium - Google Patents

GPU computing power resource scheduling method, device, equipment and medium

Info

Publication number
CN116860391A
CN116860391A (application CN202310801620.4A)
Authority
CN
China
Prior art keywords
gpu
target
virtual machine
computing
configuration file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310801620.4A
Other languages
Chinese (zh)
Inventor
田玉凯
刘昌松
徐莉芳
曹绍猛
靳新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peng Cheng Laboratory
Original Assignee
Peng Cheng Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peng Cheng Laboratory
Priority to CN202310801620.4A
Publication of CN116860391A
Status: Pending


Abstract

Translated from Chinese

The present disclosure provides a GPU computing power resource scheduling method, device, equipment and medium. The GPU computing power resource scheduling method includes: receiving a GPU computing power resource request from a target virtual machine; selecting a target GPU device from among multiple GPU devices according to the request; generating a configuration file based on the target GPU device; loading the configuration file with a device management driver, and establishing a pass-through connection between the target virtual machine and the target GPU device with a device transmission driver; transparently transmitting the resource data in the target GPU device to the target virtual machine through the device transmission driver; and, after the target virtual machine has finished computing with the computing power resources of the target GPU device, modifying the configuration file, loading it with the device management driver, and releasing the pass-through connection between the target virtual machine and the target GPU device. Embodiments of the present disclosure can reduce resource loss during GPU computing power resource scheduling and improve scheduling efficiency, and can be applied to cloud computing, artificial intelligence, and other fields.

Description

Translated from Chinese
GPU computing power resource scheduling method, device, equipment and medium

Technical Field

The present disclosure relates to the field of cloud computing, and in particular to a GPU computing power resource scheduling method, device, equipment and medium.

Background

In current cloud computing, additional computing power resources such as GPU computing power resources are needed to improve computing efficiency. Directly binding GPU computing power resources to the computing device incurs high computing costs. In the prior art, the computing power resources of multiple GPU devices are therefore pooled into a resource pool on a server, and users call computing power resources from the pool according to task needs. However, pooling requires partitioning the video memory of all GPU cards, which causes a loss of computing power resources. Moreover, every time a user uses GPU computing power resources, the request must pass through the server to locate where the resources are stored, which reduces data transmission efficiency and also incurs resource loss during transmission.

Summary of the Invention

Embodiments of the present disclosure provide a GPU computing power resource scheduling method, device, equipment and medium, which can automatically call GPU computing power resources while reducing resource loss and improving data transmission efficiency.

According to one aspect of the present disclosure, a GPU computing power resource scheduling method is provided, including:

receiving a GPU computing power resource request from a target virtual machine;

selecting a target GPU device from among multiple GPU devices according to the GPU computing power resource request;

generating a configuration file based on the target GPU device;

loading the configuration file with a device management driver, and establishing a pass-through connection between the target virtual machine and the target GPU device with a device transmission driver;

transparently transmitting the resource data in the target GPU device to the target virtual machine through the device transmission driver;

after the target virtual machine has finished computing with the computing power resources of the target GPU device, modifying the configuration file, loading the configuration file with the device management driver, and releasing the pass-through connection between the target virtual machine and the target GPU device.
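The steps above can be sketched end-to-end in a few lines. This is a minimal illustration only, not the patent's implementation: the device management and transmission drivers are replaced by in-memory stand-ins, and every function and field name is hypothetical:

```python
# Illustrative sketch of the claimed scheduling flow; all names are
# hypothetical stand-ins, not an actual driver implementation.

def select_gpu(request, gpus):
    # Step 220: pick the first idle GPU that satisfies the request.
    for gpu in gpus:
        if gpu["state"] == "idle" and gpu["memory_gb"] >= request["memory_gb"]:
            return gpu
    raise RuntimeError("no GPU available")

def generate_config(vm, gpu, attach):
    # Steps 230 / 260: the configuration file is modeled as a small mapping.
    return {"vm": vm["node_addr"], "gpu": gpu["pci_addr"], "attach": attach}

def load_config(config, gpus):
    # Stand-in for the device management driver loading the config file
    # (steps 240 and 260): bind or release the pass-through connection.
    for gpu in gpus:
        if gpu["pci_addr"] == config["gpu"]:
            gpu["state"] = "passthrough" if config["attach"] else "idle"
            gpu["bound_vm"] = config["vm"] if config["attach"] else None

def schedule(request, vm, gpus):
    gpu = select_gpu(request, gpus)                      # step 220
    load_config(generate_config(vm, gpu, True), gpus)    # steps 230-240
    vm["resource_data"] = gpu["pci_addr"]                # step 250: pass-through
    # ... the VM computes with the GPU here ...
    load_config(generate_config(vm, gpu, False), gpus)   # step 260: release
    return gpu

gpus = [{"pci_addr": "0000:03:00.0", "memory_gb": 16, "state": "idle", "bound_vm": None}]
vm = {"node_addr": "vm-01"}
gpu = schedule({"memory_gb": 8}, vm, gpus)
print(gpu["state"])  # the GPU is back to "idle" after release
```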

According to one aspect of the present disclosure, a GPU computing power resource scheduling device is provided, including:

a receiving unit, configured to receive a GPU computing power resource request from a target virtual machine;

an allocation unit, configured to select a target GPU device from among multiple GPU devices according to the GPU computing power resource request;

a generation unit, configured to generate a configuration file based on the target GPU device;

a pass-through establishment unit, configured to load the configuration file with a device management driver and establish a pass-through connection between the target virtual machine and the target GPU device with a device transmission driver;

a transparent transmission unit, configured to transparently transmit the resource data in the target GPU device to the target virtual machine through the device transmission driver;

a pass-through unbinding unit, configured to, after the target virtual machine has finished computing with the computing power resources of the target GPU device, modify the configuration file, load the configuration file with the device management driver, and release the pass-through connection between the target virtual machine and the target GPU device.

Optionally, multiple GPU devices are bound to the server;

the pass-through establishment unit is further configured to:

obtain the node address of the target virtual machine;

unbind the target GPU device from the server;

load the configuration file with the device management driver, and establish the pass-through connection between the target virtual machine and the target GPU device according to the node address with the device transmission driver.

Optionally, the pass-through establishment unit is further configured to:

obtain the general port address of the target GPU device;

convert the general port address into a first virtual address;

obtain the function port address of the target GPU device;

convert the function port address into a second virtual address;

establish mapping relationships between the general port address and the first virtual address, and between the function port address and the second virtual address.
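The two mappings above amount to a translation table from physical port addresses to virtual addresses. A minimal sketch, with all addresses and the virtual base chosen arbitrarily for illustration (real BAR/port addresses come from the PCI configuration space):

```python
# Illustrative sketch of the address mappings: general port address -> first
# virtual address, function port addresses -> further virtual addresses.

def build_mapping(general_port_addr, function_port_addrs, virt_base=0x7f0000000000):
    """Return a mapping table: physical port address -> virtual address."""
    mapping = {general_port_addr: virt_base}       # the first virtual address
    for i, addr in enumerate(function_port_addrs, start=1):
        mapping[addr] = virt_base + i * 0x1000     # second, third, ... virtual addresses
    return mapping

def translate(mapping, virt_addr):
    """Reverse lookup: which physical port address a virtual address maps to."""
    for phys, virt in mapping.items():
        if virt == virt_addr:
            return phys
    raise KeyError(hex(virt_addr))

m = build_mapping(0xF000_0000, [0xF000_1000, 0xF000_2000])
assert translate(m, 0x7f0000000000) == 0xF000_0000
```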

Optionally, the pass-through establishment unit is further configured to:

create a virtual GPU device in the target virtual machine through the device transmission driver based on the first virtual address and the second virtual address;

establish a pass-through connection between the virtual GPU device and the target GPU device.

Optionally, the transparent transmission unit is further configured to:

encapsulate the resource data in the target GPU device into a resource package, and encrypt the resource package;

transparently transmit the encrypted resource package to the target virtual machine.
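The encapsulate-then-encrypt step can be sketched as below. The patent does not specify an encryption algorithm, so the SHA-256-based XOR keystream here is only a stand-in for a real cipher (e.g. AES), and the function names are hypothetical:

```python
import hashlib
import json

# Illustrative sketch: serialize resource data into a package, then encrypt
# it with a keystream. The keystream is a placeholder, not a real cipher.

def _keystream(key: bytes, n: int) -> bytes:
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def encapsulate_and_encrypt(resource_data: dict, key: bytes) -> bytes:
    package = json.dumps(resource_data, sort_keys=True).encode()  # resource package
    ks = _keystream(key, len(package))
    return bytes(a ^ b for a, b in zip(package, ks))              # encrypted package

def decrypt_and_unpack(blob: bytes, key: bytes) -> dict:
    ks = _keystream(key, len(blob))
    return json.loads(bytes(a ^ b for a, b in zip(blob, ks)))

key = b"shared-secret"
blob = encapsulate_and_encrypt({"pci_addr": "0000:03:00.0"}, key)
assert decrypt_and_unpack(blob, key) == {"pci_addr": "0000:03:00.0"}
```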

Optionally, the pass-through unbinding unit is further configured to:

detect the computing power resource usage of the target GPU device at a predetermined period;

if the target GPU device has stopped computing, detect again after a predetermined time period, and if the target GPU device has still stopped computing, modify the configuration file;

load the configuration file with the device management driver, and release the pass-through connection between the target GPU device and the target virtual machine.
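The double-check before release can be sketched as a small predicate: the GPU must be observed idle on two consecutive probes a predetermined period apart. The probe and wait callbacks are injected for the sketch instead of reading a real device:

```python
# Illustrative sketch of the release check described above: release only if
# the GPU is idle on two consecutive probes separated by a waiting period.

def should_release(probe, wait):
    """probe() returns current GPU utilization; wait() sleeps one period."""
    if probe() > 0:          # still computing: keep the connection
        return False
    wait()                   # predetermined time period
    return probe() == 0      # still stopped -> safe to release

readings = iter([0, 0])      # idle on both probes: release
print(should_release(lambda: next(readings), lambda: None))
```

A transient idle reading followed by renewed activity (e.g. readings `[0, 37]`) correctly keeps the connection, which is the point of the second probe.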

Optionally, the GPU computing power resource scheduling device further includes:

a monitoring unit, configured to monitor the computing power resource usage of the target GPU device in real time;

a visualization unit, configured to visually present the computing power resource usage;

a feature generation unit, configured to generate GPU device computing power resource usage features based on the computing power resource usage.

Optionally, the GPU computing power resource scheduling device further includes:

a training unit, configured to train a target GPU device determination model using the computing power resource usage features;

the allocation unit is further configured to input the GPU computing power resource request into the target GPU device determination model to obtain the target GPU device.
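To make the model-based selection concrete, a trained determination model can be stood in for by a scoring function over the usage features. The feature names and the scoring rule below are hypothetical, not from the patent:

```python
# Illustrative sketch: a hand-written score stands in for the trained
# target-GPU-device determination model. Feature names are hypothetical.

def score(request, features):
    # Prefer GPUs with enough free memory and low recent utilization.
    if features["free_memory_gb"] < request["memory_gb"]:
        return float("-inf")
    return -features["avg_utilization"]

def pick_target_gpu(request, usage_features):
    """usage_features: {pci_addr: feature dict} from the feature generation unit."""
    return max(usage_features, key=lambda addr: score(request, usage_features[addr]))

features = {
    "0000:03:00.0": {"avg_utilization": 0.9, "free_memory_gb": 16},
    "0000:04:00.0": {"avg_utilization": 0.1, "free_memory_gb": 16},
}
print(pick_target_gpu({"memory_gb": 8}, features))  # the less-utilized GPU
```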

According to one aspect of the present disclosure, an electronic device is provided, including a memory and a processor, where the memory stores a computer program, and the processor implements the GPU computing power resource scheduling method described above when executing the computer program.

According to one aspect of the present disclosure, a computer-readable storage medium is provided, which stores a computer program that, when executed by a processor, implements the GPU computing power resource scheduling method described above.

According to one aspect of the present disclosure, a computer program product is provided, which includes a computer program that is read and executed by a processor of a computer device, causing the computer device to perform the GPU computing power resource scheduling method described above.

In the embodiments of the present disclosure, after receiving the GPU computing power resource request from the target virtual machine, the server selects one of multiple GPU devices as the target GPU device allocated to the target virtual machine, generates a configuration file for the target GPU device, loads the configuration file with the device management driver, and establishes a pass-through connection between the target virtual machine and the target GPU device with the device transmission driver. The device management driver manages the target virtual machine and detects the usage of the target GPU device, so that the GPU computing power resource scheduling process can be monitored and emergencies can be handled. Establishing the connection between the target virtual machine and the target GPU device through the configuration file improves connection stability. The device transmission driver improves the efficiency of establishing the pass-through connection, and also transparently transmits the resource data in the target GPU device to the target virtual machine, improving the accuracy of data transmission. Under this framework, the target virtual machine uses a complete GPU device, and the GPU device does not need to partition its video memory, so its full computing performance is retained. At the same time, the target virtual machine can directly access the storage location of the target GPU device without going through the server to find the allocated GPU computing power resources, which improves data transmission efficiency and reduces resource loss. After the target virtual machine finishes computing, the pass-through connection with the target GPU device is released, avoiding the increased computing cost and resource waste caused by continuous binding. Therefore, the embodiments of the present disclosure realize automated GPU resource scheduling, which saves computing costs while improving data transmission efficiency and reducing computing power resource loss during scheduling.

Additional features and advantages of the disclosure will be set forth in the description that follows and, in part, will be apparent from the description or may be learned by practice of the disclosure. The objectives and other advantages of the disclosure may be realized and obtained by the structures particularly pointed out in the written description, claims, and appended drawings.

Brief Description of the Drawings

The drawings are provided for a further understanding of the technical solution of the present disclosure and constitute a part of the specification. Together with the embodiments, they explain the technical solution of the present disclosure and do not limit it.

Figure 1 is an architecture diagram of a GPU computing power resource scheduling method provided by an embodiment of the present disclosure;

Figure 2 is a flow chart of a GPU computing power resource scheduling method according to an embodiment of the present disclosure;

Figure 3 is a specific flow chart of step 240 in Figure 2;

Figure 4 is a specific flow chart of step 240 in Figure 2;

Figure 5 is a specific flow chart of step 240 in Figure 2;

Figure 6 is a schematic diagram of calling GPU computing power resources from the perspective of the target virtual machine;

Figure 7 is a specific flow chart in which detection of computing power resource usage is added after step 240;

Figure 8 is an architecture diagram of an embodiment of the present disclosure with a database processing tool and a visualization server added;

Figure 9 is a specific flow chart of step 250 in Figure 2;

Figure 10 is a specific flow chart of step 260 in Figure 2;

Figure 11 is an exemplary structural schematic diagram of an embodiment of the present disclosure;

Figure 12 is a block diagram of a GPU computing power resource scheduling device according to an embodiment of the present disclosure;

Figure 13 is a terminal structure diagram for the GPU computing power resource scheduling method shown in Figure 2 according to an embodiment of the present disclosure;

Figure 14 is a server structure diagram for the GPU computing power resource scheduling method shown in Figure 2 according to an embodiment of the present disclosure.

Detailed Description

To make the purpose, technical solutions, and advantages of the present disclosure clearer, the present disclosure is further described in detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present disclosure and are not intended to limit it.

Before describing the embodiments of the present disclosure in further detail, the nouns and terms involved in the embodiments are explained; they are subject to the following interpretations:

Cloud computing: providing dynamically scalable computing services on demand over a network. According to user needs, it provides available, convenient, on-demand network access to a configurable shared pool of computing resources (including networks, servers, storage, application software, and services). These resources can be provisioned quickly with little management effort or interaction with the service provider.

Graphics Processing Unit (GPU): a graphics processor that performs image- and graphics-related computation on personal computers, workstations, game consoles, and mobile devices.

Video memory: the memory located on the graphics card. Its main purpose is to store the graphics information to be processed and to cooperate with the GPU in graphics processing.

Virtual machine: a complete computer system simulated by software. Everything that can be done on a physical computer can be done in a virtual machine. When creating a virtual machine on a computer, part of the physical machine's hard disk and memory capacity is used as the hard disk and memory capacity of the virtual machine. Each virtual machine has an independent hard disk and operating system and can be operated like a physical machine.

Peripheral Component Interconnect Express (PCIE): a high-speed serial computer expansion bus standard. It provides high-speed serial point-to-point dual-channel high-bandwidth transmission; each connected device is allocated exclusive channel bandwidth and does not share bus bandwidth. PCIE exists in two forms: the M.2 interface channel form and the PCIE standard slot. PCIE is highly extensible and supports the insertion of many kinds of devices, such as graphics cards, wireless network cards, and sound cards.

API interface: application program interface. An application program interface is a set of definitions, procedures, and protocols through which pieces of computer software communicate with each other. A primary function of an API is to provide a common feature set; programmers develop applications by calling API functions, which lightens the programming task. An API is also a kind of middleware that provides data sharing across different platforms.

Direct Memory Access (DMA): used to provide high-speed data transfer between peripherals and memory, or between memory and memory. Data can be transferred through DMA without CPU control; the hardware opens a channel for direct data transmission between RAM and I/O devices, which saves CPU resources and improves CPU efficiency.

virsh: a command-line tool, written in C, for managing virtual machine virtualization; system administrators can operate virtual machines through virsh commands.
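For a concrete sense of what a pass-through configuration file can look like: libvirt, the virtualization layer that virsh fronts, describes a PCI pass-through device with a `<hostdev>` XML element. The patent does not name libvirt, so this is only one possible realization; the helper function name is hypothetical:

```python
# Hedged sketch: build a libvirt-style <hostdev> fragment for PCI
# passthrough from a PCI address in domain:bus:slot.function form.
# The XML follows libvirt's documented hostdev format.

def hostdev_xml(bdf: str) -> str:
    """bdf: PCI address like '0000:03:00.0' (domain:bus:slot.function)."""
    domain_bus, slot_func = bdf.rsplit(":", 1)
    domain, bus = domain_bus.split(":")
    slot, function = slot_func.split(".")
    return (
        "<hostdev mode='subsystem' type='pci' managed='yes'>\n"
        "  <source>\n"
        f"    <address domain='0x{domain}' bus='0x{bus}' "
        f"slot='0x{slot}' function='0x{function}'/>\n"
        "  </source>\n"
        "</hostdev>"
    )

print(hostdev_xml("0000:03:00.0"))
```

A fragment like this can be attached to a virtual machine with `virsh attach-device` and removed with `virsh detach-device`, mirroring the bind/unbind steps of the scheduling method.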

Figure 1 is a system architecture diagram to which the GPU computing power resource scheduling method according to an embodiment of the present disclosure is applied. It includes a target virtual machine 110, the Internet 120, a server 130, a PCIE 140, and GPU devices 141.

The target virtual machine 110 is a virtual machine that needs to call GPU computing power resources for computation. It includes virtual machines carved out of desktop computers, laptops, PDAs (personal digital assistants), dedicated terminals, computer equipment, and other forms. It may consist of a single device or a collection of devices; for example, multiple devices connected through a local area network and sharing one display device to work cooperatively may together constitute the target virtual machine 110. The target virtual machine 110 can communicate with the Internet 120 in a wired or wireless manner to exchange data.

The server 130 is a computer system that provides the GPU resource scheduling service to the target virtual machine 110. The server 130 may be a single high-performance computer on a network platform, a cluster of high-performance computers, and the like. The server 130 can communicate with the Internet 120 in a wired or wireless manner to exchange data.

The PCIE 140 is a slot on the server 130 into which GPU devices 141 are inserted. Each GPU device 141 has a unique interface address on the PCIE 140.

According to an embodiment of the present disclosure, a GPU computing power resource scheduling method is provided.

GPU computing power resources are managed by the server 130 and used to support the computation of a task in the target virtual machine 110. When the target virtual machine 110 sends a GPU computing power resource request, GPU computing power resources are allocated to the target virtual machine 110 according to the request. In existing solutions, the computing power resources of multiple GPU devices 141 form a resource pool in the server 130, and resources in the pool are called according to the request. At the same time, the video memory of all GPU devices 141 must be partitioned, and computing power resources are lost in this process. Moreover, when calling GPU data, the target virtual machine 110 must first go through the server 130 to find the allocated GPU computing power resources; this not only lowers data transmission efficiency but also causes resource loss. The GPU computing power resource scheduling method of the embodiments of the present disclosure can improve data transmission efficiency and reduce computing power resource loss.

The GPU computing power resource scheduling method provided by the embodiments of the present disclosure is applied to the scheduling of the computing power resources of multiple GPU devices 141 by the server 130. As shown in Figure 2, the method includes:

Step 210: receive a GPU computing power resource request from the target virtual machine;

Step 220: select a target GPU device from among multiple GPU devices according to the GPU computing power resource request;

Step 230: generate a configuration file based on the target GPU device;

Step 240: load the configuration file with the device management driver, and establish a pass-through connection between the target virtual machine and the target GPU device with the device transmission driver;

Step 250: transparently transmit the resource data in the target GPU device to the target virtual machine through the device transmission driver;

Step 260: after the target virtual machine has finished computing with the computing power resources of the target GPU device, modify the configuration file, load the configuration file with the device management driver, and release the pass-through connection between the target virtual machine and the target GPU device.

To ensure that the GPU computing power resource scheduling method provided by the embodiments of the present disclosure can be implemented smoothly, in one embodiment, before receiving the GPU computing power resource request from the target virtual machine 110, the server 130 system needs to be configured, specifically including:

Checking whether the server 130 system can connect to the target virtual machine 110 normally. For example, when the target virtual machine 110 is a virtual machine, it is necessary to check whether the server 130 system can normally create, start, and connect to virtual machines. It is also checked whether the server 130 system supports I/O device virtualization, to ensure that the resources of the target GPU device can be allocated directly to the target virtual machine 110.

Enabling kernel parameters, which ensures that, when performing memory mapping, the target virtual machine 110 can quickly and accurately access large amounts of contiguous physical memory in the server 130, as well as the server 130's high memory addresses. If the kernel parameters are not enabled, the server 130 cannot allocate the large amounts of physically contiguous memory that some devices require. At the same time, most servers 130 can only address 32 bits when performing direct memory access; to reach data in high memory, a temporary data buffer must be allocated in low memory on behalf of the high memory, the data in high memory must be copied into that buffer, and any access to the high memory address must go through the temporary buffer. The temporary buffer occupies a lot of extra memory, and copying the data adds further load on the server 130's memory. After the kernel parameters are enabled, when the target virtual machine 110 subsequently accesses the target GPU device, it can directly access the target GPU device's physical address in the server 130 according to the mapping table between virtual memory and physical memory, whether the address is in low or high memory. In general, enabling kernel parameters improves the efficiency and accuracy of memory address access, saves memory space, and improves memory operation efficiency.

Obtaining the bus address of the GPU device 141 on the server 130. The bus address is the address at which the server 130 can directly access the GPU device 141. When multiple GPU devices 141 share a large memory space on the server 130, obtaining the bus address of a GPU device 141 further includes: obtaining the address of the memory group in which the GPU device 141 resides, and then obtaining the GPU's bus address. The port address of the GPU device 141 stored in the register can be determined from the bus address of the GPU device 141.

Determining the number of function ports of the GPU device 141 based on its port address. The number of function ports indicates how many functions the GPU device 141 contains; for example, some GPU devices 141 can exchange data with other devices through a USB port. Each GPU device 141 has only one bus address but can contain multiple function ports. The function ports are not independent of one another; all of them must cooperate for the GPU device 141 to operate normally. Therefore, in the subsequent address mapping process, all function port addresses need to be mapped to ensure that the GPU device 141 can operate normally.
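In PCI terms, the function ports of one physical device share the same domain:bus:slot address and differ only in the function number. A small sketch, with illustrative addresses, shows how functions can be grouped so that all of them are mapped together as required above:

```python
# Illustrative sketch: group full PCI addresses by their shared bus
# address (domain:bus:slot); entries in one group are the function ports
# of a single device and must be handled together.

def group_functions(pci_addrs):
    groups = {}
    for addr in pci_addrs:
        bus_addr, func = addr.rsplit(".", 1)
        groups.setdefault(bus_addr, []).append(int(func))
    return groups

addrs = ["0000:03:00.0", "0000:03:00.1", "0000:04:00.0"]
print(group_functions(addrs))  # one device with two function ports, one with one
```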

在进行GPU算力资源调度之前,对服务器130系统进行以上配置过程,可以确保在后续进行GPU算力资源调度时,服务器130可以高效稳定的运行。Before performing GPU computing resource scheduling, performing the above configuration process on the server 130 system can ensure that the server 130 can operate efficiently and stably during subsequent GPU computing resource scheduling.

步骤210中,目标虚拟机110的GPU算力资源请求中包含目标虚拟机110的节点地址、发起GPU算力资源请求的时间、与目标虚拟机110中待处理事务的类型。In step 210 , the GPU computing resource request of the target virtual machine 110 includes the node address of the target virtual machine 110 , the time when the GPU computing resource request is initiated, and the type of transaction to be processed in the target virtual machine 110 .

在目标虚拟机110中设置一个监测代理节点,用于根据目标虚拟机110的待处理事务生成GPU算力资源请求,并将GPU算力资源请求通过服务器130上用于接收请求的API接口发送给服务器130。当服务器130为目标虚拟机110与目标GPU设备建立直通连接后,监测代理节点可以用来监测目标GPU设备的使用情况。当目标虚拟机110使用目标GPU设备完成计算后,监测代理节点生成GPU算力资源释放请求,并将该请求通过上述用于接收请求的API接口发送给服务器130,由服务器130进行后续的解除连接操作。设置监测代理节点的优点是,可以代替目标虚拟机110对目标GPU设备进行管理,避免占用目标虚拟机110的计算空间,提高了目标虚拟机110利用目标GPU设备处理待处理事务的效率。A monitoring agent node is set up in the target virtual machine 110 to generate a GPU computing power resource request according to the pending transactions of the target virtual machine 110, and to send the request to the server 130 through the API interface on the server 130 that receives such requests. After the server 130 establishes the pass-through connection between the target virtual machine 110 and the target GPU device, the monitoring agent node can be used to monitor the usage of the target GPU device. Once the target virtual machine 110 has completed its computation on the target GPU device, the monitoring agent node generates a GPU computing power resource release request and sends it to the server 130 through the same API interface, and the server 130 performs the subsequent disconnection operation. The advantage of setting up a monitoring agent node is that it manages the target GPU device on behalf of the target virtual machine 110, avoids occupying the computing space of the target virtual machine 110, and improves the efficiency with which the target virtual machine 110 processes its pending transactions on the target GPU device.
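The patent does not specify a request format; as an illustrative sketch, the agent might serialize the three fields named in step 210 (node address, request time, transaction type) into a JSON body before POSTing it to the server's API. The endpoint path and field names below are assumptions:

```shell
#!/bin/bash
# Hypothetical request builder for the monitoring agent. Field names and
# the /api/gpu/request endpoint are illustrative, not from the patent text.
build_request() {
    node="$1"; task_type="$2"
    ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
    printf '{"node":"%s","time":"%s","task_type":"%s"}' "$node" "$ts" "$task_type"
}

body=$(build_request "i-0000002D" "3d_render")
echo "$body"
# The agent would then send it, e.g.:
#   curl -s -X POST -H 'Content-Type: application/json' \
#        -d "$body" http://server130/api/gpu/request
```

The release request at the end of the computation could reuse the same builder with a different transaction type.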

在步骤220中,服务器130在通过用于接收请求的API接收到来自目标虚拟机110的GPU算力资源请求后,从多个GPU设备141中选择一个作为目标GPU设备。In step 220 , after receiving the GPU computing resource request from the target virtual machine 110 through the API for receiving the request, the server 130 selects one of the plurality of GPU devices 141 as the target GPU device.

选择目标GPU设备主要基于GPU算力资源请求中的待处理事务。根据待处理事务的类型、与计算量等特征选择合适的GPU设备141。例如,待处理事务是对游戏中的场景界面图进行3D渲染,那么目标GPU设备需要能够支持3D渲染的运行,同时还要保证算力资源可以满足场景界面图中大量数据进行计算的需要。The target GPU device is selected mainly based on the pending transactions in the GPU computing power resource request. A suitable GPU device 141 is selected according to characteristics such as the type of the pending transactions and the amount of computation. For example, if the pending transaction is 3D rendering of a scene interface diagram in a game, the target GPU device must be able to support 3D rendering, and at the same time its computing power resources must be sufficient for computing the large amount of data in the scene interface diagram.

在一个实施例中,根据GPU算力资源请求在多个GPU设备中选择目标GPU设备,包括:将GPU算力资源请求输入目标GPU设备确定模型,得到目标GPU设备。In one embodiment, selecting a target GPU device among multiple GPU devices based on a GPU computing power resource request includes: inputting the GPU computing power resource request into a target GPU device determination model to obtain the target GPU device.

目标GPU设备确定模型可以是一个机器学习模型,用于根据GPU算力资源请求的特征从多个GPU设备中确定目标GPU设备。目标GPU设备确定模型是根据历史调度中的GPU使用情况特征进行训练的,在后续步骤中将详细说明。The target GPU device determination model may be a machine learning model used to determine the target GPU device from multiple GPU devices based on the characteristics of the GPU computing resource request. The target GPU device determination model is trained based on GPU usage characteristics in historical schedules, which will be detailed in subsequent steps.

将GPU算力资源请求输入目标GPU设备确定模型中,模型会首先从GPU算力资源请求提取出多个可供模型识别的特征。例如,从GPU算力资源请求中提取出待处理请求类型、数据规模、期望精准度等。基于上述特征,目标GPU设备确定模型可以确定出适用于目标虚拟机110的GPU算力资源请求的目标GPU设备。Input the GPU computing resource request into the target GPU device determination model. The model will first extract multiple features from the GPU computing resource request that can be identified by the model. For example, the type of request to be processed, data size, expected accuracy, etc. are extracted from the GPU computing resource request. Based on the above characteristics, the target GPU device determination model can determine the target GPU device suitable for the GPU computing resource request of the target virtual machine 110 .

由于目标GPU设备确定模型是基于历史调度中GPU设备141的使用情况的经验训练生成的,因此,利用该模型确定目标GPU设备的优点是,使得目标GPU设备可以满足目标虚拟机110中待处理事务的处理需求,提高了确定目标GPU设备的准确性。Since the target GPU device determination model is generated by training on the empirical usage of the GPU devices 141 in historical scheduling, the advantage of using this model is that the chosen target GPU device can satisfy the processing requirements of the pending transactions in the target virtual machine 110, improving the accuracy of determining the target GPU device.

步骤230中,在确定目标GPU设备后,服务器130需要生成目标GPU设备的配置文件。In step 230, after determining the target GPU device, the server 130 needs to generate a configuration file of the target GPU device.

生成配置文件的作用是为了可以让目标虚拟机110使用目标GPU设备。因此,配置文件需要根据目标GPU设备的多项属性值生成,以保证目标虚拟机110与目标GPU设备可以正常连通。The purpose of generating the configuration file is to enable the target virtual machine 110 to use the target GPU device. Therefore, the configuration file needs to be generated based on multiple attribute values of the target GPU device to ensure that the target virtual machine 110 and the target GPU device can be connected normally.

用于生成配置文件的多项属性值至少包括:目标GPU设备的网域地址、目标GPU设备的总线地址、目标GPU设备的接口位置、与目标GPU设备的功能端口数量。The attribute values used to generate the configuration file include at least: the network domain address of the target GPU device, the bus address of the target GPU device, the interface location of the target GPU device, and the number of functional ports of the target GPU device.

网域地址用于在进行数据传输时对目标GPU设备进行定位的定位标识。总线地址是服务器130可以直接访问GPU设备141的内存地址。接口位置是指目标GPU设备在PCIE140设备中的接口编号,由于PCIE140设备中有多个插槽,对应多个接口编号。接口位置指示了目标GPU设备具体插在哪个插槽中。The domain address is used to locate the target GPU device during data transmission. The bus address is a memory address where the server 130 can directly access the GPU device 141 . The interface location refers to the interface number of the target GPU device in the PCIE140 device. Since there are multiple slots in the PCIE140 device, it corresponds to multiple interface numbers. The interface location indicates which slot the target GPU device is plugged into.

功能端口数量用来表示目标GPU设备中具体包含了几个功能。在生成配置文件时,需要确定目标GPU设备的端口数量,并将每一个端口的地址加入到配置文件中,以便目标虚拟机110可以识别目标GPU设备上的全部功能端口,保证目标GPU设备可以正常运行。The number of function ports is used to indicate how many functions are included in the target GPU device. When generating the configuration file, it is necessary to determine the number of ports of the target GPU device and add the address of each port to the configuration file so that the target virtual machine 110 can identify all functional ports on the target GPU device and ensure that the target GPU device can function normally. run.

上述四个属性值在xml文件中可以表示为:The above four attribute values can be expressed in the xml file as:

domain="0x"${result:4:4}#4bitdomain="0x"${result:4:4}#4bit

bus="0x"${result:9:2}#2bitbus="0x"${result:9:2}#2bit

slot="0x"${result:12:2}#2bitslot="0x"${result:12:2}#2bit

function="0x"${result: -1}#1bitfunction="0x"${result: -1}#1bit

其中,domain是目标GPU设备的网域地址,其属性值长度为4比特;bus为目标GPU设备的总线地址,其属性值长度为2比特;slot为目标GPU设备的接口位置,其属性值长度为2比特;function为目标GPU设备的功能端口数量,其属性值长度为1比特。Here, domain is the network domain address of the target GPU device, and its attribute value is 4 bits long; bus is the bus address of the target GPU device, and its attribute value is 2 bits long; slot is the interface location of the target GPU device, and its attribute value is 2 bits long; function is the number of functional ports of the target GPU device, and its attribute value is 1 bit long.

在前述实施例的对服务器130的配置过程中,可以获取与服务器130连接的多个GPU设备141的上述属性值,因此,直接调取目标GPU的属性值,用于生成配置文件。配置文件可以是xml文件。During the configuration process of the server 130 in the foregoing embodiment, the above attribute values of multiple GPU devices 141 connected to the server 130 can be obtained. Therefore, the attribute values of the target GPU are directly retrieved for generating the configuration file. The configuration file can be an xml file.
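Putting the four substring extractions together: given a libvirt node-device name such as `pci_0000_3b_00_0` in `result`, the fields can be cut out exactly as in the listing and written into a hostdev `<address>` element. The surrounding `<hostdev>` wrapper shown here is the usual libvirt form and is an assumption, since the patent only lists the four attributes:

```shell
#!/bin/bash
# Extract domain/bus/slot/function from a libvirt node-device name of the
# form pci_0000_3b_00_0, using the same substring offsets as the listing.
# Note: the last field needs "${result: -1}" (with a space); "${result:-1}"
# is default-value expansion, not a substring.
result="pci_0000_3b_00_0"   # sample name; real names come from `virsh nodedev-list`

domain="0x${result:4:4}"    # 4 characters
bus="0x${result:9:2}"       # 2 characters
slot="0x${result:12:2}"     # 2 characters
function="0x${result: -1}"  # 1 character

cat <<EOF
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='$domain' bus='$bus' slot='$slot' function='$function'/>
  </source>
</hostdev>
EOF
```

The emitted xml fragment is what the device management driver later loads to connect the target virtual machine 110 with the target GPU device.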

步骤240中,首先利用设备管理驱动加载配置文件。设备管理驱动用于管理与服务器130建立连接的目标虚拟机110。针对目标虚拟机110,所使用的设备管理驱动可以是Libvirt。Libvirt是用于对虚拟机进行管理的工具和API。下面以Libvirt为例,示例性地说明本步骤。In step 240, the configuration file is first loaded using the device management driver. The device management driver is used to manage the target virtual machine 110 that establishes a connection with the server 130. For the target virtual machine 110, the device management driver used may be Libvirt. Libvirt is a set of tools and an API for managing virtual machines. The following takes Libvirt as an example to illustrate this step.

Libvirt可以为连接的目标虚拟机110提供API以实现在服务器130上对目标虚拟机110的操作与管理。Libvirt可以被本地调用,也就是可以管理服务器130本地的虚拟机。也可以被远程调用,也就是虚拟机的运行程序与域不在本服务器130上。Libvirt支持多种通用的远程协议,也就是说在目标虚拟机110与GPU设备141直通连接的过程中,Libvirt可以为目标虚拟机110提供多种管理服务,例如:命令行界面管理、文件传输管理、数据加密管理等。Libvirt can provide an API for the connected target virtual machine 110 to implement operation and management of the target virtual machine 110 on the server 130. Libvirt can be called locally, that is, it can manage virtual machines local to the server 130; it can also be called remotely, that is, when the virtual machine's program and domain are not on this server 130. Libvirt supports a variety of common remote protocols, which means that during the pass-through connection between the target virtual machine 110 and the GPU device 141, Libvirt can provide the target virtual machine 110 with a variety of management services, such as command-line interface management, file transfer management, and data encryption management.

Libvirt上还包含了一个守护进程,Libvirtd,可以实现对目标虚拟机110进行实时监测,避免被服务器130中产生的其他信息打断。当目标虚拟机110与GPU设备141的直通连接出现问题,通过守护进程可以及时感知错误并解决问题。其他上层管理工具,例如界面管理工具、数据管理工具等,也可以通过守护程序连接到目标虚拟机110。守护程序执行来自其他上层管理工具的操作指令对目标虚拟机110进行操作。Libvirt also includes a daemon, Libvirtd, which monitors the target virtual machine 110 in real time to avoid interruption by other information generated in the server 130. When a problem occurs in the pass-through connection between the target virtual machine 110 and the GPU device 141, the daemon can detect the error in time and resolve it. Other upper-layer management tools, such as interface management tools and data management tools, can also connect to the target virtual machine 110 through the daemon. The daemon executes operation instructions from these upper-layer management tools to operate the target virtual machine 110.

在Libvirtd中使用命令行工具virsh对目标虚拟机110进行管理,例如虚拟机的启动、关闭、重启、与迁移等,还可以收集目标虚拟机110与服务器130的配置、和资源使用情况。Use the command line tool virsh in Libvirtd to manage the target virtual machine 110, such as starting, shutting down, restarting, and migrating the virtual machine. You can also collect the configuration and resource usage of the target virtual machine 110 and server 130.

在设备管理驱动加载配置文件时,可以通过virsh命令载入配置文件,通过运行配置文件,建立目标虚拟机110与目标GPU设备的直通连接。When the device management driver loads the configuration file, the configuration file can be loaded via the virsh command; by running the configuration file, the pass-through connection between the target virtual machine 110 and the target GPU device is established.

在一个实施例中,多个GPU设备141与服务器130是直接绑定的,因此,如图3所示,利用设备管理驱动加载配置文件,利用设备传输驱动建立目标虚拟机与目标GPU设备的直通连接,包括:In one embodiment, multiple GPU devices 141 are directly bound to the server 130. Therefore, as shown in Figure 3, the device management driver is used to load the configuration file, and the device transmission driver is used to establish a direct connection between the target virtual machine and the target GPU device. Connections, including:

步骤310、获取目标虚拟机的节点地址;Step 310: Obtain the node address of the target virtual machine;

步骤320、解除目标GPU设备与服务器的绑定;Step 320: Unbind the target GPU device from the server;

步骤330、利用设备管理驱动加载配置文件,利用设备传输驱动按照节点地址建立目标虚拟机与目标GPU设备的直通连接。Step 330: Use the device management driver to load the configuration file, and use the device transmission driver to establish a direct connection between the target virtual machine and the target GPU device according to the node address.

目标虚拟机110的节点地址通过目标虚拟机110的GPU算力资源请求可以直接获取。The node address of the target virtual machine 110 can be directly obtained through the GPU computing resource request of the target virtual machine 110 .

通过获取的目标GPU设备的总线地址,找到目标GPU设备的总端口地址,将目标GPU设备的总端口地址与服务器130进行解除绑定。解除绑定后,目标GPU设备可以直接与目标虚拟机110建立连通关系。Through the acquired bus address of the target GPU device, the total port address of the target GPU device is found, and the total port address of the target GPU device is unbound from the server 130 . After unbinding, the target GPU device can directly establish a connectivity relationship with the target virtual machine 110 .
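The unbinding step can be sketched with libvirt's node-device interface; `virsh nodedev-detach` is the usual command for releasing a PCI device from the host driver, though the patent does not name the mechanism. The helper below only derives the node-device name from a bus address and, by default, prints the commands instead of running them, so the sketch is hardware-independent:

```shell
#!/bin/bash
# Derive the libvirt node-device name (pci_DDDD_BB_SS_F) from a PCI bus
# address like 0000:3b:00.0, then show the detach/reattach commands.
# RUN=echo keeps this a dry run; set RUN= to execute for real.
RUN=${RUN:-echo}

nodedev_name() {
    # 0000:3b:00.0 -> pci_0000_3b_00_0
    printf 'pci_%s\n' "$1" | tr ':.' '__'
}

dev=$(nodedev_name "0000:3b:00.0")
$RUN virsh nodedev-detach "$dev"     # unbind from the host (server 130)
# ... device is now free to be passed through to the target VM ...
$RUN virsh nodedev-reattach "$dev"   # rebind to the host after release
```

`nodedev-reattach` corresponds to the rebinding described in step 260, where the GPU is returned to the server after the virtual machine releases it.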

在设备管理驱动加载配置文件时,加载目标GPU设备的网域地址、总线地址、接口位置、与功能端口数量,基于网域地址、总线地址、与接口位置获取目标GPU设备的位置,以便将目标GPU设备映射到目标虚拟机110。基于功能端口数量,依次遍历目标GPU上的功能端口,获取每个功能端口的位置,以便将所有功能端口映射到目标虚拟机110。When the device management driver loads the configuration file, load the domain address, bus address, interface location, and number of functional ports of the target GPU device, and obtain the location of the target GPU device based on the domain address, bus address, and interface location, so that the target The GPU device is mapped to the target virtual machine 110. Based on the number of function ports, the function ports on the target GPU are traversed in sequence to obtain the location of each function port, so that all function ports are mapped to the target virtual machine 110.

具体实现的源码如下:The specific implementation source code is as follows:

在上述源码中,通过result_deal()方法加载配置文件。Node是指目标虚拟机节点地址,例如:i-0000002D。读取目标虚拟机节点地址,获取xml配置文件,通过virsh attach命令将目标虚拟机节点与配置文件连通在一起。In the above source code, the configuration file is loaded through the result_deal() method. Node refers to the target virtual machine node address, for example: i-0000002D. The target virtual machine node address is read, the xml configuration file is obtained, and the target virtual machine node is connected with the configuration file through the virsh attach command.
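The listing itself is not reproduced in the text; a plausible reconstruction of result_deal(), assuming `virsh attach-device` semantics, is shown below. The `--persistent` flag matches the startup-parameter behaviour described next; the `VIRSH` indirection is only there so the sketch can run without libvirt installed:

```shell
#!/bin/bash
# Plausible reconstruction of result_deal(): attach the generated xml
# configuration to the target VM node. The patent's listing is absent
# from the text, so the details here are assumptions.
VIRSH=${VIRSH:-virsh}   # overridable so the sketch can run without libvirt

result_deal() {
    node="$1"           # target VM node address, e.g. i-0000002D
    xml="$2"            # path to the generated hostdev xml
    # --persistent keeps the binding across VM/server restarts.
    $VIRSH attach-device "$node" "$xml" --persistent
}

VIRSH=echo result_deal "i-0000002D" "/tmp/gpu_hostdev.xml"
```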

在建立直通连接后,在配置文件中设置启动参数,用来表示目标虚拟机与目标GPU设备绑定。在后续的服务器或虚拟机重新启动后,目标虚拟机与目标GPU设备也会按照配置文件进行绑定。这提高了目标虚拟机与目标GPU设备之间直通连接的稳定性。After establishing a pass-through connection, set startup parameters in the configuration file to indicate that the target virtual machine is bound to the target GPU device. After subsequent server or virtual machine restarts, the target virtual machine and target GPU device will also be bound according to the configuration file. This improves the stability of the pass-through connection between the target virtual machine and the target GPU device.

本实施例的优点是,通过解除目标GPU设备与服务器130的绑定关系,防止服务器130占用目标GPU设备的算力资源,避免服务器130对目标虚拟机110访问目标GPU设备的过程产生影响。The advantage of this embodiment is that by unbinding the target GPU device and the server 130, the server 130 is prevented from occupying the computing resources of the target GPU device, and the server 130 is prevented from affecting the process of the target virtual machine 110 accessing the target GPU device.

在一个实施例中,如图4所示,利用设备管理驱动加载配置文件,利用设备传输驱动建立目标虚拟机与目标GPU设备的直通连接,包括:In one embodiment, as shown in Figure 4, the device management driver is used to load the configuration file, and the device transmission driver is used to establish a direct connection between the target virtual machine and the target GPU device, including:

步骤410、获取目标GPU设备的总端口地址;Step 410: Obtain the general port address of the target GPU device;

步骤420、将总端口地址转化为第一虚拟地址;Step 420: Convert the general port address into the first virtual address;

步骤430、获取目标GPU设备的功能端口地址;Step 430: Obtain the function port address of the target GPU device;

步骤440、将功能端口地址转换为第二虚拟地址;Step 440: Convert the functional port address to the second virtual address;

步骤450、建立总端口地址与第一虚拟地址、功能端口地址与第二虚拟地址之间的映射关系。Step 450: Establish a mapping relationship between the general port address and the first virtual address, and the functional port address and the second virtual address.

目标GPU设备的总端口地址是目标GPU设备用来直接与目标GPU进行数据交换的物理地址。利用内存管理单元将总端口地址转化为第一虚拟地址。功能端口地址是目标GPU设备中的功能对应的端口地址,可以用来与目标GPU中对应的功能进行数据交换。由于虚拟机无法识别物理地址,如果直接将物理地址传输给虚拟机,那么虚拟机无法存储物理地址,导致虚拟机不知道具体哪一个GPU设备141与它连通。当要进行直接存储器访问时,可能会破坏内存,也可能会访问错误的GPU设备141。因此,将总端口地址与功能端口地址通过内存管理单元转换成为目标虚拟机110可以识别的第一虚拟地址与第二虚拟地址,并建立总端口地址与第一虚拟地址、功能端口地址与第二虚拟地址之间的映射关系,使得目标虚拟机110可以根据映射关系正确且高效地访问到目标GPU设备。The total port address of the target GPU device is the physical address that the target GPU device uses to exchange data directly. The memory management unit converts the total port address into the first virtual address. The functional port address is the port address corresponding to a function in the target GPU device and can be used to exchange data with that function. Since a virtual machine cannot recognize physical addresses, if a physical address were transmitted to it directly, the virtual machine could not store it and would not know which specific GPU device 141 is connected to it; when direct memory access is performed, memory might be corrupted or the wrong GPU device 141 might be accessed. Therefore, the total port address and the functional port addresses are converted by the memory management unit into the first virtual address and the second virtual address that the target virtual machine 110 can recognize, and mapping relationships are established between the total port address and the first virtual address and between the functional port addresses and the second virtual address, so that the target virtual machine 110 can access the target GPU device correctly and efficiently according to these mappings.

在建立直通连接时,启动目标虚拟机110对应的设备传输驱动。设备传输驱动用于将目标GPU设备与其所有的功能端口映射到目标虚拟机110上。设备传输驱动可以是VFIO驱动。在一个实施例中,如图5所示,利用设备传输驱动建立目标虚拟机110与目标GPU设备的直通连接,包括:When establishing a pass-through connection, the device transmission driver corresponding to the target virtual machine 110 is started. The device transfer driver is used to map the target GPU device and all its function ports to the target virtual machine 110 . The device transmission driver can be a VFIO driver. In one embodiment, as shown in Figure 5, a device transfer driver is used to establish a direct connection between the target virtual machine 110 and the target GPU device, including:

步骤510、基于第一虚拟地址与第二虚拟地址,利用设备传输驱动在目标虚拟机中创建一个虚拟GPU设备;Step 510: Based on the first virtual address and the second virtual address, use the device transmission driver to create a virtual GPU device in the target virtual machine;

步骤520、建立虚拟GPU设备与目标GPU设备的直通连接。Step 520: Establish a direct connection between the virtual GPU device and the target GPU device.

设备传输驱动可以在目标虚拟机110上基于第一虚拟地址与第二虚拟地址创建一个虚拟GPU设备141,虚拟GPU设备141在目标虚拟机110上的地址实际上就是第一虚拟地址,目标虚拟机110上也包括与目标GPU设备相同的功能端口,功能端口在目标虚拟机110上的地址实际上就是第二虚拟地址。建立目标虚拟机110与目标GPU设备之间的直通连接,实际上是建立虚拟GPU设备141与目标GPU设备的直通连接。The device transmission driver can create a virtual GPU device 141 on the target virtual machine 110 based on the first virtual address and the second virtual address. The address of the virtual GPU device 141 on the target virtual machine 110 is in fact the first virtual address; the target virtual machine 110 also includes the same functional ports as the target GPU device, and the addresses of those functional ports on the target virtual machine 110 are in fact the second virtual address. Establishing a pass-through connection between the target virtual machine 110 and the target GPU device is in fact establishing a pass-through connection between the virtual GPU device 141 and the target GPU device.

建立虚拟GPU设备与目标GPU设备的直通连接的过程通过以下源码实现:The process of establishing a direct connection between the virtual GPU device and the target GPU device is implemented through the following source code:

在上述源码中,首先获取目标GPU设备的总线地址;通过访问总线地址获取目标GPU设备的功能端口数量;针对每一个功能端口,生成对应的xml配置。对于每一个功能端口,都需要配置domain(网域地址)、bus(总线地址)、slot(接口位置)、与function(功能端口数量)。在生成功能端口对应的xml配置后,将xml配置添加到目标GPU设备所处内存区间的xml文件(iommuGroup_number.xml)中。在对功能端口进行配置后,利用设备传输驱动(vfio),将功能端口映射到目标虚拟机中,使得生成的虚拟GPU设备的每一个属性值地址与功能端口的属性值地址相同。In the above source code, first obtain the bus address of the target GPU device; obtain the number of functional ports of the target GPU device by accessing the bus address; and generate the corresponding xml configuration for each functional port. For each functional port, domain (domain address), bus (bus address), slot (interface location), and function (number of functional ports) need to be configured. After generating the xml configuration corresponding to the function port, add the xml configuration to the xml file (iommuGroup_number.xml) in the memory area of the target GPU device. After configuring the function port, use the device transmission driver (vfio) to map the function port to the target virtual machine, so that each attribute value address of the generated virtual GPU device is the same as the attribute value address of the function port.
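This listing is likewise absent from the text; a sketch consistent with the description (enumerate the functions behind one bus address, emit one `<address>` entry per function into iommuGroup_number.xml) might look as follows — the file layout and parameter names are assumptions:

```shell
#!/bin/bash
# Sketch: generate one xml <address> entry per function of a GPU at a
# given domain/bus/slot, appending them to iommuGroup_number.xml as the
# description says. The function count is a parameter here; on real
# hardware it would be discovered via the bus address (e.g. with lspci).
gen_iommu_group_xml() {
    domain="$1" bus="$2" slot="$3" nfunc="$4" out="$5"
    : > "$out"
    f=0
    while [ "$f" -lt "$nfunc" ]; do
        cat >> "$out" <<EOF
<address domain='$domain' bus='$bus' slot='$slot' function='0x$f'/>
EOF
        f=$((f + 1))
    done
}

gen_iommu_group_xml "0x0000" "0x3b" "0x00" 3 "iommuGroup_number.xml"
cat iommuGroup_number.xml
```

After this file is assembled, the vfio driver would map each listed function into the target virtual machine, as described above.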

目标虚拟机110在想要调取连通的目标GPU设备中的算力资源进行计算时,可以直接访问自己空间内的虚拟GPU设备141的第一虚拟地址与第二虚拟地址,根据第一虚拟地址与总端口地址、第二虚拟地址与功能端口地址之间的映射关系,可以直接访问到目标GPU设备,无需再向服务器130发送访问目标GPU设备的请求。When the target virtual machine 110 wants to call the computing resources in the connected target GPU device for calculation, it can directly access the first virtual address and the second virtual address of the virtual GPU device 141 in its own space. According to the first virtual address The mapping relationship between the general port address, the second virtual address and the functional port address allows direct access to the target GPU device without sending a request to the server 130 to access the target GPU device.

在本实施例中,通过虚拟GPU设备141与目标GPU设备的直通连接,目标虚拟机110可以直接访问真实的目标GPU的硬件设备,不需要再向服务器130发送访问请求,减少了目标虚拟机110访问目标GPU的过程中产生的资源损耗,提高了访问效率。同时,每一个目标虚拟机110上的虚拟GPU设备141直通连接唯一一个目标GPU设备,通过虚拟GPU设备141的地址可以直接访问到目标GPU设备的物理地址,目标虚拟机110无法操作与其他虚拟机连接的GPU设备141,避免虚拟机访问错误的GPU设备141,提高了GPU算力资源调度过程中数据传输的独立性。In this embodiment, through the pass-through connection between the virtual GPU device 141 and the target GPU device, the target virtual machine 110 can directly access the real hardware of the target GPU without sending access requests to the server 130, which reduces the resource consumption generated while the target virtual machine 110 accesses the target GPU and improves access efficiency. At the same time, the virtual GPU device 141 on each target virtual machine 110 is pass-through connected to exactly one target GPU device, and the physical address of the target GPU device can be reached directly through the address of the virtual GPU device 141; the target virtual machine 110 cannot operate GPU devices 141 connected to other virtual machines, which prevents a virtual machine from accessing the wrong GPU device 141 and improves the independence of data transmission during GPU computing power resource scheduling.

在目标虚拟机110的角度,从发起GPU算力资源请求到建立与目标GPU设备的直通连接的过程如图6所示,目标虚拟机110向服务器130上的设备管理驱动发起GPU算力资源请求;设备管理驱动获取确定好的目标GPU设备的配置文件,并加载目标GPU设备的配置文件;加载完成后,启动目标虚拟机110对应的设备传输驱动;设备传输驱动将目标GPU设备映射在目标虚拟机110上,建立目标GPU设备与目标虚拟机110的直通连接。From the perspective of the target virtual machine 110, the process from initiating a GPU computing resource request to establishing a direct connection with the target GPU device is shown in Figure 6. The target virtual machine 110 initiates a GPU computing resource request to the device management driver on the server 130. ; The device management driver obtains the determined configuration file of the target GPU device and loads the configuration file of the target GPU device; after the loading is completed, starts the device transfer driver corresponding to the target virtual machine 110; the device transfer driver maps the target GPU device to the target virtual machine On the machine 110, establish a direct connection between the target GPU device and the target virtual machine 110.

在一个实施例中,如图7所示,在利用设备管理驱动加载配置文件,利用设备传输驱动建立目标虚拟机与目标GPU设备的直通连接之后,GPU算力资源调度方法还包括:In one embodiment, as shown in Figure 7, after using the device management driver to load the configuration file and using the device transmission driver to establish a direct connection between the target virtual machine and the target GPU device, the GPU computing resource scheduling method also includes:

步骤710、实时监测目标GPU设备的算力资源使用情况;Step 710: Monitor the computing resource usage of the target GPU device in real time;

步骤720、将算力资源使用情况可视化呈现;Step 720: Visually present the usage of computing resources;

步骤730、基于算力资源使用情况生成GPU设备算力资源使用特征。Step 730: Generate GPU device computing resource usage characteristics based on computing resource usage.

在本实施例中,在建立目标虚拟机110与目标GPU设备的直通连接后,对目标GPU设备的算力资源使用情况进行实时监测。In this embodiment, after the direct connection between the target virtual machine 110 and the target GPU device is established, the computing resource usage of the target GPU device is monitored in real time.

将算力资源使用情况可视化展示给后台操作人员,以便后台操作人员可以随时监测目标GPU设备。将算力资源使用情况可视化呈现可以通过web服务器。web服务器与用于GPU算力资源调度的服务器可以是同一台,也可以使用独立的web服务器进行可视化处理。The computing power resource usage is visually displayed to back-end operators so that they can monitor the target GPU device at any time. The visualization can be provided through a web server. The web server may be the same server used for GPU computing power resource scheduling, or an independent web server may be used for the visualization.

将算力资源使用情况利用数据库处理工具进行数据统计,并生成GPU设备141算力资源使用特征。数据库处理工具可以是MySQL数据库。GPU设备141算力资源使用特征可以包括,目标GPU设备在应对目标虚拟机110上的待处理任务时的算力消耗情况、处理速度、处理准确度等维度。Use a database processing tool to perform data statistics on computing resource usage, and generate computing resource usage characteristics of the GPU device 141. The database processing tool can be a MySQL database. The computing power resource usage characteristics of the GPU device 141 may include dimensions such as computing power consumption, processing speed, and processing accuracy of the target GPU device when dealing with tasks to be processed on the target virtual machine 110 .
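As an illustrative sketch of the statistics step (the patent names MySQL but gives no schema), one could sample per-GPU utilization in csv form — `nvidia-smi --query-gpu=... --format=csv` is the standard NVIDIA interface — and turn each line into an INSERT statement. The table and column names below are assumptions, and a sample line stands in for live output:

```shell
#!/bin/bash
# Turn one csv sample line (index, utilization %, memory-used MiB) into a
# hypothetical MySQL INSERT. Live data would come from:
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used \
#       --format=csv,noheader,nounits
to_insert() {
    IFS=', ' read -r idx util mem <<< "$1"
    printf "INSERT INTO gpu_usage (gpu_index, util_pct, mem_mib) VALUES (%s, %s, %s);\n" \
        "$idx" "$util" "$mem"
}

sample="0, 87, 10240"
sql=$(to_insert "$sample")
echo "$sql"
# echo "$sql" | mysql gpu_stats   # pipe into MySQL in production
```

Aggregating such rows over time yields the usage-characteristic features later used to train the target GPU device determination model.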

本实施例的系统构架如图8所示,服务器130用于对算力资源使用情况进行实时监测;数据库处理工具150用于将算力资源使用情况数据化处理,生成算力资源使用特征;可视化服务器160用于将算力资源使用情况可视化呈现。The system architecture of this embodiment is shown in Figure 8. The server 130 is used to monitor the usage of computing resources in real time; the database processing tool 150 is used to digitize the usage of computing resources and generate computing resource usage characteristics; visualization The server 160 is used to visually present the usage of computing resources.

本实施例的优点是,方便后台操作人员可以实时获取目标GPU设备的算力资源使用情况,为之后的GPU算力资源调度提供更多经验数据,进而逐步提高GPU算力资源调度方法的调度准确性。The advantage of this embodiment is that back-end operators can obtain the computing power resource usage of the target GPU device in real time, providing more empirical data for subsequent GPU computing power resource scheduling and thereby gradually improving the scheduling accuracy of the GPU computing power resource scheduling method.

基于本实施例中生成的GPU设备141算力资源使用特征,在一个实施例中,在基于算力资源使用情况生成GPU设备141算力资源使用特征之后,GPU算力资源调度方法还包括:利用算力资源使用特征训练目标GPU设备确定模型。在步骤220的一个实施例中,目标GPU设备确定模型用于根据GPU算力资源请求确定目标GPU设备。目标GPU设备确定模型就是通过算力资源使用特征数据进行训练的。这使得训练出的模型以实际GPU算力资源调度的经验为基础,更加贴合服务器130中GPU算力资源调度场景,提高了确定目标GPU设备的准确性。Based on the computing power resource usage characteristics of the GPU device 141 generated in this embodiment, in one embodiment, after those characteristics are generated from the computing power resource usage, the GPU computing power resource scheduling method further includes: training the target GPU device determination model with the computing power resource usage characteristics. In one embodiment of step 220, the target GPU device determination model is used to determine the target GPU device according to the GPU computing power resource request. The target GPU device determination model is trained precisely on this usage-characteristic data. This bases the trained model on actual GPU computing power resource scheduling experience, makes it fit the GPU computing power resource scheduling scenario in the server 130 more closely, and improves the accuracy of determining the target GPU device.

步骤250中,在建立目标虚拟机110与目标GPU设备的直通连接后,目标GPU设备通过直接存储器访问将资源数据透传给目标虚拟机110。透传是指在数据传输过程中,将数据原封不动地传输到目标虚拟机110,不对数据进行任何更改。使用透传的优点是,保证了数据的完整性和准确性。In step 250, after the direct connection between the target virtual machine 110 and the target GPU device is established, the target GPU device transparently transmits the resource data to the target virtual machine 110 through direct memory access. Transparent transmission means that during the data transmission process, the data is transmitted to the target virtual machine 110 intact without any changes to the data. The advantage of using transparent transmission is that it ensures the integrity and accuracy of data.

在一个实施例中,如图9所示,利用设备传输驱动将目标GPU设备中的资源数据透传给目标虚拟机,包括:In one embodiment, as shown in Figure 9, the device transmission driver is used to transparently transmit resource data in the target GPU device to the target virtual machine, including:

步骤910、将目标GPU设备中的资源数据封装为资源包,对资源包进行加密;Step 910: Encapsulate the resource data in the target GPU device into a resource package, and encrypt the resource package;

步骤920、将加密后的数据透传给目标虚拟机。Step 920: Transparently transmit the encrypted data to the target virtual machine.

在本实施例中,为了进一步保证数据传输的安全性,将资源数据进行加密传输;又为了保证在加密过程中数据不被修改,在进行加密之前,将资源数据封装为资源包,之后再对资源包进行加密。加密方式可以有多种,例如,利用哈希算法进行加密、或者对称加密等。In this embodiment, to further ensure the security of data transmission, the resource data is transmitted in encrypted form; and to ensure that the data is not modified during the encryption process, the resource data is first encapsulated into a resource package before encryption, and the resource package is then encrypted. Various encryption methods are possible, for example, encryption using a hash algorithm, or symmetric encryption.

在目标虚拟机110接收到加密后的资源包后,首先基于加密算法对资源包进行解密,再获取到来自目标GPU的资源数据。After the target virtual machine 110 receives the encrypted resource package, it first decrypts the resource package based on the encryption algorithm, and then obtains the resource data from the target GPU.
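A runnable sketch of the package-then-encrypt flow, using symmetric AES via `openssl enc` (the patent does not fix an algorithm, so this choice, the passphrase handling, and the file names are illustrative):

```shell
#!/bin/bash
# Package resource data, encrypt the package, then decrypt and verify.
set -e
workdir=$(mktemp -d)
echo "resource data from the target GPU" > "$workdir/resource.dat"

# Step 910: encapsulate into a package, then encrypt the package.
tar -C "$workdir" -cf "$workdir/resource.tar" resource.dat
openssl enc -aes-256-cbc -pbkdf2 -salt \
    -pass pass:demo-secret \
    -in "$workdir/resource.tar" -out "$workdir/resource.tar.enc"

# Step 920 (receiving side): decrypt, then unpack.
openssl enc -d -aes-256-cbc -pbkdf2 \
    -pass pass:demo-secret \
    -in "$workdir/resource.tar.enc" -out "$workdir/received.tar"
mkdir "$workdir/out"
tar -C "$workdir/out" -xf "$workdir/received.tar"

cmp "$workdir/resource.dat" "$workdir/out/resource.dat" && echo "round-trip ok"
```

Packaging before encryption, as in the text, means the receiver can verify it got exactly the bytes that were sealed, independent of the transport.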

本实施例的优点是,提高了数据透传过程中的安全性。The advantage of this embodiment is that the security during data transparent transmission is improved.

步骤260中,当目标虚拟机110利用目标GPU设备完成待处理事务后,目标虚拟机110向服务器130发送解绑请求,服务器130将配置文件中命令目标GPU设备与目标虚拟机110进行绑定的指令修改为命令目标GPU设备与目标虚拟机110进行解绑。配置文件修改完成后,由设备管理驱动再一次运行配置文件,就可以解除目标虚拟机110与目标GPU设备之间的直通连接。在目标GPU设备与目标虚拟机110解绑后,需要再一次与服务器130进行绑定,以便应对之后的GPU算力资源请求。In step 260, after the target virtual machine 110 has completed its pending transactions using the target GPU device, the target virtual machine 110 sends an unbinding request to the server 130, and the server 130 changes the instruction in the configuration file that binds the target GPU device to the target virtual machine 110 into an instruction that unbinds them. After the configuration file is modified, the device management driver runs the configuration file again, and the pass-through connection between the target virtual machine 110 and the target GPU device is released. After the target GPU device is unbound from the target virtual machine 110, it needs to be bound to the server 130 again so as to serve subsequent GPU computing power resource requests.

但是,可能会出现目标虚拟机110利用目标GPU设备完成待处理事务后,没有即时向服务器130发送解绑请求,导致目标GPU设备算力资源空闲。因此,在一个实施例中,如图10所示,在目标虚拟机110使用目标GPU设备中的算力资源计算结束后,修改配置文件,并利用设备管理驱动加载配置文件,解除目标虚拟机110与目标GPU设备的直通连接,包括:However, it may happen that, after completing its pending transactions with the target GPU device, the target virtual machine 110 does not immediately send an unbinding request to the server 130, leaving the computing power resources of the target GPU device idle. Therefore, in one embodiment, as shown in Figure 10, after the target virtual machine 110 finishes its computation with the computing power resources in the target GPU device, modifying the configuration file and loading it with the device management driver to release the pass-through connection between the target virtual machine 110 and the target GPU device includes:

步骤1010、按照预定周期检测目标GPU设备的算力资源使用情况;Step 1010: Detect the computing resource usage of the target GPU device according to a predetermined period;

步骤1020、如果目标GPU设备停止计算，在预定时间段后再次检测，如果目标GPU设备依旧停止计算，修改配置文件；Step 1020: If the target GPU device has stopped computing, detect again after a predetermined period of time; if the target GPU device is still not computing, modify the configuration file;

步骤1030、利用设备管理驱动加载配置文件,解除目标GPU设备与目标虚拟机之间的直通连接。Step 1030: Use the device management driver to load the configuration file and release the direct connection between the target GPU device and the target virtual machine.

在本实施例中，对目标GPU设备上的算力资源使用情况做周期性检测。例如，每隔1小时查看一次算力资源使用情况。如果目标GPU设备停止计算，那么表示目标GPU设备可能中断使用，或者已经结束使用。此时启动预定时间段，例如，在20分钟后再一次检测目标GPU设备的算力资源使用情况。如果在预定时间段后再次检测时发现目标GPU设备已经恢复计算，那么代表目标GPU设备仍被使用；如果发现目标GPU设备依旧停止计算，那么代表目标GPU设备很大概率已经完成了计算任务，此时，修改配置文件。In this embodiment, the computing power resource usage of the target GPU device is checked periodically, for example, once every hour. If the target GPU device has stopped computing, this indicates that the target GPU device may have paused or may already have finished its work. A predetermined waiting period is then started; for example, the computing power resource usage of the target GPU device is checked again 20 minutes later. If this second check finds that the target GPU device has resumed computing, the target GPU device is still in use; if it finds that the target GPU device is still not computing, the target GPU device has most likely completed its computing task, and at this point the configuration file is modified.
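The two-stage idle check described above can be sketched as follows, assuming a `read_utilization` callback that reports whether the target GPU device is currently computing; the callback and parameter names are assumptions, not part of the disclosure.

```python
import time


def monitor_and_unbind(read_utilization, unbind, check_period=3600.0,
                       confirm_delay=1200.0, sleep=time.sleep):
    """Poll GPU utilization every `check_period` seconds (1 hour in the text);
    on a first idle reading, wait `confirm_delay` seconds (20 minutes in the
    text) and check again before actually triggering the unbind."""
    while True:
        sleep(check_period)
        if read_utilization() > 0:
            continue            # still computing: keep the pass-through
        sleep(confirm_delay)    # first idle reading: wait and confirm
        if read_utilization() > 0:
            continue            # resumed in the meantime: still in use
        unbind()                # idle twice in a row: release the GPU
        return
```

The `sleep` parameter is injected only so the loop can be exercised without real waiting; a production monitor would run with the defaults.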

通过设备管理驱动加载修改后的配置文件，解除目标虚拟机110与目标GPU设备的直通连接。The device management driver then loads the modified configuration file, releasing the pass-through connection between the target virtual machine 110 and the target GPU device.

本实施例的优点是，避免因目标虚拟机110未主动提出解绑请求而造成GPU算力资源浪费，同时，采用两次检测确保目标GPU设备已完成计算任务，避免对目标虚拟机110中的事务处理产生影响。The advantage of this embodiment is that it avoids wasting GPU computing power resources when the target virtual machine 110 does not actively issue an unbinding request; at the same time, checking twice ensures that the target GPU device has indeed completed its computing task, so that transaction processing in the target virtual machine 110 is not affected.

如图11所示，示例性地展示了本公开实施例的整体构架。其中服务器130根据目标虚拟机110B的GPU算力资源请求确定目标GPU设备后，由设备管理驱动加载配置文件，由设备传输驱动生成虚拟GPU设备141，在虚拟GPU设备141与目标GPU设备之间建立直通连接，目标GPU设备与服务器130解绑。当目标虚拟机110B访问虚拟GPU设备141时，也就是在访问目标GPU设备。本公开实施例还应用数据库处理工具实时统计目标GPU设备的算力资源使用情况，并通过可视化服务器130将算力资源使用情况的数据可视化呈现。As shown in Figure 11, the overall architecture of an embodiment of the present disclosure is illustrated by way of example. After the server 130 determines the target GPU device according to the GPU computing power resource request of the target virtual machine 110B, the device management driver loads the configuration file, the device transmission driver generates a virtual GPU device 141 and establishes a pass-through connection between the virtual GPU device 141 and the target GPU device, and the target GPU device is unbound from the server 130. When the target virtual machine 110B accesses the virtual GPU device 141, it is in fact accessing the target GPU device. The embodiment of the present disclosure also uses database processing tools to compile real-time statistics on the computing power resource usage of the target GPU device, and presents the usage data visually through the visualization server 130.
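The real-time statistics side of this architecture can be sketched as a rolling recorder of utilization samples; the database tool and the visualization front end described for Figure 11 are out of scope here, and all names are illustrative.

```python
from collections import deque
from statistics import mean


class GpuUsageRecorder:
    """Keeps a rolling window of utilization samples for one GPU device.
    A real deployment would persist the samples with a database processing
    tool and render them through the visualization server."""

    def __init__(self, window: int = 3600):
        # Bounded window: old samples fall off automatically.
        self.samples = deque(maxlen=window)

    def record(self, utilization_percent: float) -> None:
        self.samples.append(utilization_percent)

    def summary(self) -> dict:
        # The figures a dashboard would typically display.
        return {"latest": self.samples[-1], "avg": mean(self.samples)}
```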

下面对本公开实施例的装置和设备进行描述。The apparatus and equipment of the embodiments of the present disclosure are described below.

可以理解的是，虽然上述各个流程图中的各个步骤按照箭头的表征依次显示，但是这些步骤并不是必然按照箭头表征的顺序依次执行。除非本实施例中有明确的说明，这些步骤的执行并没有严格的顺序限制，这些步骤可以以其它的顺序执行。而且，上述流程图中的至少一部分步骤可以包括多个步骤或者多个阶段，这些步骤或者阶段并不必然是在同一时间执行完成，而是可以在不同的时间执行，这些步骤或者阶段的执行顺序也不必然是依次进行，而是可以与其它步骤或者其它步骤中的步骤或者阶段的至少一部分轮流或者交替地执行。It should be understood that, although the steps in each of the above flowcharts are shown in the order indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated in this embodiment, there is no strict ordering restriction on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in the above flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same time but may be executed at different times; nor do they necessarily have to be executed sequentially, as they may be executed in turn or alternately with other steps, or with at least part of the sub-steps or stages of other steps.

需要说明的是，在本申请的各个具体实施方式中，当涉及到需要根据目标虚拟机属性信息或属性信息集合等与目标虚拟机特性相关的数据进行相关处理时，都会先获得目标虚拟机的许可或者同意，而且，对这些数据的收集、使用和处理等，都会遵守相关国家和地区的相关法律法规和标准。此外，当本申请实施例需要获取目标虚拟机属性信息时，会通过弹窗或者跳转到确认页面等方式获得目标虚拟机的单独许可或者单独同意，在明确获得目标虚拟机的单独许可或者单独同意之后，再获取用于使本申请实施例能够正常运行的必要的目标虚拟机相关数据。It should be noted that, in each specific implementation of this application, whenever processing needs to be performed based on data related to the characteristics of the target virtual machine, such as its attribute information or a collection of attribute information, the permission or consent of the target virtual machine is obtained first, and the collection, use, and processing of such data comply with the relevant laws, regulations, and standards of the relevant countries and regions. In addition, when an embodiment of this application needs to obtain the attribute information of the target virtual machine, separate permission or separate consent of the target virtual machine is obtained through a pop-up window or a jump to a confirmation page; only after such separate permission or consent has been explicitly obtained are the target-virtual-machine-related data necessary for the normal operation of the embodiment acquired.

图12为本公开实施例提供的GPU算力资源调度装置1200的结构图。该GPU算力资源调度装置1200包括:Figure 12 is a structural diagram of a GPU computing resource scheduling device 1200 provided by an embodiment of the present disclosure. The GPU computing resource scheduling device 1200 includes:

接收单元1210,用于接收来自目标虚拟机的GPU算力资源请求;The receiving unit 1210 is used to receive the GPU computing resource request from the target virtual machine;

分配单元1220,用于根据GPU算力资源请求在多个GPU设备中选择目标GPU设备;The allocation unit 1220 is used to select a target GPU device among multiple GPU devices according to the GPU computing resource request;

生成单元1230,用于基于目标GPU设备,生成配置文件;Generating unit 1230, used to generate a configuration file based on the target GPU device;

直通建立单元1240,用于利用设备管理驱动加载配置文件,利用设备传输驱动建立目标虚拟机与目标GPU设备的直通连接;The pass-through establishment unit 1240 is used to load the configuration file using the device management driver, and establish a pass-through connection between the target virtual machine and the target GPU device using the device transmission driver;

透传单元1250,用于利用设备传输驱动将目标GPU设备中的资源数据透传给目标虚拟机;The transparent transmission unit 1250 is used to transparently transmit the resource data in the target GPU device to the target virtual machine using the device transmission driver;

直通解绑单元1260，用于在目标虚拟机使用目标GPU设备中的算力资源计算结束后，修改配置文件，并利用设备管理驱动加载配置文件，解除目标虚拟机与目标GPU设备的直通连接。The pass-through unbinding unit 1260 is configured to, after the target virtual machine has finished computing with the computing power resources in the target GPU device, modify the configuration file and load the configuration file with the device management driver so as to release the pass-through connection between the target virtual machine and the target GPU device.

可选地,多个GPU设备与服务器绑定;Optionally, multiple GPU devices are bound to the server;

直通建立单元1240还用于:The pass-through establishment unit 1240 is also used for:

获取目标虚拟机的节点地址;Get the node address of the target virtual machine;

解除目标GPU设备与服务器的绑定;Unbind the target GPU device from the server;

利用设备管理驱动加载配置文件,利用设备传输驱动按照节点地址建立目标虚拟机与目标GPU设备的直通连接。Use the device management driver to load the configuration file, and use the device transmission driver to establish a direct connection between the target virtual machine and the target GPU device according to the node address.

可选地,直通建立单元1240还用于:Optionally, the pass-through establishment unit 1240 is also used to:

获取目标GPU设备的总端口地址;Get the total port address of the target GPU device;

将总端口地址转化为第一虚拟地址;Convert the general port address into the first virtual address;

获取目标GPU设备的功能端口地址;Get the function port address of the target GPU device;

将功能端口地址转换为第二虚拟地址;Convert the functional port address to the second virtual address;

建立总端口地址与第一虚拟地址、功能端口地址与第二虚拟地址之间的映射关系。Establish a mapping relationship between the general port address and the first virtual address, and the functional port address and the second virtual address.
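The two address translations and their mapping relationship described above can be sketched as simple bookkeeping. A real pass-through setup would program the IOMMU or the hypervisor's page tables; the class and the addresses below are made up for illustration only.

```python
class AddressMapper:
    """Records which guest-visible virtual address each physical port
    address of the target GPU device has been translated to."""

    def __init__(self):
        self.phys_to_virt = {}

    def map(self, phys_addr: int, virt_addr: int) -> None:
        # Establish one entry of the mapping relationship.
        self.phys_to_virt[phys_addr] = virt_addr

    def translate(self, phys_addr: int) -> int:
        # Resolve a port address to the virtual address the VM uses.
        return self.phys_to_virt[phys_addr]


mapper = AddressMapper()
GENERAL_PORT, FUNCTION_PORT = 0xF000_0000, 0xF100_0000   # hypothetical
FIRST_VIRT, SECOND_VIRT = 0x7F00_0000, 0x7F10_0000       # hypothetical
mapper.map(GENERAL_PORT, FIRST_VIRT)     # general port -> first virtual address
mapper.map(FUNCTION_PORT, SECOND_VIRT)   # function port -> second virtual address
```

With both mappings in place, the device transmission driver can expose a virtual GPU device in the target virtual machine backed by these two virtual addresses.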

可选地,直通建立单元1240还用于:Optionally, the pass-through establishment unit 1240 is also used to:

基于第一虚拟地址与第二虚拟地址,利用设备传输驱动在目标虚拟机中创建一个虚拟GPU设备;Based on the first virtual address and the second virtual address, use the device transfer driver to create a virtual GPU device in the target virtual machine;

建立虚拟GPU设备与目标GPU设备的直通连接。Establish a pass-through connection between the virtual GPU device and the target GPU device.

可选地，透传单元1250还用于：Optionally, the transparent transmission unit 1250 is further configured to:

将目标GPU设备中的资源数据封装为资源包,对资源包进行加密;Encapsulate the resource data in the target GPU device into a resource package and encrypt the resource package;

将加密后的资源包透传给目标虚拟机。Transparently transmit the encrypted resource package to the target virtual machine.

可选地,直通解绑单元1260还用于:Optionally, the pass-through unbundling unit 1260 is also used to:

按照预定周期检测目标GPU设备的算力资源使用情况;Detect the computing resource usage of the target GPU device according to a predetermined period;

如果目标GPU设备停止计算，在预定时间段后再次检测，如果目标GPU设备依旧停止计算，修改配置文件；If the target GPU device has stopped computing, detect again after a predetermined period of time; if the target GPU device is still not computing, modify the configuration file;

利用设备管理驱动加载配置文件,解除目标GPU设备与目标虚拟机之间的直通连接。Use the device management driver to load the configuration file and release the pass-through connection between the target GPU device and the target virtual machine.

可选地,GPU算力资源调度装置1200还包括:Optionally, the GPU computing resource scheduling device 1200 also includes:

监测单元(未示出),用于实时监测目标GPU设备的算力资源使用情况;A monitoring unit (not shown), used to monitor the computing resource usage of the target GPU device in real time;

可视化单元(未示出),用于将算力资源使用情况可视化呈现;A visualization unit (not shown), used to visually present the usage of computing resources;

特征生成单元(未示出),用于基于算力资源使用情况生成GPU设备算力资源使用特征。A feature generation unit (not shown), configured to generate GPU device computing resource usage characteristics based on computing resource usage.

可选地,GPU算力资源调度装置1200还包括:Optionally, the GPU computing resource scheduling device 1200 also includes:

训练单元（未示出），用于利用算力资源使用特征训练目标GPU设备确定模型；A training unit (not shown), configured to train a target GPU device determination model using the computing power resource usage features;

分配单元1220还用于:将GPU算力资源请求输入目标GPU设备确定模型,得到目标GPU设备。The allocation unit 1220 is also used to input the GPU computing resource request into the target GPU device determination model to obtain the target GPU device.
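Feeding the request into the trained model can be sketched as follows. The disclosure does not fix a model type, so a trivial scorer that prefers the historically least-utilized GPU with enough free memory stands in for the target GPU device determination model; all field names are assumptions.

```python
def choose_target_gpu(request: dict, gpu_features: dict) -> str:
    """request: e.g. {"memory_mb": ...}; gpu_features: {gpu_id: {"free_mb": ...,
    "avg_utilization": ...}} built from the recorded usage features.
    Returns the id of the GPU device selected as the target."""
    candidates = [(feat["avg_utilization"], gpu_id)
                  for gpu_id, feat in gpu_features.items()
                  if feat["free_mb"] >= request["memory_mb"]]
    if not candidates:
        raise RuntimeError("no GPU device can satisfy the request")
    # Prefer the GPU with the lowest historical utilization.
    return min(candidates)[1]
```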

参照图13，图13为实现本公开实施例的终端的部分的结构框图，该终端包括：射频（Radio Frequency，简称RF）电路1310、存储器1315、输入单元1330、显示单元1340、传感器1350、音频电路1360、无线保真（wireless fidelity，简称WiFi）模块1370、处理器1380、以及电源1390等部件。本领域技术人员可以理解，图13示出的终端结构并不构成对手机或电脑的限定，可以包括比图示更多或更少的部件，或者组合某些部件，或者不同的部件布置。Referring to Figure 13, which is a structural block diagram of part of a terminal implementing an embodiment of the present disclosure, the terminal includes components such as a radio frequency (RF) circuit 1310, a memory 1315, an input unit 1330, a display unit 1340, a sensor 1350, an audio circuit 1360, a wireless fidelity (WiFi) module 1370, a processor 1380, and a power supply 1390. Those skilled in the art will understand that the terminal structure shown in Figure 13 does not constitute a limitation on the mobile phone or computer, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.

RF电路1310可用于收发信息或通话过程中，信号的接收和发送，特别地，将基站的下行信息接收后，给处理器1380处理；另外，将设计上行的数据发送给基站。The RF circuit 1310 may be used to receive and transmit signals during the transmission and reception of information or during a call; in particular, after downlink information from a base station is received, it is passed to the processor 1380 for processing, and uplink data is sent to the base station.

存储器1315可用于存储软件程序以及模块,处理器1380通过运行存储在存储器1315的软件程序以及模块,从而执行终端的各种功能应用以及数据处理。The memory 1315 can be used to store software programs and modules. The processor 1380 executes various functional applications and data processing of the terminal by running the software programs and modules stored in the memory 1315 .

输入单元1330可用于接收输入的数字或字符信息,以及产生与终端的设置以及功能控制有关的键信号输入。具体地,输入单元1330可包括触控面板1331以及其他输入装置1332。The input unit 1330 may be used to receive input numeric or character information, and generate key signal input related to settings and function control of the terminal. Specifically, the input unit 1330 may include a touch panel 1331 and other input devices 1332.

显示单元1340可用于显示输入的信息或提供的信息以及终端的各种菜单。显示单元1340可包括显示面板1341。The display unit 1340 may be used to display input information or provided information as well as various menus of the terminal. The display unit 1340 may include a display panel 1341.

音频电路1360、扬声器1361,传声器1362可提供音频接口。The audio circuit 1360, speaker 1361, and microphone 1362 can provide an audio interface.

在本实施例中,该终端所包括的处理器1380可以执行前面实施例的GPU算力资源调度方法。In this embodiment, the processor 1380 included in the terminal can execute the GPU computing resource scheduling method in the previous embodiment.

本公开实施例的终端包括但不限于手机、电脑、智能语音交互设备、智能家电、车载终端、飞行器等。本发明实施例可应用于各种场景，包括但不限于云计算、人工智能等。Terminals in embodiments of the present disclosure include, but are not limited to, mobile phones, computers, intelligent voice interaction devices, smart home appliances, vehicle-mounted terminals, aircraft, and the like. Embodiments of the present invention can be applied to various scenarios, including but not limited to cloud computing, artificial intelligence, and so on.

图14为实施本公开实施例的服务器130的部分的结构框图。服务器130可因配置或性能不同而产生比较大的差异，可以包括一个或一个以上中央处理器（Central Processing Units，简称CPU）1422（例如，一个或一个以上处理器）和存储器1432，一个或一个以上存储应用程序1442或数据1444的存储介质1430（例如一个或一个以上海量存储装置）。其中，存储器1432和存储介质1430可以是短暂存储或持久存储。存储在存储介质1430的程序可以包括一个或一个以上模块（图示没标出），每个模块可以包括对服务器130中的一系列指令操作。更进一步地，中央处理器1422可以设置为与存储介质1430通信，在服务器130上执行存储介质1430中的一系列指令操作。FIG. 14 is a structural block diagram of part of the server 130 implementing an embodiment of the present disclosure. The server 130 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 1422 (for example, one or more processors), memory 1432, and one or more storage media 1430 (for example, one or more mass storage devices) storing application programs 1442 or data 1444. The memory 1432 and the storage medium 1430 may provide transient or persistent storage. The program stored in the storage medium 1430 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server 130. Furthermore, the central processing unit 1422 may be configured to communicate with the storage medium 1430 and execute, on the server 130, the series of instruction operations in the storage medium 1430.

服务器130还可以包括一个或一个以上电源1426，一个或一个以上有线或无线网络接口1450，一个或一个以上输入输出接口1458，和/或，一个或一个以上操作系统1441，例如Windows ServerTM，Mac OS XTM，UnixTM，LinuxTM，FreeBSDTM等等。The server 130 may also include one or more power supplies 1426, one or more wired or wireless network interfaces 1450, one or more input/output interfaces 1458, and/or one or more operating systems 1441, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, and so on.

服务器130中的中央处理器1422可以用于执行本公开实施例的GPU算力资源调度方法。The central processor 1422 in the server 130 may be used to execute the GPU computing resource scheduling method according to the embodiment of the present disclosure.

本公开实施例还提供一种计算机可读存储介质,计算机可读存储介质用于存储程序代码,程序代码用于执行前述各个实施例的GPU算力资源调度方法。Embodiments of the present disclosure also provide a computer-readable storage medium. The computer-readable storage medium is used to store program codes. The program codes are used to execute the GPU computing resource scheduling methods of the aforementioned embodiments.

本公开实施例还提供了一种计算机程序产品,该计算机程序产品包括计算机程序。计算机设备的处理器读取该计算机程序并执行,使得该计算机设备执行实现上述的GPU算力资源调度方法。An embodiment of the present disclosure also provides a computer program product, which includes a computer program. The processor of the computer device reads the computer program and executes it, so that the computer device executes and implements the above GPU computing power resource scheduling method.

本公开的说明书及上述附图中的术语“第一”、“第二”、“第三”、“第四”等（如果存在）是用于区别类似的对象，而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换，以便这里描述的本公开的实施例例如能够以除了在这里图示或描述的那些以外的顺序实施。此外，术语“包括”和“包含”以及他们的任何变形，意图在于覆盖不排他的包含，例如，包含了一系列步骤或单元的过程、方法、系统、产品或装置不必限于清楚地列出的那些步骤或单元，而是可包括没有清楚地列出的或对于这些过程、方法、产品或装置固有的其它步骤或单元。The terms "first", "second", "third", "fourth", and so on (if present) in the specification of the present disclosure and in the above drawings are used to distinguish between similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the disclosure described here can, for example, be practiced in orders other than those illustrated or described here. Furthermore, the terms "include" and "comprise" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that includes a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to the process, method, product, or apparatus.

应当理解，在本公开中，“至少一个（项）”是指一个或者多个，“多个”是指两个或两个以上。“和/或”，用于描述关联对象的关联关系，表示可以存在三种关系，例如，“A和/或B”可以表示：只存在A，只存在B以及同时存在A和B三种情况，其中A，B可以是单数或者复数。字符“/”一般表示前后关联对象是一种“或”的关系。“以下至少一项（个）”或其类似表达，是指这些项中的任意组合，包括单项（个）或复数项（个）的任意组合。例如，a，b或c中的至少一项（个），可以表示：a，b，c，“a和b”，“a和c”，“b和c”，或“a和b和c”，其中a，b，c可以是单个，也可以是多个。It should be understood that, in the present disclosure, "at least one (item)" means one or more, and "multiple" means two or more. "And/or" describes the relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean that only A exists, only B exists, or both A and B exist, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the objects before and after it. "At least one of the following" or similar expressions refer to any combination of those items, including any combination of single items or plural items. For example, at least one of a, b, or c may mean a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be single or multiple.

应了解,在本公开实施例的描述中,多个(或多项)的含义是两个以上,大于、小于、超过等理解为不包括本数,以上、以下、以内等理解为包括本数。It should be understood that in the description of the embodiments of the present disclosure, the meaning of multiple (or multiple items) is two or more, greater than, less than, more than, etc. are understood as excluding the number, and above, below, within, etc. are understood as including the number.

在本公开所提供的几个实施例中，应该理解到，所揭露的系统，装置和方法，可以通过其它的方式实现。例如，以上所描述的装置实施例仅仅是示意性的，例如，单元的划分，仅仅为一种逻辑功能划分，实际实现时可以有另外的划分方式，例如多个单元或组件可以结合或者可以集成到另一个系统，或一些特征可以忽略，或不执行。另一点，所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口，装置或单元的间接耦合或通信连接，可以是电性，机械或其它的形式。In the several embodiments provided in this disclosure, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division into units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. Also, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.

作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。A unit described as a separate component may or may not be physically separate. A component shown as a unit may or may not be a physical unit, that is, it may be located in one place, or it may be distributed to multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

另外,在本公开各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。In addition, each functional unit in various embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above integrated units can be implemented in the form of hardware or software functional units.

集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时，可以存储在一个计算机可读取存储介质中。基于这样的理解，本公开的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质中，包括若干指令用以使得一台计算机装置（可以是个人计算机，服务器130，或者网络装置等）执行本公开各个实施例方法的全部或部分步骤。而前述的存储介质包括：U盘、移动硬盘、只读存储器（Read-Only Memory，简称ROM）、随机存取存储器（Random Access Memory，简称RAM）、磁碟或者光盘等各种可以存储程序代码的介质。If the integrated units are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present disclosure in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, the server 130, a network device, or the like) to execute all or part of the steps of the methods of the various embodiments of the present disclosure. The aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

还应了解,本公开实施例提供的各种实施方式可以任意进行组合,以实现不同的技术效果。It should also be understood that the various implementation modes provided by the embodiments of the present disclosure can be combined arbitrarily to achieve different technical effects.

以上是对本公开的实施方式的具体说明，但本公开并不局限于上述实施方式，熟悉本领域的技术人员在不违背本公开精神的条件下还可作出种种等同的变形或替换，这些等同的变形或替换均包括在本公开权利要求所限定的范围内。The above is a specific description of the embodiments of the present disclosure, but the disclosure is not limited to the above embodiments; those skilled in the art can also make various equivalent modifications or substitutions without departing from the spirit of the disclosure, and these equivalent modifications or substitutions are all included within the scope defined by the claims of this disclosure.

Claims (12)

1.一种GPU算力资源调度方法，其特征在于，所述GPU算力资源调度方法应用于服务器对多个GPU设备的算力资源调度；所述GPU算力资源调度方法包括：接收来自目标虚拟机的GPU算力资源请求；根据所述GPU算力资源请求在多个所述GPU设备中选择目标GPU设备；基于目标GPU设备，生成配置文件；利用设备管理驱动加载所述配置文件，利用设备传输驱动建立所述目标虚拟机与所述目标GPU设备的直通连接；利用所述设备传输驱动将所述目标GPU设备中的资源数据透传给所述目标虚拟机；在所述目标虚拟机使用所述目标GPU设备中的算力资源计算结束后，修改所述配置文件，并利用所述设备管理驱动加载所述配置文件，解除所述目标虚拟机与所述目标GPU设备的直通连接。
1. A GPU computing power resource scheduling method, characterized in that the GPU computing power resource scheduling method is applied to the scheduling, by a server, of the computing power resources of multiple GPU devices; the GPU computing power resource scheduling method comprises: receiving a GPU computing power resource request from a target virtual machine; selecting a target GPU device among the multiple GPU devices according to the GPU computing power resource request; generating a configuration file based on the target GPU device; loading the configuration file with a device management driver, and establishing a pass-through connection between the target virtual machine and the target GPU device with a device transmission driver; transparently transmitting the resource data in the target GPU device to the target virtual machine with the device transmission driver; and, after the target virtual machine has finished computing with the computing power resources in the target GPU device, modifying the configuration file, loading the configuration file with the device management driver, and releasing the pass-through connection between the target virtual machine and the target GPU device.

2.根据权利要求1所述的GPU算力资源调度方法，其特征在于，多个所述GPU设备与所述服务器绑定；所述利用设备管理驱动加载所述配置文件，利用设备传输驱动建立所述目标虚拟机与所述目标GPU设备的直通连接，包括：获取所述目标虚拟机的节点地址；解除所述目标GPU设备与所述服务器的绑定；利用所述设备管理驱动加载所述配置文件，利用所述设备传输驱动按照所述节点地址建立所述目标虚拟机与所述目标GPU设备的直通连接。
2. The GPU computing power resource scheduling method according to claim 1, characterized in that the multiple GPU devices are bound to the server; and loading the configuration file with the device management driver and establishing the pass-through connection between the target virtual machine and the target GPU device with the device transmission driver comprises: obtaining the node address of the target virtual machine; unbinding the target GPU device from the server; and loading the configuration file with the device management driver, and establishing the pass-through connection between the target virtual machine and the target GPU device according to the node address with the device transmission driver.

3.根据权利要求1所述的GPU算力资源调度方法，其特征在于，所述利用设备管理驱动加载所述配置文件，利用所述设备传输驱动建立所述目标虚拟机与所述目标GPU设备的直通连接，包括：获取所述目标GPU设备的总端口地址；将所述总端口地址转化为第一虚拟地址；获取所述目标GPU设备的功能端口地址；将所述功能端口地址转换为第二虚拟地址；建立所述总端口地址与所述第一虚拟地址、所述功能端口地址与所述第二虚拟地址之间的映射关系。
3. The GPU computing power resource scheduling method according to claim 1, characterized in that loading the configuration file with the device management driver and establishing the pass-through connection between the target virtual machine and the target GPU device with the device transmission driver comprises: obtaining the general port address of the target GPU device; converting the general port address into a first virtual address; obtaining the function port address of the target GPU device; converting the function port address into a second virtual address; and establishing mapping relationships between the general port address and the first virtual address, and between the function port address and the second virtual address.

4.根据权利要求3所述的GPU算力资源调度方法，其特征在于，所述利用所述设备传输驱动建立所述目标虚拟机与所述目标GPU设备的直通连接，包括：基于所述第一虚拟地址与所述第二虚拟地址，利用所述设备传输驱动在所述目标虚拟机中创建一个虚拟GPU设备；建立所述虚拟GPU设备与所述目标GPU设备的直通连接。
4. The GPU computing power resource scheduling method according to claim 3, characterized in that establishing the pass-through connection between the target virtual machine and the target GPU device with the device transmission driver comprises: creating a virtual GPU device in the target virtual machine with the device transmission driver based on the first virtual address and the second virtual address; and establishing a pass-through connection between the virtual GPU device and the target GPU device.

5.根据权利要求1所述的GPU算力资源调度方法，其特征在于，所述利用设备传输驱动将所述目标GPU设备中的资源数据透传给所述目标虚拟机，包括：将所述目标GPU设备中的所述资源数据封装为资源包，对所述资源包进行加密；将加密后的资源包透传给所述目标虚拟机。
5. The GPU computing power resource scheduling method according to claim 1, characterized in that transparently transmitting the resource data in the target GPU device to the target virtual machine with the device transmission driver comprises: encapsulating the resource data in the target GPU device into a resource package and encrypting the resource package; and transparently transmitting the encrypted resource package to the target virtual machine.

6.根据权利要求1所述的GPU算力资源调度方法，其特征在于，所述在所述目标虚拟机使用所述目标GPU设备中的算力资源计算结束后，修改所述配置文件，并利用所述设备管理驱动加载所述配置文件，解除所述目标虚拟机与所述目标GPU设备的直通连接，包括：按照预定周期检测所述目标GPU设备的算力资源使用情况；如果所述目标GPU设备停止计算，在预定时间段后再次检测，如果所述目标GPU设备依旧停止计算，修改所述配置文件；利用所述设备管理驱动加载所述配置文件，解除所述目标GPU设备与所述目标虚拟机之间的所述直通连接。
6. The GPU computing power resource scheduling method according to claim 1, characterized in that, after the target virtual machine has finished computing with the computing power resources in the target GPU device, modifying the configuration file, loading the configuration file with the device management driver, and releasing the pass-through connection between the target virtual machine and the target GPU device comprises: detecting the computing power resource usage of the target GPU device at a predetermined interval; if the target GPU device has stopped computing, detecting again after a predetermined period of time, and if the target GPU device is still not computing, modifying the configuration file; and loading the configuration file with the device management driver, and releasing the pass-through connection between the target GPU device and the target virtual machine.

7.根据权利要求1所述的GPU算力资源调度方法，其特征在于，在所述利用设备管理驱动加载所述配置文件，利用所述设备传输驱动建立所述目标虚拟机与所述目标GPU设备的直通连接之后，所述GPU算力资源调度方法还包括：实时监测所述目标GPU设备的算力资源使用情况；将所述算力资源使用情况可视化呈现；基于所述算力资源使用情况生成GPU设备算力资源使用特征。
7. The GPU computing power resource scheduling method according to claim 1, characterized in that, after loading the configuration file with the device management driver and establishing the pass-through connection between the target virtual machine and the target GPU device with the device transmission driver, the GPU computing power resource scheduling method further comprises: monitoring the computing power resource usage of the target GPU device in real time; presenting the computing power resource usage visually; and generating GPU device computing power resource usage features based on the computing power resource usage.

8.根据权利要求7所述的GPU算力资源调度方法，其特征在于，在所述基于所述算力资源使用情况生成GPU设备算力资源使用特征之后，所述GPU算力资源调度方法还包括：利用所述算力资源使用特征训练目标GPU设备确定模型；所述根据所述GPU算力资源请求在多个所述GPU设备中选择目标GPU设备，包括：将所述GPU算力资源请求输入所述目标GPU设备确定模型，得到所述目标GPU设备。
8. The GPU computing power resource scheduling method according to claim 7, characterized in that, after generating the GPU device computing power resource usage features based on the computing power resource usage, the GPU computing power resource scheduling method further comprises: training a target GPU device determination model with the computing power resource usage features; and selecting the target GPU device among the multiple GPU devices according to the GPU computing power resource request comprises: inputting the GPU computing power resource request into the target GPU device determination model to obtain the target GPU device.

9.一种GPU算力资源调度装置，其特征在于，包括：接收单元，用于接收来自目标虚拟机的GPU算力资源请求；分配单元，用于根据所述GPU算力资源请求在多个所述GPU设备中选择目标GPU设备；生成单元，用于基于目标GPU设备，生成配置文件；直通建立单元，用于利用设备管理驱动加载所述配置文件，利用设备传输驱动建立所述目标虚拟机与所述目标GPU设备的直通连接；透传单元，用于利用设备传输驱动将所述目标GPU设备中的资源数据透传给所述目标虚拟机；直通解绑单元，用于在所述目标虚拟机使用所述目标GPU设备中的算力资源计算结束后，修改所述配置文件，并利用所述设备管理驱动加载所述配置文件，解除所述目标虚拟机与所述目标GPU设备的直通连接。
9. A GPU computing power resource scheduling apparatus, characterized by comprising: a receiving unit configured to receive a GPU computing power resource request from a target virtual machine; an allocation unit configured to select a target GPU device among multiple GPU devices according to the GPU computing power resource request; a generating unit configured to generate a configuration file based on the target GPU device; a pass-through establishment unit configured to load the configuration file with a device management driver and establish a pass-through connection between the target virtual machine and the target GPU device with a device transmission driver; a transparent transmission unit configured to transparently transmit the resource data in the target GPU device to the target virtual machine with the device transmission driver; and a pass-through unbinding unit configured to, after the target virtual machine has finished computing with the computing power resources in the target GPU device, modify the configuration file and load the configuration file with the device management driver to release the pass-through connection between the target virtual machine and the target GPU device.

10.一种电子设备，包括存储器和处理器，所述存储器存储有计算机程序，其特征在于，所述处理器执行所述计算机程序时实现根据权利要求1至8任意一项所述的GPU算力资源调度方法。
10. An electronic device, comprising a memory and a processor, the memory storing a computer program, characterized in that, when executing the computer program, the processor implements the GPU computing power resource scheduling method according to any one of claims 1 to 8.

11.一种计算机可读存储介质，所述存储介质存储有计算机程序，其特征在于，所述计算机程序被处理器执行时实现根据权利要求1至8任意一项所述的GPU算力资源调度方法。
11. A computer-readable storage medium storing a computer program, characterized in that, when executed by a processor, the computer program implements the GPU computing power resource scheduling method according to any one of claims 1 to 8.

12.一种计算机程序产品，该计算机程序产品包括计算机程序，所述计算机程序被计算机设备的处理器读取并执行，使得该计算机设备执行根据权利要求1至8任意一项所述的GPU算力资源调度方法。
12. A computer program product comprising a computer program, the computer program being read and executed by a processor of a computer device so that the computer device performs the GPU computing power resource scheduling method according to any one of claims 1 to 8.
CN202310801620.4A | 2023-06-30 | 2023-06-30 | GPU computing power resource scheduling method, device, equipment and medium | Pending | CN116860391A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202310801620.4A | 2023-06-30 | 2023-06-30 | GPU computing power resource scheduling method, device, equipment and medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202310801620.4A | 2023-06-30 | 2023-06-30 | GPU computing power resource scheduling method, device, equipment and medium

Publications (1)

Publication Number | Publication Date
CN116860391A | 2023-10-10

Family

ID=88226109

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202310801620.4A | CN116860391A (en) (Pending) | 2023-06-30 | 2023-06-30

Country Status (1)

Country | Link
CN (1) | CN116860391A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN117573296A (en)* | 2024-01-17 | 2024-02-20 | Tencent Technology (Shenzhen) Co., Ltd. | Virtual machine equipment straight-through control method, device, equipment and storage medium
CN117573296B (en)* | 2024-01-17 | 2024-05-28 | Tencent Technology (Shenzhen) Co., Ltd. | Virtual machine equipment straight-through control method, device, equipment and storage medium
CN117873735A (en)* | 2024-03-11 | 2024-04-12 | Hunan Malanshan Video Advanced Technology Research Institute Co., Ltd. | GPU scheduling system under virtualized environment
CN117873735B (en)* | 2024-03-11 | 2024-05-28 | Hunan Malanshan Video Advanced Technology Research Institute Co., Ltd. | GPU scheduling system under virtualized environment
CN118312276A (en)* | 2024-04-25 | 2024-07-09 | China Electronics Cloud Computing Technology Co., Ltd. | A cloud-native virtual machine GPU device hot-swap management method and system
CN120104346A (en)* | 2025-05-06 | 2025-06-06 | Alibaba Cloud Feitian (Hangzhou) Cloud Computing Technology Co., Ltd. | GPU computing power scheduling method, device, equipment, storage medium and program product
CN120596280A (en)* | 2025-08-06 | 2025-09-05 | China Unicom Wo Music Culture Co., Ltd. | GPU scheduling system, method and electronic equipment

Similar Documents

Publication | Title
US10282192B1 (en) | Updating device code through a bus
CN116860391A (en) | GPU computing power resource scheduling method, device, equipment and medium
CN105159753B1 (en) | The method, apparatus and pooling of resources manager of accelerator virtualization
WO2015062339A1 (en) | Method and device for running remote application program
CN103455363B (en) | Command processing method, device and physical host of virtual machine
CN109495542B (en) | Load distribution method and terminal equipment based on performance monitoring
CN108491278B (en) | Method and network device for processing service data
CN112148477B (en) | Service process processing method, electronic device and storage medium
WO2025077279A1 (en) | Server management method, device and apparatus, non-volatile readable storage medium, and electronic device
WO2010087829A1 (en) | Selectively communicating data of a peripheral device to plural sending computers
WO2021244194A1 (en) | Register reading/writing method, chip, subsystem, register group, and terminal
CN114691286A (en) | Server system, virtual machine creation method and device
CN103262034A (en) | Partition data to virtual machines
CN103092676A (en) | Analog input output method, device and system of virtual machine cluster
CN117076409A (en) | File sharing method, device, system, electronic equipment and storage medium
CN105491082A (en) | Remote resource access method and switch equipment
WO2022032990A1 (en) | Command information transmission method, system, and apparatus, and readable storage medium
CN110659143B (en) | A communication method, device and electronic device between containers
CN108563492B (en) | Data acquisition method, virtual machine and electronic equipment
CN104092747A (en) | USB dynamic linking method and device in virtualization environment
CN114785894A (en) | Dialing networking method, device, equipment and readable storage medium
CN114661762A (en) | Query method and device for embedded database, storage medium and equipment
CN109218371B (en) | Method and equipment for calling data
CN114356970B (en) | Storage system resource caching method and device
CN116069422A (en) | Cloud desktop management method, device, equipment and storage medium

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination

[8]ページ先頭

©2009-2025 Movatter.jp