Detailed Description
The following description of embodiments of the present invention is made clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, but not all, embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without inventive effort fall within the scope of the present invention.
It should be noted that in the description of the present invention, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. The terms "first," "second," and the like in this specification are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
To enable those skilled in the art to better understand the aspects of the present invention, the present invention is described in further detail below with reference to the drawings and specific embodiments.
In cloud computing and artificial intelligence computing, modern data centers deploy more and more accelerator cards, and computation uses more and more processor cores, while the memory capacity and bandwidth allocated to each processor core do not grow correspondingly. Artificial intelligence model computation requires more accelerator cards and memory units: the computing power of a single accelerator card is limited, so multiple accelerator cards must compute cooperatively, which involves inter-card bus interconnection and communication to transfer intermediate computation cache data. The memory accessible to a single accelerator card is likewise limited, so the memory capacity of accelerator cards is continuously expanded to enable more memory sharing among accelerator cards.
Compute Express Link (CXL) is a high-speed serial protocol that allows fast, reliable data transfer between different components within a computer system. It aims to solve bottleneck problems in high-performance computing, including memory capacity, memory bandwidth, and I/O latency. CXL can also realize memory expansion and memory sharing, and can communicate with external devices such as computing accelerators (e.g., a graphics processing unit (GPU) or a field programmable gate array (FPGA)), thereby providing a faster and more flexible way of exchanging and processing data.
In the related art, a memory expansion card built from memory granules in the form of a Peripheral Component Interconnect Express (PCIe) golden finger, based on the Compute Express Link memory protocol (CXL Type 3), can provide expanded memory for an accelerator card. However, to access the expanded memory, the accelerator card must go through a memory copy operation of the central processing unit (CPU), through a Root Complex (RC) controller integrated inside the CPU, or through a PCIe switching controller (PCIe Switch); the access path is long, the latency is large, and the expandable capacity is limited.
To solve the problem of the long access path when an accelerator card uses a memory expansion card as expanded memory, an embodiment of the present invention provides an accelerator card including a processor core, a storage controller, a first connector, and a storage component. The storage controller is arranged between the processor core and the first connector, and the first connector is further connected to the storage component. The processor core is configured to access the storage component through the storage controller via the first connector and to establish a first mapping relation between an address space of the processor core and an address of the storage component, so as to execute read-write tasks on the storage component based on the first mapping relation; the storage controller responds to the read-write tasks by executing read-write operations on the storage component. In distributed computing, this improves the utilization of the computing power resources of a single card, reduces the communication volume among accelerator cards, relieves the communication bottleneck, and improves distributed computing efficiency.
Fig. 1 is a schematic structural diagram of an accelerator card according to an embodiment of the present invention, and Fig. 2 is a schematic structural diagram of a computing processor according to an embodiment of the present invention.
The accelerator card provided by the embodiment of the present invention may include a processor core, a storage controller, a first connector, and a storage component. The storage controller is arranged between the processor core and the first connector, and the first connector is further connected to the storage component. The processor core is configured to access the storage component through the storage controller via the first connector and to establish a first mapping relation between the address space of the processor core and the address of the storage component, so as to execute read-write tasks on the storage component based on the first mapping relation; the storage controller is configured to respond to the read-write tasks by executing read-write operations on the storage component.
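Purely as an illustrative sketch of such a first mapping relation (the structure names and the windowed-table layout are assumptions, not taken from the embodiment), the translation from a processor-core address to a storage component address can be modeled in C as a lookup over contiguous address windows:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical entry of the first mapping relation: a window of the
 * processor core's address space mapped onto the storage component. */
struct map_entry {
    uint64_t core_base;    /* start address in the processor core's space */
    uint64_t size;         /* window size in bytes */
    uint64_t storage_base; /* corresponding address in the storage component */
};

/* Translate a core address; returns 0 on success, -1 if unmapped. */
static int translate(const struct map_entry *map, int n,
                     uint64_t core_addr, uint64_t *storage_addr)
{
    for (int i = 0; i < n; i++) {
        if (core_addr >= map[i].core_base &&
            core_addr < map[i].core_base + map[i].size) {
            *storage_addr = map[i].storage_base + (core_addr - map[i].core_base);
            return 0;
        }
    }
    return -1;
}

int main(void)
{
    struct map_entry map[] = {
        { 0x100000000ull, 1ull << 30, 0x0ull }, /* one 1 GiB window */
    };
    uint64_t s;
    if (translate(map, 1, 0x100000040ull, &s) == 0)
        printf("storage address: 0x%llx\n", (unsigned long long)s);
    return 0;
}
```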
In embodiments of the present invention, the accelerator card may be, for example, a graphics processing unit (GPU) card.
The external interfaces of an accelerator card typically include a first interface for connecting to a server host and inter-card interconnect connectors for connecting to other accelerator cards. The first interface is typically a PCIe interface inserted, in the form of a golden finger, into a first slot (a PCIe slot) of the server host. The inter-card interconnect connector may be a high-speed inter-card interconnect connector such as NVLink. Different accelerator cards in the same server can be directly interconnected through the inter-card interconnect connectors, and accelerator cards of different servers can be interconnected through the inter-card interconnect connectors and a switching controller matched with NVLink.
In the embodiment of the present invention, the first connector for connecting the storage component may be an inter-card interconnect connector of the accelerator card, and the storage controller may be the cross-board card forwarding control module corresponding to that connector. That is, one or more inter-card interconnect connectors of the accelerator card may be configured as first connectors for connecting storage components, and the cross-board card forwarding control module originally used to control those connectors for the inter-card interconnect function may be configured as the storage controller.
In other optional implementations of the embodiments of the present invention, if an inter-card interconnect connector of the accelerator card is used as the first connector for connecting the storage component, the storage controller may instead be additionally deployed based on hardware resources of the computing processor of the accelerator card, and the connection between the inter-card interconnect connector and the cross-board card forwarding control module may be changed to a connection to that storage controller.
In the embodiment of the present invention, the first connector for connecting the storage component may also use first pins in a first interface of the accelerator card, where the first interface is the interface through which the accelerator card connects to the server host. The storage controller is then connected to the storage component through the first pins, on-board wiring of the server host, and a first slot of the server host, the first slot being used for installing the storage component. When the first interface has enough channels, it can be configured into two lanes: one lane is still used for connecting to the server host, and the other lane is used for connecting, through the on-board wiring of the server host, to the first slot in which the storage component is installed, so that the storage controller is connected to the storage component via the first connector.
Therefore, by configuring the external interface resources of the accelerator card as the first connector for connecting the storage component, and by implementing the storage controller based on the computing processor of the accelerator card, the problem of low computing power utilization caused by the mismatch between the computing power of the accelerator card and its memory resources can be solved.
In the embodiment of the present invention, the processor core may also be obtained by programming an accelerator card instance based on the logic circuit of a first controller. The first controller may be a programmable controller, such as a field programmable gate array (FPGA), or another type of programmable controller.
The logic circuit of the programmable controller provides hardware resources: an accelerator card instance can be built from open-source accelerator card code to serve as the processor core, and other hardware resources on the first controller, such as serial-parallel conversion channels (SerDes), Compute Express Link controller (CXL IP) resources, and storage controller resources, can be used to mount a storage component of larger capacity. The high-computing-power accelerator card instance is thus directly attached to the expanded memory, which solves the problem in the related art of low computing power utilization caused by the mismatch between the computing power of the accelerator card and its memory resources.
As shown in Fig. 1, the processor core and the storage controller may be provided in a computing processor 101.
The computing processor 101 may include one or more processor cores (processor core 1, processor core 2, ..., processor core N, as shown in Fig. 2).
If the computing processor 101 includes multiple processor cores, the multiple processor cores may share the memory resources of the storage component, or the memory resources may be divided into portions corresponding to different processor cores; that is, the storage component may include memory resources that are shared by, or exclusive to, individual processor cores.
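Where the memory resources are divided among processor cores, the bookkeeping can be as simple as the following C sketch (equal division among cores is an assumption for illustration; the embodiment also permits fully shared or mixed resources):

```c
#include <stdint.h>
#include <stdio.h>

/* Divide a storage component of `total` bytes equally among `n` processor
 * cores; returns the base offset of core i's exclusive region. */
static uint64_t core_region_base(uint64_t total, int n, int i)
{
    return (total / n) * (uint64_t)i;
}

int main(void)
{
    uint64_t total = 1ull << 34;   /* e.g. a 16 GiB storage component */
    int n = 4;                     /* processor cores 1..N with N = 4 */
    for (int i = 0; i < n; i++)
        printf("core %d: base 0x%llx, size 0x%llx\n", i + 1,
               (unsigned long long)core_region_base(total, n, i),
               (unsigned long long)(total / n));
    return 0;
}
```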
As shown in Fig. 1, the accelerator card 100 may include a first golden finger interface 109, where the first golden finger interface 109 may be a 16-lane fifth-generation PCIe (PCIe Gen5 x16) standard golden finger interface, and the accelerator card 100 is inserted into a PCIe slot of the server host 200 through the first golden finger interface 109. If the first connector 104 is implemented by pins of the first interface, those pins are pins of the first golden finger interface 109.
In the embodiment of the present invention, a connector of the accelerator card 100 used for inter-card interconnection may be described as a second connector.
The second connector may include a third connector 105 for interconnecting different accelerator cards 100 within the same server; different accelerator cards 100 within the same server may be interconnected through the third connector 105 and a cable.
The third connector 105 may be an NVLink connector, so that different accelerator cards 100 in the same server can be directly connected through the third connector 105.
The third connector 105 may also be a Multi-Channel Input/Output (MCIO) connector, and the cable may be a PCIe bus cable. The third connector 105 of the accelerator card 100 may be connected to a switching controller through the cable, and interconnection among accelerator cards 100, and between an accelerator card 100 and the server host 200, is achieved via the switching controller.
In other alternative implementations of embodiments of the present invention, at least one thread of at least one processor core runs a Root Complex (RC) bridge driver to configure the storage component as an Endpoint (EP) device. On this basis, the processor core can actively access the storage component of another accelerator card 100 through the direct connection between the accelerator cards 100, without going through the server host 200, a switching controller, or a switch card, thereby further shortening the access path and reducing access latency. When the third connector 105 is a Multi-Channel I/O (MCIO) connector, each MCIO x4 interface contains 4 pairs of high-speed serial-parallel conversion (SerDes) channels, and each channel supports a bandwidth of 50 Gb/s, so each MCIO x4 interface provides a unidirectional bandwidth of 200 Gb/s and a bidirectional bandwidth of 400 Gb/s, which satisfies the requirement of direct high-bandwidth communication between cards. Any two cards in a node can communicate directly through the multi-channel link technology (MC-link), which facilitates system integration, needs no switch or adapter card, and has high-bandwidth, low-latency characteristics.
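The bandwidth figures above follow from simple arithmetic, reproduced as a sketch below using only the lane count and per-lane rate stated in this paragraph:

```c
#include <stdio.h>

int main(void)
{
    const int lanes_per_mcio_x4 = 4; /* 4 pairs of SerDes channels */
    const int gbps_per_lane = 50;    /* 50 Gb/s per channel */

    int unidirectional = lanes_per_mcio_x4 * gbps_per_lane; /* 200 Gb/s */
    int bidirectional = 2 * unidirectional;                 /* 400 Gb/s */

    printf("MCIO x4: %d Gb/s unidirectional, %d Gb/s bidirectional\n",
           unidirectional, bidirectional);
    return 0;
}
```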
As shown in Fig. 1, the second connector of the accelerator card 100 for inter-card interconnection may further include an optical module 106. Through the optical module 106, the accelerator card 100 may access a unified address management server 400 to report its local storage resources, receive the supernode unified address space allocated by the unified address management server 400 for the plurality of accelerator cards 100, and initialize a Unified Address Translation Table (UATT), denoted as a second mapping relation, based on the allocated supernode unified address space and the address space of the local processor core. Thus, based on the global unified address, interconnection among accelerator cards 100 in the same server and across servers can be realized, and the server host 200 can access the accelerator cards 100 based on the global unified address. The optical module 106 may be a 400G optical network module that accesses the optical network through the optical module interface shown in Fig. 2.
In the embodiment of the present invention, the optical module 106 may also reuse an inter-card interconnect connector of the accelerator card: one or more inter-card interconnect connectors of the accelerator card 100 may be configured as the optical module 106, and the cross-board card forwarding control module originally used to control the inter-card interconnect connectors for the inter-card interconnect function may be configured as the supernode forwarding module provided in the embodiment of the present invention, or an additional supernode forwarding module may be configured based on the hardware resources of the computing processor 101 of the accelerator card 100, so that a larger communication bandwidth is achieved for inter-card interconnection across servers.
In the embodiment of the present invention, at least one thread of at least one processor core of the accelerator card 100 runs the RC bridge driver; different accelerator cards 100 in the same server may be directly interconnected through the first connector 104 and a cable, and accelerator cards 100 in different servers may be interconnected point-to-point through the optical modules 106, or many-to-many through the optical modules 106 and an Ethernet switch.
In the embodiment of the present invention, the processor core executing the read-write task on the storage component based on the first mapping relation may include: the processor core reports the storage resource information of the accelerator card 100 to the unified address management server 400; receives the supernode unified address space allocated by the unified address management server 400 for the accelerator card 100; initializes the second mapping relation between the supernode unified address space and the address space of the processor core; and executes the read-write task on the storage component based on the second mapping relation and the first mapping relation.
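Purely as a sketch of how the two mapping relations chain together (structure names and example addresses are hypothetical), a unified supernode address can first be translated through the second mapping relation into the processor core's address space and then through the first mapping relation into a storage component address:

```c
#include <stdint.h>
#include <stdio.h>

/* Second mapping relation (hypothetical layout): one window of the
 * supernode unified address space mapped onto the core's address space. */
struct uatt_entry {
    uint64_t unified_base; /* base in the supernode unified address space */
    uint64_t size;
    uint64_t core_base;    /* base in the processor core's address space */
};

/* First mapping relation (hypothetical): core address -> storage address. */
struct core_map_entry {
    uint64_t core_base;
    uint64_t size;
    uint64_t storage_base;
};

/* Chain the two mappings: unified -> core -> storage address.
 * Returns 0 on success, -1 if either lookup misses. */
static int unified_to_storage(const struct uatt_entry *u,
                              const struct core_map_entry *m,
                              uint64_t unified, uint64_t *storage)
{
    if (unified < u->unified_base || unified >= u->unified_base + u->size)
        return -1;
    uint64_t core = u->core_base + (unified - u->unified_base);
    if (core < m->core_base || core >= m->core_base + m->size)
        return -1;
    *storage = m->storage_base + (core - m->core_base);
    return 0;
}

int main(void)
{
    struct uatt_entry u = { 0x8000000000000000ull, 1ull << 30, 0x100000000ull };
    struct core_map_entry m = { 0x100000000ull, 1ull << 30, 0x0ull };
    uint64_t s;
    if (unified_to_storage(&u, &m, 0x8000000000000100ull, &s) == 0)
        printf("storage address: 0x%llx\n", (unsigned long long)s);
    return 0;
}
```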
As shown in Fig. 1, the accelerator card 100 may further include a management controller 107, where the management controller 107 is connected to the processor core of the accelerator card 100 and is used for monitoring the status of the components of the accelerator card. The management controller 107 may be connected to the computing processor 101, a power module 108, and other components of the accelerator card 100, and is configured to monitor the operating state of each component and perform fault detection, fault recording, fault reporting, and so on. The management controller 107 may be a microcontroller (Micro Controller Unit, MCU) that collects information such as the voltage, current, and temperature of the components on the accelerator card 100 via sensors.
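As a rough illustration of such monitoring (sensor names and limits are hypothetical; a real MCU would read its sensors over a management bus rather than from constants), threshold-based fault detection might look like:

```c
#include <stdio.h>

/* Hypothetical sensor reading collected by the MCU via its sensors. */
struct sample { const char *name; double value; double limit; };

/* Fault detection: compare each reading against its limit and report. */
static void check(const struct sample *s)
{
    if (s->value > s->limit)
        printf("FAULT: %s = %.1f exceeds limit %.1f\n",
               s->name, s->value, s->limit);
    else
        printf("OK:    %s = %.1f\n", s->name, s->value);
}

int main(void)
{
    struct sample samples[] = {
        { "voltage (V)",     12.1, 12.6 },
        { "current (A)",     30.0, 40.0 },
        { "temperature (C)", 92.0, 85.0 },  /* over limit: reported */
    };
    for (int i = 0; i < 3; i++)
        check(&samples[i]);
    return 0;
}
```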
The accelerator card provided by the embodiment of the present invention includes a processor core, a storage controller, a first connector, and a storage component, where the storage controller is arranged between the processor core and the first connector, and the first connector is further connected to the storage component. The processor core is configured to access the storage component through the storage controller via the first connector and to establish a first mapping relation between the address space of the processor core and the address of the storage component, so as to execute read-write tasks on the storage component based on the first mapping relation; the storage controller responds to the read-write tasks by executing read-write operations on the storage component. In distributed computing, this improves the utilization of the computing power resources of a single card, reduces the communication volume among accelerator cards, relieves the communication bottleneck, and improves distributed computing efficiency.
Based on the above embodiments, as shown in Fig. 1, in the accelerator card 100 provided in the embodiment of the present invention, the storage controller may include a memory access controller (not shown in Fig. 1), and the storage component may be memory granules 102.
The memory granules 102 may be in the form of Double Data Rate (DDR) memory, in the form of dual in-line memory modules (DIMMs), in the form of multiple memory granules 102 integrated on a circuit board, in the form of registered DIMMs (RDIMMs), unbuffered DIMMs (UDIMMs), or mini DIMMs (Mini-DIMMs), or memory granules 102 in other package forms.
As shown in Fig. 2, a memory access controller is provided in the computing processor 101 for interfacing with the memory granules 102. The memory access controller is used for parsing and forwarding access request information and access response information of the storage component. The access request information may include an access type and an access address.
To further expand the storage resources of the accelerator card 100, in the accelerator card 100 provided by the embodiment of the present invention, the storage controller may include a memory access controller and a storage access controller, and the storage component may include memory granules 102 and a first nonvolatile storage device 103. The memory access controller is further connected to the memory granules 102 through the corresponding first connector 104; a first end of the storage access controller is connected to the memory access controller, and a second end of the storage access controller is connected to the first nonvolatile storage device 103 through the corresponding first connector 104.
The memory access controller parsing and forwarding the access request information and the access response information of the storage component may include: the memory access controller parses the access request information; if the access address in the access request information hits the memory, the corresponding access response information is read from the memory granules 102 and returned; if the access address does not hit the memory, the access request information is forwarded to the storage access controller, and the response data in the access response information returned by the storage access controller is cached in the memory granules 102. The storage access controller is configured to read the corresponding response data from the first nonvolatile storage device 103 according to the access address in the access request information sent by the memory access controller, and to return the response data. Thus, the first nonvolatile storage device 103 can be used to expand larger storage resources for the accelerator card 100, and the processing efficiency of access tasks can be improved based on this tiered memory-storage architecture.
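A minimal C sketch of this hit/miss flow follows; the names, the direct-mapped placement of lines in the memory granules, and the toy capacities are assumptions for illustration only:

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define LINE_SIZE 64
#define NUM_LINES 4   /* toy capacity for the memory-granule cache */

struct cache_line {
    int      valid;
    uint64_t tag;              /* storage address of the cached line */
    uint8_t  data[LINE_SIZE];
};

static struct cache_line mem_granule[NUM_LINES]; /* stands in for granules 102 */
static uint8_t nv_device[1 << 12];               /* stands in for device 103 */

/* Storage access controller: fetch a line from the nonvolatile device
 * (toy bounds; no wraparound handling). */
static void storage_read(uint64_t addr, uint8_t *out)
{
    memcpy(out, &nv_device[addr % sizeof nv_device], LINE_SIZE);
}

/* Memory access controller: serve from the granules on a hit, otherwise
 * forward to the storage access controller and cache the response. */
static void mem_access_read(uint64_t addr, uint8_t *out)
{
    uint64_t line = addr / LINE_SIZE * LINE_SIZE;
    struct cache_line *c = &mem_granule[(line / LINE_SIZE) % NUM_LINES];
    if (!(c->valid && c->tag == line)) {   /* miss: go to device 103 */
        storage_read(line, c->data);
        c->tag = line;
        c->valid = 1;
    }
    memcpy(out, &c->data[addr - line], 1); /* return one byte for brevity */
}

int main(void)
{
    nv_device[0x100] = 0xAB;
    uint8_t v;
    mem_access_read(0x100, &v);            /* miss, then cached */
    mem_access_read(0x100, &v);            /* hit */
    printf("read 0x%02X\n", v);
    return 0;
}
```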
In the embodiment of the present invention, the first nonvolatile storage device 103 may be, but is not limited to, a solid state disk (such as a Non-Volatile Memory Express solid state disk, NVMe SSD), a flash memory (NAND flash memory), an electrically erasable programmable read-only memory (EEPROM), or another type of nonvolatile storage device.
The accelerator card 100 provided in the embodiment of the present invention can directly attach the first nonvolatile storage device 103, such as an NVMe solid state disk, through the first connector 104, which reduces the coupling between the accelerator card 100 and the server when using expanded memory, reduces the latency of the processor core of the accelerator card 100 accessing the expanded memory, and increases the local storage capacity of the accelerator card 100.
The above embodiments have described that the storage component may act as shared storage for multiple processor cores. If the memory granules 102 are used as the shared memory of multiple processor cores, as shown in Fig. 2, the storage controller may further include a shared memory controller disposed between the processor cores and the memory access controller, where the shared memory controller is configured to perform a first data movement task between the server host 200 and the accelerator card 100 and/or a second data movement task between different processor cores in the accelerator card 100.
The shared memory controller can realize shared memory based on the Compute Express Link memory (CXL.mem) protocol and can also serve as a DMA controller; an expanded DDR4 DIMM controller can be mounted, and the DMA controller and the memory can cache part of the data in the NVMe solid state disk. The DMA controller is responsible for data movement between the host and the accelerator card 100, data movement within the accelerator card 100, and RDMA data movement over the network.
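As a sketch of the data movement tasks the DMA controller is responsible for (descriptor layout and names are hypothetical; in real hardware the copy is offloaded to the DMA engine rather than performed by memcpy), the bookkeeping might be:

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Hypothetical classification matching the two task types in the text. */
enum dma_kind { HOST_TO_CARD, CORE_TO_CORE };

struct dma_desc {
    enum dma_kind kind;
    void   *src;
    void   *dst;
    size_t  len;
};

/* Toy "DMA engine": executes one descriptor to show the semantics. */
static void dma_execute(const struct dma_desc *d)
{
    memcpy(d->dst, d->src, d->len);
}

int main(void)
{
    uint8_t host_buf[16] = "host payload";
    uint8_t card_buf[16] = {0};
    struct dma_desc d = { HOST_TO_CARD, host_buf, card_buf, sizeof host_buf };
    dma_execute(&d);
    printf("card received: %s\n", (char *)card_buf);
    return 0;
}
```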
On the basis of the above embodiments, as shown in Fig. 2, the accelerator card 100 may further include a port routing forwarding module, a cross-board card forwarding module, and a second connector. The port routing forwarding module is disposed between the processor core and the storage controller and is further connected to the cross-board card forwarding module; the cross-board card forwarding module is connected to another accelerator card 100 through the second connector and is used for forwarding data packets between the accelerator card 100 and the other accelerator card 100; and the port routing forwarding module is used for performing format conversion of the data packets between the accelerator card 100 and the other accelerator card 100.
In the embodiment of the present invention, the port routing forwarding module and the cross-board card forwarding module may be disposed in the computing processor 101 and implemented by programming the logic circuit of the first controller.
The port routing forwarding module may be configured to convert a data packet based on a unified address, used to access the memory of another local accelerator card 100, into a data packet based on a port ID and an offset address, and send the converted packet to the cross-board card forwarding module.
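Purely for illustration, this conversion can be sketched in C using the unified-address bit layout described later in this text (bit [63] flag, node ID in bits [62:56], card bits in [55:48], 48-bit offset); the direct card-bits-to-port-ID mapping is an assumption:

```c
#include <stdint.h>
#include <stdio.h>

/* Converted packet carrying a port ID and an offset address. */
struct fwd_packet {
    uint8_t  port_id; /* which inter-card port to forward on */
    uint64_t offset;  /* offset within the target card's memory */
};

static int convert(uint64_t unified_addr, struct fwd_packet *out)
{
    if (!(unified_addr >> 63))                   /* bit [63]: not unified */
        return -1;
    uint8_t card = (unified_addr >> 48) & 0xFF;  /* bits [55:48] */
    out->port_id = card;                         /* hypothetical card->port map */
    out->offset  = unified_addr & ((1ull << 48) - 1);
    return 0;
}

int main(void)
{
    struct fwd_packet p;
    uint64_t addr = (1ull << 63) | (1ull << 48) | 0x1000; /* card 1, off 0x1000 */
    if (convert(addr, &p) == 0)
        printf("port %u, offset 0x%llx\n", p.port_id,
               (unsigned long long)p.offset);
    return 0;
}
```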
In the embodiment of the present invention, the cross-board card forwarding module may be an intra-node forwarding module, and the intra-node forwarding module is connected to another accelerator card 100 of the server through the second connector and a cable.
In the embodiment of the present invention, the intra-node forwarding module, that is, the intra-node forwarding media access control (MAC) module, may be used to forward data packets, based on the port ID, that access the memory of another local accelerator card 100. For example, between two interconnected accelerator cards 100 in one server, the memory of one accelerator card 100 may be accessed through the local intra-node forwarding module.
In the embodiment of the present invention, the cross-board card forwarding module may be a supernode forwarding module, and the supernode forwarding module is connected to another accelerator card 100 of another server through the second connector.
In the embodiment of the present invention, the supernode forwarding module, that is, the supernode forwarding MAC module, may be configured to send and receive, through the optical module 106, data packets that access the local memory.
As shown in Fig. 2, in the embodiment of the present invention, the second connector corresponding to the supernode forwarding module may be the optical module 106, and the accelerator card 100 further includes a remote direct memory access protocol stack module disposed between the supernode forwarding module and the storage controller. For the optical module 106, reference may be made to the description of the above embodiments. In the embodiment of the present invention, the remote direct memory access protocol stack module may be used to run a Remote Direct Memory Access (RDMA) protocol based on Ethernet, to handle Direct Memory Access (DMA) requests from the network that access the expanded memory of the accelerator card 100.
In the embodiment of the present invention, in order to realize many-to-many interconnection of accelerator cards across servers, the supernode forwarding module is connected to another accelerator card of another server through the second connector; specifically, the supernode forwarding module may be connected to an Ethernet switch through the optical module 106 of the accelerator card, and connected, through the Ethernet switch, to the optical module 106 of the other accelerator card of the other server.
The accelerator card 100 provided in the embodiment of the present invention may further include an arbiter disposed between the port routing forwarding module and the storage controller, where the arbiter is configured to arbitrate among multiple access tasks directed at the storage component.
In the embodiment of the present invention, the cross-board card forwarding module may include an intra-node forwarding module and a supernode forwarding module. The intra-node forwarding module is connected to another accelerator card 100 in the server through the corresponding second connector and a cable, and the supernode forwarding module is connected to another accelerator card 100 of another server through the corresponding second connector. The accelerator card 100 further includes a remote direct memory access protocol stack module disposed between the supernode forwarding module and the arbiter. The type of an access task directed at the storage component may be at least one of: an access task of the server host 200 to the storage component, an access task of another accelerator card 100 of the server to the storage component, and an access task of another accelerator card 100 of another server to the storage component.
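A minimal arbiter sketch with one request class per access-task source named above follows; the round-robin policy is an assumption, since the embodiment does not specify an arbitration scheme:

```c
#include <stdio.h>

/* The three access-task sources named above. */
enum source { HOST = 0, CARD_IN_NODE = 1, CARD_CROSS_NODE = 2, NUM_SRC = 3 };

static const char *name[NUM_SRC] = { "host", "card-in-node", "card-cross-node" };

/* pending[s] counts outstanding access tasks from source s. */
static int pending[NUM_SRC] = { 2, 1, 1 };

/* Round-robin arbitration: grant the next source with a pending task. */
static int arbitrate(int *last)
{
    for (int i = 1; i <= NUM_SRC; i++) {
        int s = (*last + i) % NUM_SRC;
        if (pending[s] > 0) {
            pending[s]--;
            *last = s;
            return s;
        }
    }
    return -1; /* no pending tasks */
}

int main(void)
{
    int last = NUM_SRC - 1, s;
    while ((s = arbitrate(&last)) >= 0)
        printf("granted: %s\n", name[s]);
    return 0;
}
```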
The accelerator card 100 provided in the embodiment of the present invention may further include a control register module connected to the processor core, the remote direct memory access protocol stack module, the arbiter, and the storage controller, where the control register module is configured to allow the processor core to perform remote direct memory access register management on the remote direct memory access protocol stack module, the arbiter, and the storage controller.
As shown in Fig. 2, the control register module and the port routing forwarding module may be mounted on the system bus, thereby shortening the communication path to the processor core.
A conventional accelerator card, when communicating with resources outside its server, requires point-to-point communication via an RDMA network card and a PCIe switch (PCIe Switch). The accelerator card 100 provided in the embodiment of the present invention carries an optical module 106 and may access an RDMA over Converged Ethernet version 2 (RoCEv2) network or a standard Ethernet to communicate with other computing nodes or storage nodes; the computing processor 101 of the accelerator card 100 does not need to connect to an RDMA network card through a PCIe Switch as in the conventional scheme. Using the RoCE protocol stack over the optical network, an NVMe solid state disk can be remotely initialized and accessed and used as expanded memory accessible to the processor core of the accelerator card 100, which greatly increases the shared memory capacity and reduces the degree of data-parallel splitting and the data traffic. Multi-node unified memory management is realized based on the 400G optical network and the address management server, such as power-on self-checking and reporting of the local memory size of each node. The 400G optical network of each accelerator card 100 also serves as a redundant backup channel for the direct inter-card channels, providing connectivity and traffic diversion when any direct channel between two cards fails or is congested.
In summary, the accelerator card 100 provided in the embodiment of the present invention may adopt the optical module 106 and 400G optical network interconnection, and supports heterogeneous memory expansion, with expanded memory and NVMe solid-state storage mounted to the processor core of the accelerator card 100. The accelerator card 100 may include the computing processor 101, the first connector 104, a power management module, the management controller 107, the 400G optical module 106, optical cables, a PCIe golden finger, an Open Memory Interface (OMI) memory module, DDR DIMM memory chips, and the like.
If the processor core is implemented by programming the first controller, four accelerator cards 100 can be installed in a single server. Each accelerator card 100 has a golden finger interface supporting the PCIe Gen5 x16 standard and MCIO x4 interfaces supporting the PCIe Gen5 standard; the golden fingers of the four cards are respectively inserted into PCIe slots of the server host 200, and any two cards are connected through a high-speed MCIO x4 cable. The first MCIO x4 interface of each accelerator card 100 is connected to an NVMe solid state disk through a converter. Each MCIO x4 interface contains 4 pairs of high-speed SerDes channels, and each channel supports a bandwidth of 50 Gb/s, so each MCIO x4 interface provides a unidirectional bandwidth of 200 Gb/s and a bidirectional bandwidth of 400 Gb/s, satisfying the requirement of direct high-bandwidth communication between cards.
Fig. 3 is a schematic structural diagram of inter-card interconnection of accelerator cards according to an embodiment of the present invention.
In the following, four accelerator cards 100 (accelerator card 1, accelerator card 2, accelerator card 3, accelerator card 4) provided by the present invention are taken as an example; the server host 200 of a two-socket server includes central processor 0 and central processor 1, and the accelerator cards 100 are interconnected as shown in Fig. 3. The direct inter-card communication of the accelerator cards 100, and the data flows when accessing the local expanded memory, when accessing the expanded memory on other cards in the node, and when remotely accessing expanded memory, are as follows.
The accelerator card 100 is powered on, and the server system loads the driver configuration file to complete initialization. Processor core 1 on each accelerator card 100 starts a thread to load the RC bridge driver, completing link initialization with the expanded-memory EP device, and the inter-board multi-channel link (MC-link) physical layer completes training and connection.
The server system then initiates initialization of the inter-board MC-link MAC layer and a self-checking communication test of the transport layer between board cards in the node, according to the default board card interconnection topology configuration.
The server nodes report the self-checking results and the sizes of their storage resources; the unified address management server 400 starts a unified address management and allocation application and allocates a segment of the supernode unified address space for the storage of each server node, and each node server initializes its Unified Address Translation Table (UATT).
After each server node obtains the unified address space, the unified address space is allocated to the storage component of each accelerator card 100, and a Base Address Mapping Table (BAMT) is initialized. The addresses are 64-bit; the low 48 bits are all 0 and are omitted. Table 1 shows an example of the address [55:48] bitmap of the four accelerator cards 100.
TABLE 1
In the unified address, bits [62:56] represent the IDs of different server nodes, and bit [63] distinguishes between a unified address and a local address.
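Under this layout, decoding a unified address reduces to the bitfield extraction sketched below (which polarity of bit [63] marks a unified address, and the example field values, are assumptions for illustration):

```c
#include <stdint.h>
#include <stdio.h>

/* Field extraction per the layout above: [63] unified/local flag,
 * [62:56] server node ID, [55:48] accelerator card bitmap, [47:0] offset. */
static inline unsigned is_unified(uint64_t a) { return (a >> 63) & 0x1;  }
static inline unsigned node_id(uint64_t a)    { return (a >> 56) & 0x7F; }
static inline unsigned card_bits(uint64_t a)  { return (a >> 48) & 0xFF; }
static inline uint64_t offset(uint64_t a)     { return a & ((1ull << 48) - 1); }

int main(void)
{
    /* Hypothetical example: node 3, card bitmap 0x02, offset 0x2000. */
    uint64_t a = (1ull << 63) | (3ull << 56) | (0x02ull << 48) | 0x2000;
    printf("unified=%u node=%u cards=0x%02X offset=0x%llx\n",
           is_unified(a), node_id(a), card_bits(a),
           (unsigned long long)offset(a));
    return 0;
}
```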
The expanded memory mounted under the accelerator card 100 may be accessed by the local central processor within the server node, by the accelerator card 100 itself, by other accelerator cards 100 within the node, and by other nodes through the optical network.
Meanwhile, the present invention provides a multi-channel interconnection scheme among the accelerator cards 100 within a node and a cross-node interconnection topology, which can realize both horizontal and vertical scaling of computing power clusters and direct communication between devices within a node; the direct communication channels and the network channels are mutually redundant, improving the reliability of the computing system.
The embodiment of the present invention also provides an accelerated computing system, which includes a plurality of interconnected accelerator cards.
The accelerator card includes a processor core, a storage controller, a first connector, and a storage component, where the storage controller is arranged between the processor core and the first connector, and the first connector is further connected to the storage component. The processor core is configured to access the storage component through the storage controller and to establish a first mapping relation between the address space of the processor core and the address of the storage component, so as to execute read-write tasks on the storage component based on the first mapping relation; the storage controller is configured to respond to the read-write tasks by executing read-write operations on the storage component.
For the computing system provided by the embodiment of the present invention, reference may be made to the description of the above embodiments.
In the accelerated computing system provided by the embodiment of the present invention, the accelerator card includes a processor core, a storage controller, a first connector, and a storage component, where the storage controller is arranged between the processor core and the first connector, and the first connector is further connected to the storage component. The processor core is configured to access the storage component through the storage controller via the first connector and to establish a first mapping relation between the address space of the processor core and the address of the storage component, so as to execute read-write tasks on the storage component based on the first mapping relation, and the storage controller responds to the read-write tasks by executing read-write operations on the storage component. The system can use the rich high-speed SerDes, CXL IP resources, and NVMe controller IP of an FPGA chip to mount DDR DIMM memory modules and NVMe SSD solid state disks, realizing direct mounting of TB-level expanded memory to the GPU instance and direct data communication between GPUs through MC-link. This increases the utilization of GPU shared memory, improves the utilization of the computing resources of a single GPU, reduces the volume of data communication among GPUs, relieves the communication bottleneck of large models, shortens training and inference time, and effectively reduces the data center cost of AI computing power.
The embodiment of the present invention also provides a control method of an accelerator card, applied to a processor core of the accelerator card, which may include: accessing a storage component through a storage controller to acquire storage resource information of the storage component; establishing a first mapping relation between the address space of the processor core and the address of the storage component according to the storage resource information; and executing read-write tasks on the storage component based on the first mapping relation, where the storage controller is arranged between the processor core and a first connector of the accelerator card, and the first connector is further connected to the storage component.
For the control method of the accelerator card provided by the embodiment of the present invention, reference may be made to the description of the above embodiments.
An embodiment of the present invention also provides an electronic device including a memory and a processor, where a computer program is stored in the memory, and the processor is configured to run the computer program to perform the steps of the control method of the accelerator card in any of the above embodiments.
An embodiment of the present invention also provides a computer-readable storage medium having a computer program stored therein, where the computer program is configured to perform, when executed, the steps of the control method of the accelerator card in any of the above embodiments.
In an exemplary embodiment, the computer-readable storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media capable of storing a computer program.
The embodiment of the present invention also provides a computer program product, including a computer program, where the computer program, when executed by a processor, implements the steps of the control method of the accelerator card in any of the above embodiments.
Embodiments of the present invention also provide another computer program product, including a non-volatile computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the steps of the control method of the accelerator card in any of the above embodiments.
Those of skill would further appreciate that the various illustrative units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, the various illustrative units and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The accelerator card, the control method, and the accelerated computing system provided by the present invention have been described in detail above. The principles and embodiments of the present invention have been described herein with reference to specific examples, and the description of these examples is intended only to facilitate an understanding of the method of the present invention and its core ideas. It should be noted that it will be apparent to those skilled in the art that the present invention may be modified and improved without departing from the spirit of the present invention, and such modifications and improvements also fall within the scope of the present invention.