Detailed Description
The following description of embodiments of the present invention is made clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, but not all, embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without inventive effort fall within the scope of the present invention.
It should be noted that in the description of the present invention, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. The terms "first," "second," and the like in this specification are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
To enable those skilled in the art to better understand the aspects of the present invention, the present invention is described in further detail below with reference to the drawings and specific embodiments.
In cloud computing and artificial intelligence computing, modern data centers deploy more and more accelerator cards, and computation uses more and more processor cores, while the memory capacity and bandwidth allocated to each processor core do not grow correspondingly. Artificial intelligence model computation requires more accelerator cards and memory units: the computing power of a single accelerator card is limited, so multiple accelerator cards must compute cooperatively, which involves inter-card bus interconnection and communication to transfer intermediate computation cache data. The memory accessible to a single accelerator card is likewise limited, so the memory capacity of accelerator cards is continuously expanded to enable more memory sharing among accelerator cards.
Compute Express Link (CXL) is a high-speed serial protocol that allows fast, reliable data transfer between different components within a computer system. It aims to solve bottleneck problems in high-performance computing, including memory capacity, memory bandwidth, and I/O latency. CXL can also realize memory expansion and memory sharing, and can communicate with external devices such as computing accelerators (e.g., a graphics processing unit (GPU) or a field programmable gate array (FPGA)), thereby providing a faster and more flexible way of exchanging and processing data.
In the related art, a memory expansion card built from memory granules in the form of a Peripheral Component Interconnect Express (PCIe) golden finger, based on the Compute Express Link memory protocol (CXL Type 3), can provide expanded memory for an accelerator card. However, to access the expanded memory, the accelerator card must go through a memory copy operation of the central processing unit (CPU), through a Root Complex (RC) controller integrated inside the CPU, or through a PCIe switching controller (PCIe Switch); the access path is long, the latency is large, and the expandable capacity is limited.
To solve the problem of the long access path when an accelerator card uses a memory expansion card as expanded memory, an embodiment of the present invention provides an accelerator card including a processor core, a storage controller, a first connector, and a storage component. The storage controller is arranged between the processor core and the first connector, and the first connector is further connected to the storage component. The processor core is configured to access the storage component through the storage controller via the first connector and to establish a first mapping relation between an address space of the processor core and an address of the storage component, so as to execute read-write tasks on the storage component based on the first mapping relation; the storage controller responds to the read-write tasks by executing read-write operations on the storage component. In distributed computing, this improves the utilization of the computing power resources of a single card, reduces the communication volume among accelerator cards, relieves the communication bottleneck, and improves distributed computing efficiency.
Fig. 1 is a schematic structural diagram of an accelerator card according to an embodiment of the present invention, and Fig. 2 is a schematic structural diagram of a computing processor according to an embodiment of the present invention.
The accelerator card provided by the embodiment of the present invention may include a processor core, a storage controller, a first connector, and a storage component. The storage controller is arranged between the processor core and the first connector, and the first connector is further connected to the storage component. The processor core is configured to access the storage component through the storage controller via the first connector and to establish a first mapping relation between the address space of the processor core and the address of the storage component, so as to execute read-write tasks on the storage component based on the first mapping relation; the storage controller is configured to respond to the read-write tasks by executing read-write operations on the storage component.
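Purely as an illustrative sketch of such a first mapping relation (the structure names and the windowed-table layout are assumptions, not taken from the embodiment), the translation from a processor-core address to a storage component address can be modeled in C as a lookup over contiguous address windows:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical entry of the first mapping relation: a window of the
 * processor core's address space mapped onto the storage component. */
struct map_entry {
    uint64_t core_base;    /* start address in the processor core's space */
    uint64_t size;         /* window size in bytes */
    uint64_t storage_base; /* corresponding address in the storage component */
};

/* Translate a core address; returns 0 on success, -1 if unmapped. */
static int translate(const struct map_entry *map, int n,
                     uint64_t core_addr, uint64_t *storage_addr)
{
    for (int i = 0; i < n; i++) {
        if (core_addr >= map[i].core_base &&
            core_addr < map[i].core_base + map[i].size) {
            *storage_addr = map[i].storage_base + (core_addr - map[i].core_base);
            return 0;
        }
    }
    return -1;
}

int main(void)
{
    struct map_entry map[] = {
        { 0x100000000ull, 1ull << 30, 0x0ull }, /* one 1 GiB window */
    };
    uint64_t s;
    if (translate(map, 1, 0x100000040ull, &s) == 0)
        printf("storage address: 0x%llx\n", (unsigned long long)s);
    return 0;
}
```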
In embodiments of the present invention, the accelerator card may be, for example, a graphics processing unit (GPU) card.
The external interfaces of an accelerator card typically include a first interface for connecting to a server host and inter-card interconnect connectors for connecting to other accelerator cards. The first interface is typically a PCIe interface inserted, in the form of a golden finger, into a first slot (a PCIe slot) of the server host. The inter-card interconnect connector may be a high-speed inter-card interconnect connector such as NVLink. Different accelerator cards in the same server can be directly interconnected through the inter-card interconnect connectors, and accelerator cards of different servers can be interconnected through the inter-card interconnect connectors and a switching controller matched with NVLink.
In the embodiment of the present invention, the first connector for connecting the storage component may be an inter-card interconnect connector of the accelerator card, and the storage controller may be the cross-board card forwarding control module corresponding to that connector. That is, one or more inter-card interconnect connectors of the accelerator card may be configured as first connectors for connecting storage components, and the cross-board card forwarding control module originally used to control those connectors for the inter-card interconnect function may be configured as the storage controller.
In other optional implementations of the embodiments of the present invention, if an inter-card interconnect connector of the accelerator card is used as the first connector for connecting the storage component, the storage controller may instead be additionally deployed based on hardware resources of the computing processor of the accelerator card, and the connection between the inter-card interconnect connector and the cross-board card forwarding control module may be changed to a connection to that storage controller.
In the embodiment of the present invention, the first connector for connecting the storage component may also use first pins in a first interface of the accelerator card, where the first interface is the interface through which the accelerator card connects to the server host. The storage controller is then connected to the storage component through the first pins, on-board wiring of the server host, and a first slot of the server host, the first slot being used for installing the storage component. When the first interface has enough channels, it can be configured into two lanes: one lane is still used for connecting to the server host, and the other lane is used for connecting, through the on-board wiring of the server host, to the first slot in which the storage component is installed, so that the storage controller is connected to the storage component via the first connector.
Therefore, by configuring the external interface resources of the accelerator card as the first connector for connecting the storage component, and by implementing the storage controller based on the computing processor of the accelerator card, the problem of low computing power utilization caused by the mismatch between the computing power of the accelerator card and its memory resources can be solved.
In the embodiment of the present invention, the processor core may also be obtained by programming an accelerator card instance based on the logic circuit of a first controller. The first controller may be a programmable controller, such as a field programmable gate array (FPGA), or another type of programmable controller.
The logic circuit of the programmable controller provides hardware resources: an accelerator card instance can be built from open-source accelerator card code to serve as the processor core, and other hardware resources on the first controller, such as serial-parallel conversion channels (SerDes), Compute Express Link controller (CXL IP) resources, and storage controller resources, can be used to mount a storage component of larger capacity. The high-computing-power accelerator card instance is thus directly attached to the expanded memory, which solves the problem in the related art of low computing power utilization caused by the mismatch between the computing power of the accelerator card and its memory resources.
As shown in Fig. 1, the processor core and the storage controller may be provided in a computing processor 101.
The computing processor 101 may include one or more processor cores (processor core 1, processor core 2, ..., processor core N, as shown in Fig. 2).
If the computing processor 101 includes multiple processor cores, the multiple processor cores may share the memory resources of the storage component, or the memory resources may be divided into portions corresponding to different processor cores; that is, the storage component may include memory resources that are shared by, or exclusive to, individual processor cores.
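Where the memory resources are divided among processor cores, the bookkeeping can be as simple as the following C sketch (equal division among cores is an assumption for illustration; the embodiment also permits fully shared or mixed resources):

```c
#include <stdint.h>
#include <stdio.h>

/* Divide a storage component of `total` bytes equally among `n` processor
 * cores; returns the base offset of core i's exclusive region. */
static uint64_t core_region_base(uint64_t total, int n, int i)
{
    return (total / n) * (uint64_t)i;
}

int main(void)
{
    uint64_t total = 1ull << 34;   /* e.g. a 16 GiB storage component */
    int n = 4;                     /* processor cores 1..N with N = 4 */
    for (int i = 0; i < n; i++)
        printf("core %d: base 0x%llx, size 0x%llx\n", i + 1,
               (unsigned long long)core_region_base(total, n, i),
               (unsigned long long)(total / n));
    return 0;
}
```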
As shown in Fig. 1, the accelerator card 100 may include a first golden finger interface 109, where the first golden finger interface 109 may be a 16-lane fifth-generation PCIe (PCIe Gen5 x16) standard golden finger interface, and the accelerator card 100 is inserted into a PCIe slot of the server host 200 through the first golden finger interface 109. If the first connector 104 is implemented by pins of the first interface, those pins are pins of the first golden finger interface 109.
In the embodiment of the present invention, a connector of the accelerator card 100 used for inter-card interconnection may be described as a second connector.
The second connector may include a third connector 105 for interconnecting different accelerator cards 100 within the same server; different accelerator cards 100 within the same server may be interconnected through the third connector 105 and a cable.
The third connector 105 may be an NVLink connector, so that different accelerator cards 100 in the same server can be directly connected through the third connector 105.
The third connector 105 may also be a Multi-Channel Input/Output (MCIO) connector, and the cable may be a PCIe bus cable. The third connector 105 of the accelerator card 100 may be connected to a switching controller through the cable, and interconnection among accelerator cards 100, and between an accelerator card 100 and the server host 200, is achieved via the switching controller.
In other alternative implementations of embodiments of the present invention, at least one thread of at least one processor core runs a Root Complex (RC) bridge driver to configure the storage component as an Endpoint (EP) device. On this basis, the processor core can actively access the storage component of another accelerator card 100 through the direct connection between the accelerator cards 100, without going through the server host 200, a switching controller, or a switch card, thereby further shortening the access path and reducing access latency. When the third connector 105 is a Multi-Channel I/O (MCIO) connector, each MCIO x4 interface contains 4 pairs of high-speed serial-parallel conversion (SerDes) channels, and each channel supports a bandwidth of 50 Gb/s, so each MCIO x4 interface provides a unidirectional bandwidth of 200 Gb/s and a bidirectional bandwidth of 400 Gb/s, which satisfies the requirement of direct high-bandwidth communication between cards. Any two cards in a node can communicate directly through the multi-channel link technology (MC-link), which facilitates system integration, needs no switch or adapter card, and has high-bandwidth, low-latency characteristics.
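The bandwidth figures above follow from simple arithmetic, reproduced as a sketch below using only the lane count and per-lane rate stated in this paragraph:

```c
#include <stdio.h>

int main(void)
{
    const int lanes_per_mcio_x4 = 4; /* 4 pairs of SerDes channels */
    const int gbps_per_lane = 50;    /* 50 Gb/s per channel */

    int unidirectional = lanes_per_mcio_x4 * gbps_per_lane; /* 200 Gb/s */
    int bidirectional = 2 * unidirectional;                 /* 400 Gb/s */

    printf("MCIO x4: %d Gb/s unidirectional, %d Gb/s bidirectional\n",
           unidirectional, bidirectional);
    return 0;
}
```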
As shown in Fig. 1, the second connector of the accelerator card 100 for inter-card interconnection may further include an optical module 106. Through the optical module 106, the accelerator card 100 may access a unified address management server 400 to report its local storage resources, receive the supernode unified address space allocated by the unified address management server 400 for the plurality of accelerator cards 100, and initialize a Unified Address Translation Table (UATT), denoted as a second mapping relation, based on the allocated supernode unified address space and the address space of the local processor core. Thus, based on the global unified address, interconnection among accelerator cards 100 in the same server and across servers can be realized, and the server host 200 can access the accelerator cards 100 based on the global unified address. The optical module 106 may be a 400G optical network module that accesses the optical network through the optical module interface shown in Fig. 2.
In the embodiment of the present invention, the optical module 106 may also reuse an inter-card interconnect connector of the accelerator card: one or more inter-card interconnect connectors of the accelerator card 100 may be configured as the optical module 106, and the cross-board card forwarding control module originally used to control the inter-card interconnect connectors for the inter-card interconnect function may be configured as the supernode forwarding module provided in the embodiment of the present invention, or an additional supernode forwarding module may be configured based on the hardware resources of the computing processor 101 of the accelerator card 100, so that a larger communication bandwidth is achieved for inter-card interconnection across servers.
In the embodiment of the present invention, at least one thread of at least one processor core of the accelerator card 100 runs the RC bridge driver; different accelerator cards 100 in the same server may be directly interconnected through the first connector 104 and a cable, and accelerator cards 100 in different servers may be interconnected point-to-point through the optical modules 106, or many-to-many through the optical modules 106 and an Ethernet switch.
In the embodiment of the present invention, the processor core executing the read-write task on the storage component based on the first mapping relation may include: the processor core reports the storage resource information of the accelerator card 100 to the unified address management server 400; receives the supernode unified address space allocated by the unified address management server 400 for the accelerator card 100; initializes the second mapping relation between the supernode unified address space and the address space of the processor core; and executes the read-write task on the storage component based on the second mapping relation and the first mapping relation.
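Purely as a sketch of how the two mapping relations chain together (structure names and example addresses are hypothetical), a unified supernode address can first be translated through the second mapping relation into the processor core's address space and then through the first mapping relation into a storage component address:

```c
#include <stdint.h>
#include <stdio.h>

/* Second mapping relation (hypothetical layout): one window of the
 * supernode unified address space mapped onto the core's address space. */
struct uatt_entry {
    uint64_t unified_base; /* base in the supernode unified address space */
    uint64_t size;
    uint64_t core_base;    /* base in the processor core's address space */
};

/* First mapping relation (hypothetical): core address -> storage address. */
struct core_map_entry {
    uint64_t core_base;
    uint64_t size;
    uint64_t storage_base;
};

/* Chain the two mappings: unified -> core -> storage address.
 * Returns 0 on success, -1 if either lookup misses. */
static int unified_to_storage(const struct uatt_entry *u,
                              const struct core_map_entry *m,
                              uint64_t unified, uint64_t *storage)
{
    if (unified < u->unified_base || unified >= u->unified_base + u->size)
        return -1;
    uint64_t core = u->core_base + (unified - u->unified_base);
    if (core < m->core_base || core >= m->core_base + m->size)
        return -1;
    *storage = m->storage_base + (core - m->core_base);
    return 0;
}

int main(void)
{
    struct uatt_entry u = { 0x8000000000000000ull, 1ull << 30, 0x100000000ull };
    struct core_map_entry m = { 0x100000000ull, 1ull << 30, 0x0ull };
    uint64_t s;
    if (unified_to_storage(&u, &m, 0x8000000000000100ull, &s) == 0)
        printf("storage address: 0x%llx\n", (unsigned long long)s);
    return 0;
}
```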
As shown in Fig. 1, the accelerator card 100 may further include a management controller 107, where the management controller 107 is connected to the processor core of the accelerator card 100 and is used for monitoring the status of the components of the accelerator card. The management controller 107 may be connected to the computing processor 101, a power module 108, and other components of the accelerator card 100, and is configured to monitor the operating state of each component and perform fault detection, fault recording, fault reporting, and so on. The management controller 107 may be a microcontroller (Micro Controller Unit, MCU) that collects information such as the voltage, current, and temperature of the components on the accelerator card 100 via sensors.
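As a rough illustration of such monitoring (sensor names and limits are hypothetical; a real MCU would read its sensors over a management bus rather than from constants), threshold-based fault detection might look like:

```c
#include <stdio.h>

/* Hypothetical sensor reading collected by the MCU via its sensors. */
struct sample { const char *name; double value; double limit; };

/* Fault detection: compare each reading against its limit and report. */
static void check(const struct sample *s)
{
    if (s->value > s->limit)
        printf("FAULT: %s = %.1f exceeds limit %.1f\n",
               s->name, s->value, s->limit);
    else
        printf("OK:    %s = %.1f\n", s->name, s->value);
}

int main(void)
{
    struct sample samples[] = {
        { "voltage (V)",     12.1, 12.6 },
        { "current (A)",     30.0, 40.0 },
        { "temperature (C)", 92.0, 85.0 },  /* over limit: reported */
    };
    for (int i = 0; i < 3; i++)
        check(&samples[i]);
    return 0;
}
```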
The accelerator card provided by the embodiment of the present invention includes a processor core, a storage controller, a first connector, and a storage component, where the storage controller is arranged between the processor core and the first connector, and the first connector is further connected to the storage component. The processor core is configured to access the storage component through the storage controller via the first connector and to establish a first mapping relation between the address space of the processor core and the address of the storage component, so as to execute read-write tasks on the storage component based on the first mapping relation; the storage controller responds to the read-write tasks by executing read-write operations on the storage component. In distributed computing, this improves the utilization of the computing power resources of a single card, reduces the communication volume among accelerator cards, relieves the communication bottleneck, and improves distributed computing efficiency.
Based on the above embodiments, as shown in Fig. 1, in the accelerator card 100 provided in the embodiment of the present invention, the storage controller may include a memory access controller (not shown in Fig. 1), and the storage component may be memory granules 102.
The memory granules 102 may be in the form of Double Data Rate (DDR) memory, in the form of dual in-line memory modules (DIMMs), in the form of multiple memory granules 102 integrated on a circuit board, in the form of registered DIMMs (RDIMMs), unbuffered DIMMs (UDIMMs), or mini DIMMs (Mini-DIMMs), or memory granules 102 in other package forms.
As shown in Fig. 2, a memory access controller is provided in the computing processor 101 for interfacing with the memory granules 102. The memory access controller is used for parsing and forwarding access request information and access response information of the storage component. The access request information may include an access type and an access address.
To further expand the storage resources of the accelerator card 100, in the accelerator card 100 provided by the embodiment of the present invention, the storage controller may include a memory access controller and a storage access controller, and the storage component may include memory granules 102 and a first nonvolatile storage device 103. The memory access controller is further connected to the memory granules 102 through the corresponding first connector 104; a first end of the storage access controller is connected to the memory access controller, and a second end of the storage access controller is connected to the first nonvolatile storage device 103 through the corresponding first connector 104.
The memory access controller parsing and forwarding the access request information and the access response information of the storage component may include: the memory access controller parses the access request information; if the access address in the access request information hits the memory, the corresponding access response information is read from the memory granules 102 and returned; if the access address does not hit the memory, the access request information is forwarded to the storage access controller, and the response data in the access response information returned by the storage access controller is cached in the memory granules 102. The storage access controller is configured to read the corresponding response data from the first nonvolatile storage device 103 according to the access address in the access request information sent by the memory access controller, and to return the response data. Thus, the first nonvolatile storage device 103 can be used to expand larger storage resources for the accelerator card 100, and the processing efficiency of access tasks can be improved based on this tiered memory-storage architecture.
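A minimal C sketch of this hit/miss flow follows; the names, the direct-mapped placement of lines in the memory granules, and the toy capacities are assumptions for illustration only:

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define LINE_SIZE 64
#define NUM_LINES 4   /* toy capacity for the memory-granule cache */

struct cache_line {
    int      valid;
    uint64_t tag;              /* storage address of the cached line */
    uint8_t  data[LINE_SIZE];
};

static struct cache_line mem_granule[NUM_LINES]; /* stands in for granules 102 */
static uint8_t nv_device[1 << 12];               /* stands in for device 103 */

/* Storage access controller: fetch a line from the nonvolatile device
 * (toy bounds; no wraparound handling). */
static void storage_read(uint64_t addr, uint8_t *out)
{
    memcpy(out, &nv_device[addr % sizeof nv_device], LINE_SIZE);
}

/* Memory access controller: serve from the granules on a hit, otherwise
 * forward to the storage access controller and cache the response. */
static void mem_access_read(uint64_t addr, uint8_t *out)
{
    uint64_t line = addr / LINE_SIZE * LINE_SIZE;
    struct cache_line *c = &mem_granule[(line / LINE_SIZE) % NUM_LINES];
    if (!(c->valid && c->tag == line)) {   /* miss: go to device 103 */
        storage_read(line, c->data);
        c->tag = line;
        c->valid = 1;
    }
    memcpy(out, &c->data[addr - line], 1); /* return one byte for brevity */
}

int main(void)
{
    nv_device[0x100] = 0xAB;
    uint8_t v;
    mem_access_read(0x100, &v);            /* miss, then cached */
    mem_access_read(0x100, &v);            /* hit */
    printf("read 0x%02X\n", v);
    return 0;
}
```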
In the embodiment of the present invention, the first nonvolatile storage device 103 may be, but is not limited to, a solid state disk (such as a Non-Volatile Memory Express solid state disk, NVMe SSD), a flash memory (NAND flash memory), an electrically erasable programmable read-only memory (EEPROM), or another type of nonvolatile storage device.
The accelerator card 100 provided in the embodiment of the present invention can directly attach the first nonvolatile storage device 103, such as an NVMe solid state disk, through the first connector 104, which reduces the coupling between the accelerator card 100 and the server when using expanded memory, reduces the latency of the processor core of the accelerator card 100 accessing the expanded memory, and increases the local storage capacity of the accelerator card 100.
The above embodiments have described that the storage component may act as shared storage for multiple processor cores. If the memory granules 102 are used as the shared memory of multiple processor cores, as shown in Fig. 2, the storage controller may further include a shared memory controller disposed between the processor cores and the memory access controller, where the shared memory controller is configured to perform a first data movement task between the server host 200 and the accelerator card 100 and/or a second data movement task between different processor cores in the accelerator card 100.
The shared memory controller can realize shared memory based on the Compute Express Link memory (CXL.mem) protocol and can also serve as a DMA controller; an expanded DDR4 DIMM controller can be mounted, and the DMA controller and the memory can cache part of the data in the NVMe solid state disk. The DMA controller is responsible for data movement between the host and the accelerator card 100, data movement within the accelerator card 100, and RDMA data movement over the network.
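As a sketch of the data movement tasks the DMA controller is responsible for (descriptor layout and names are hypothetical; in real hardware the copy is offloaded to the DMA engine rather than performed by memcpy), the bookkeeping might be:

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Hypothetical classification matching the two task types in the text. */
enum dma_kind { HOST_TO_CARD, CORE_TO_CORE };

struct dma_desc {
    enum dma_kind kind;
    void   *src;
    void   *dst;
    size_t  len;
};

/* Toy "DMA engine": executes one descriptor to show the semantics. */
static void dma_execute(const struct dma_desc *d)
{
    memcpy(d->dst, d->src, d->len);
}

int main(void)
{
    uint8_t host_buf[16] = "host payload";
    uint8_t card_buf[16] = {0};
    struct dma_desc d = { HOST_TO_CARD, host_buf, card_buf, sizeof host_buf };
    dma_execute(&d);
    printf("card received: %s\n", (char *)card_buf);
    return 0;
}
```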
On the basis of the above embodiments, as shown in Fig. 2, the accelerator card 100 may further include a port routing forwarding module, a cross-board card forwarding module, and a second connector. The port routing forwarding module is disposed between the processor core and the storage controller and is further connected to the cross-board card forwarding module; the cross-board card forwarding module is connected to another accelerator card 100 through the second connector and is used for forwarding data packets between the accelerator card 100 and the other accelerator card 100; and the port routing forwarding module is used for performing format conversion of the data packets between the accelerator card 100 and the other accelerator card 100.
In the embodiment of the present invention, the port routing forwarding module and the cross-board card forwarding module may be disposed in the computing processor 101 and implemented by programming the logic circuit of the first controller.
The port routing forwarding module may be configured to convert a data packet based on a unified address, used to access the memory of another local accelerator card 100, into a data packet based on a port ID and an offset address, and send the converted packet to the cross-board card forwarding module.
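Purely for illustration, this conversion can be sketched in C using the unified-address bit layout described later in this text (bit [63] flag, node ID in bits [62:56], card bits in [55:48], 48-bit offset); the direct card-bits-to-port-ID mapping is an assumption:

```c
#include <stdint.h>
#include <stdio.h>

/* Converted packet carrying a port ID and an offset address. */
struct fwd_packet {
    uint8_t  port_id; /* which inter-card port to forward on */
    uint64_t offset;  /* offset within the target card's memory */
};

static int convert(uint64_t unified_addr, struct fwd_packet *out)
{
    if (!(unified_addr >> 63))                   /* bit [63]: not unified */
        return -1;
    uint8_t card = (unified_addr >> 48) & 0xFF;  /* bits [55:48] */
    out->port_id = card;                         /* hypothetical card->port map */
    out->offset  = unified_addr & ((1ull << 48) - 1);
    return 0;
}

int main(void)
{
    struct fwd_packet p;
    uint64_t addr = (1ull << 63) | (1ull << 48) | 0x1000; /* card 1, off 0x1000 */
    if (convert(addr, &p) == 0)
        printf("port %u, offset 0x%llx\n", p.port_id,
               (unsigned long long)p.offset);
    return 0;
}
```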
In the embodiment of the present invention, the cross-board card forwarding module may be an intra-node forwarding module, and the intra-node forwarding module is connected to another accelerator card 100 of the server through the second connector and a cable.
In the embodiment of the present invention, the intra-node forwarding module, that is, the intra-node forwarding media access control (MAC) module, may be used to forward data packets, based on the port ID, that access the memory of another local accelerator card 100. For example, between two interconnected accelerator cards 100 in one server, the memory of one accelerator card 100 may be accessed through the local intra-node forwarding module.
In the embodiment of the present invention, the cross-board card forwarding module may be a supernode forwarding module, and the supernode forwarding module is connected to another accelerator card 100 of another server through the second connector.
In the embodiment of the present invention, the supernode forwarding module, that is, the supernode forwarding MAC module, may be configured to send and receive, through the optical module 106, data packets that access the local memory.
As shown in Fig. 2, in the embodiment of the present invention, the second connector corresponding to the supernode forwarding module may be the optical module 106, and the accelerator card 100 further includes a remote direct memory access protocol stack module disposed between the supernode forwarding module and the storage controller. For the optical module 106, reference may be made to the description of the above embodiments. In the embodiment of the present invention, the remote direct memory access protocol stack module may be used to run a Remote Direct Memory Access (RDMA) protocol based on Ethernet, to handle Direct Memory Access (DMA) requests from the network that access the expanded memory of the accelerator card 100.
In the embodiment of the present invention, in order to realize many-to-many interconnection of accelerator cards across servers, the supernode forwarding module is connected to another accelerator card of another server through the second connector; specifically, the supernode forwarding module may be connected to an Ethernet switch through the optical module 106 of the accelerator card, and connected, through the Ethernet switch, to the optical module 106 of the other accelerator card of the other server.
The accelerator card 100 provided in the embodiment of the present invention may further include an arbiter disposed between the port routing forwarding module and the storage controller, where the arbiter is configured to arbitrate among multiple access tasks directed at the storage component.
In the embodiment of the present invention, the cross-board card forwarding module may include an intra-node forwarding module and a supernode forwarding module. The intra-node forwarding module is connected to another accelerator card 100 in the server through the corresponding second connector and a cable, and the supernode forwarding module is connected to another accelerator card 100 of another server through the corresponding second connector. The accelerator card 100 further includes a remote direct memory access protocol stack module disposed between the supernode forwarding module and the arbiter. The type of an access task directed at the storage component may be at least one of: an access task of the server host 200 to the storage component, an access task of another accelerator card 100 of the server to the storage component, and an access task of another accelerator card 100 of another server to the storage component.
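A minimal arbiter sketch with one request class per access-task source named above follows; the round-robin policy is an assumption, since the embodiment does not specify an arbitration scheme:

```c
#include <stdio.h>

/* The three access-task sources named above. */
enum source { HOST = 0, CARD_IN_NODE = 1, CARD_CROSS_NODE = 2, NUM_SRC = 3 };

static const char *name[NUM_SRC] = { "host", "card-in-node", "card-cross-node" };

/* pending[s] counts outstanding access tasks from source s. */
static int pending[NUM_SRC] = { 2, 1, 1 };

/* Round-robin arbitration: grant the next source with a pending task. */
static int arbitrate(int *last)
{
    for (int i = 1; i <= NUM_SRC; i++) {
        int s = (*last + i) % NUM_SRC;
        if (pending[s] > 0) {
            pending[s]--;
            *last = s;
            return s;
        }
    }
    return -1; /* no pending tasks */
}

int main(void)
{
    int last = NUM_SRC - 1, s;
    while ((s = arbitrate(&last)) >= 0)
        printf("granted: %s\n", name[s]);
    return 0;
}
```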
The accelerator card 100 provided in the embodiment of the present invention may further include a control register module connected to the processor core, the remote direct memory access protocol stack module, the arbiter, and the storage controller, where the control register module is configured to allow the processor core to perform remote direct memory access register management on the remote direct memory access protocol stack module, the arbiter, and the storage controller.
As shown in Fig. 2, the control register module and the port routing forwarding module may be mounted on the system bus, thereby shortening the communication path to the processor core.
A conventional accelerator card, when communicating with resources outside its server, requires point-to-point communication via an RDMA network card and a PCIe switch (PCIe Switch). The accelerator card 100 provided in the embodiment of the present invention carries an optical module 106 and may access an RDMA over Converged Ethernet version 2 (RoCEv2) network or a standard Ethernet to communicate with other computing nodes or storage nodes; the computing processor 101 of the accelerator card 100 does not need to connect to an RDMA network card through a PCIe Switch as in the conventional scheme. Using the RoCE protocol stack over the optical network, an NVMe solid state disk can be remotely initialized and accessed and used as expanded memory accessible to the processor core of the accelerator card 100, which greatly increases the shared memory capacity and reduces the degree of data-parallel splitting and the data traffic. Multi-node unified memory management is realized based on the 400G optical network and the address management server, such as power-on self-checking and reporting of the local memory size of each node. The 400G optical network of each accelerator card 100 also serves as a redundant backup channel for the direct inter-card channels, providing connectivity and traffic diversion when any direct channel between two cards fails or is congested.
In summary, the accelerator card 100 provided in the embodiment of the present invention may adopt the optical module 106 and 400G optical network interconnection, and supports heterogeneous memory expansion, with expanded memory and NVMe solid-state storage mounted to the processor core of the accelerator card 100. The accelerator card 100 may include the computing processor 101, the first connector 104, a power management module, the management controller 107, the 400G optical module 106, optical cables, a PCIe golden finger, an Open Memory Interface (OMI) memory module, DDR DIMM memory chips, and the like.
If the processor core is implemented by programming the first controller, four accelerator cards 100 can be installed in a single server. Each accelerator card 100 has a golden finger interface supporting the PCIe Gen5 x16 standard and MCIO x4 interfaces supporting the PCIe Gen5 standard; the golden fingers of the four cards are respectively inserted into PCIe slots of the server host 200, and any two cards are connected through a high-speed MCIO x4 cable. The first MCIO x4 interface of each accelerator card 100 is connected to an NVMe solid state disk through a converter. Each MCIO x4 interface contains 4 pairs of high-speed SerDes channels, and each channel supports a bandwidth of 50 Gb/s, so each MCIO x4 interface provides a unidirectional bandwidth of 200 Gb/s and a bidirectional bandwidth of 400 Gb/s, satisfying the requirement of direct high-bandwidth communication between cards.
Fig. 3 is a schematic structural diagram of inter-card interconnection of accelerator cards according to an embodiment of the present invention.
In the following, four accelerator cards 100 (accelerator card 1, accelerator card 2, accelerator card 3, accelerator card 4) provided by the present invention are taken as an example; the server host 200 of a two-socket server includes central processor 0 and central processor 1, and the accelerator cards 100 are interconnected as shown in Fig. 3. The direct inter-card communication of the accelerator cards 100, and the data flows when accessing the local expanded memory, when accessing the expanded memory on other cards in the node, and when remotely accessing expanded memory, are as follows.
The accelerator card 100 is powered on, and the server system loads the driver configuration file to complete initialization. Processor core 1 on each accelerator card 100 starts a thread to load the RC bridge driver, completing link initialization with the expanded-memory EP device, and the inter-board multi-channel link (MC-link) physical layer completes training and connection.
The server system then initiates initialization of the inter-board MC-link MAC layer and a self-checking communication test of the transport layer between board cards in the node, according to the default board card interconnection topology configuration.
The server nodes report the self-checking results and the sizes of their storage resources; the unified address management server 400 starts a unified address management and allocation application and allocates a segment of the supernode unified address space for the storage of each server node, and each node server initializes its Unified Address Translation Table (UATT).
After each server node obtains the unified address space, the unified address space is allocated to the storage component of each accelerator card 100, and a Base Address Mapping Table (BAMT) is initialized. The addresses are 64-bit; the low 48 bits are all 0 and are omitted. Table 1 shows an example of the address [55:48] bitmap of the four accelerator cards 100.
TABLE 1
In the unified address, bits [62:56] represent the IDs of different server nodes, and bit [63] distinguishes between a unified address and a local address.
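Under this layout, decoding a unified address reduces to the bitfield extraction sketched below (which polarity of bit [63] marks a unified address, and the example field values, are assumptions for illustration):

```c
#include <stdint.h>
#include <stdio.h>

/* Field extraction per the layout above: [63] unified/local flag,
 * [62:56] server node ID, [55:48] accelerator card bitmap, [47:0] offset. */
static inline unsigned is_unified(uint64_t a) { return (a >> 63) & 0x1;  }
static inline unsigned node_id(uint64_t a)    { return (a >> 56) & 0x7F; }
static inline unsigned card_bits(uint64_t a)  { return (a >> 48) & 0xFF; }
static inline uint64_t offset(uint64_t a)     { return a & ((1ull << 48) - 1); }

int main(void)
{
    /* Hypothetical example: node 3, card bitmap 0x02, offset 0x2000. */
    uint64_t a = (1ull << 63) | (3ull << 56) | (0x02ull << 48) | 0x2000;
    printf("unified=%u node=%u cards=0x%02X offset=0x%llx\n",
           is_unified(a), node_id(a), card_bits(a),
           (unsigned long long)offset(a));
    return 0;
}
```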
The expanded memory mounted under the accelerator card 100 may be accessed by the local central processor within the server node, by the accelerator card 100 itself, by other accelerator cards 100 within the node, and by other nodes through the optical network.
Meanwhile, the present invention provides a multi-channel interconnection scheme among the accelerator cards 100 within a node and a cross-node interconnection topology, which can realize both horizontal and vertical scaling of computing power clusters and direct communication between devices within a node; the direct communication channels and the network channels are mutually redundant, improving the reliability of the computing system.
The embodiment of the present invention also provides an accelerated computing system, which includes a plurality of interconnected accelerator cards.
The accelerator card includes a processor core, a storage controller, a first connector, and a storage component, where the storage controller is arranged between the processor core and the first connector, and the first connector is further connected to the storage component. The processor core is configured to access the storage component through the storage controller and to establish a first mapping relation between the address space of the processor core and the address of the storage component, so as to execute read-write tasks on the storage component based on the first mapping relation; the storage controller is configured to respond to the read-write tasks by executing read-write operations on the storage component.
For the computing system provided by the embodiment of the present invention, reference may be made to the description of the above embodiments.
In the accelerated computing system provided by the embodiment of the present invention, the accelerator card includes a processor core, a storage controller, a first connector, and a storage component, where the storage controller is arranged between the processor core and the first connector, and the first connector is further connected to the storage component. The processor core is configured to access the storage component through the storage controller via the first connector and to establish a first mapping relation between the address space of the processor core and the address of the storage component, so as to execute read-write tasks on the storage component based on the first mapping relation, and the storage controller responds to the read-write tasks by executing read-write operations on the storage component. The system can use the rich high-speed SerDes, CXL IP resources, and NVMe controller IP of an FPGA chip to mount DDR DIMM memory modules and NVMe SSD solid state disks, realizing direct mounting of TB-level expanded memory to the GPU instance and direct data communication between GPUs through MC-link. This increases the utilization of GPU shared memory, improves the utilization of the computing resources of a single GPU, reduces the volume of data communication among GPUs, relieves the communication bottleneck of large models, shortens training and inference time, and effectively reduces the data center cost of AI computing power.
The embodiment of the present invention also provides a control method of an accelerator card, applied to a processor core of the accelerator card, which may include: accessing a storage component through a storage controller to acquire storage resource information of the storage component; establishing a first mapping relation between the address space of the processor core and the address of the storage component according to the storage resource information; and executing read-write tasks on the storage component based on the first mapping relation, where the storage controller is arranged between the processor core and a first connector of the accelerator card, and the first connector is further connected to the storage component.
For the control method of the accelerator card provided by the embodiment of the present invention, reference may be made to the description of the above embodiments.
An embodiment of the present invention also provides an electronic device including a memory and a processor, where a computer program is stored in the memory, and the processor is configured to run the computer program to perform the steps of the control method of the accelerator card in any of the above embodiments.
An embodiment of the present invention also provides a computer-readable storage medium having a computer program stored therein, where the computer program is configured to perform, when executed, the steps of the control method of the accelerator card in any of the above embodiments.
In an exemplary embodiment, the computer-readable storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media capable of storing a computer program.
The embodiment of the present invention also provides a computer program product, including a computer program, where the computer program, when executed by a processor, implements the steps of the control method of the accelerator card in any of the above embodiments.
Embodiments of the present invention also provide another computer program product, including a non-volatile computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the steps of the control method of the accelerator card in any of the above embodiments.
Those of skill would further appreciate that the various illustrative units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, the various illustrative units and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The accelerator card, the control method, and the accelerated computing system provided by the present invention have been described in detail above. The principles and embodiments of the present invention have been described herein with reference to specific examples, and the description of these examples is intended only to facilitate an understanding of the method of the present invention and its core ideas. It should be noted that it will be apparent to those skilled in the art that the present invention may be modified and improved without departing from the spirit of the present invention, and such modifications and improvements also fall within the scope of the present invention.