| Year created | 2014 |
|---|---|
| Created by |
Coherent Accelerator Processor Interface (CAPI) is a high-speed processor expansion bus standard for use in large data center computers, initially designed to be layered on top of PCI Express, for directly connecting central processing units (CPUs) to external accelerators like graphics processing units (GPUs), ASICs, FPGAs or fast storage.[1][2] It offers low-latency, high-speed, direct memory access connectivity between devices of different instruction set architectures.
The performance scaling traditionally associated with Moore's Law, which dates back to 1965, began to taper off around 2004, as both Intel's Prescott architecture and IBM's Cell processor pushed toward a 4 GHz operating frequency. Both projects ran into a thermal scaling wall: heat extraction problems associated with further increases in operating frequency largely outweighed the gains from shorter cycle times.
Over the decade that followed, few commercial CPU products exceeded 4 GHz. Most performance improvements instead came from incrementally improved microarchitectures, better systems integration, and higher compute density, the latter largely in the form of packing a larger number of independent cores onto the same die, often at the expense of peak operating frequency. Intel's 24-core Xeon E7-8890 from June 2016, for example, has a base operating frequency of just 2.2 GHz, so as to operate within the constraints of a single-socket 165 W power consumption and cooling budget.
Where large performance gains have been realized, they were often associated with increasingly specialized compute units, such as GPU units added to the processor die, or external GPU- or FPGA-based accelerators. In many applications, accelerators struggle with limitations of the interconnect's performance (bandwidth and latency) or with limitations due to the interconnect's architecture (such as lacking memory coherence). Especially in the datacenter, improving the interconnect became paramount in moving toward a heterogeneous architecture in which hardware becomes increasingly tailored to specific compute workloads.
CAPI was developed to enable computers to attach specialized accelerators more easily and efficiently. Memory-intensive and computation-intensive work such as matrix multiplication for deep neural networks can be offloaded onto CAPI-supported platforms.[3] It was designed by IBM for use in its POWER8-based systems, which came to market in 2014. At the same time, IBM and several other companies founded the OpenPOWER Foundation to build an ecosystem around Power-based technologies, including CAPI. In October 2016, several OpenPOWER partners formed the OpenCAPI Consortium together with GPU and CPU designer AMD and systems designers Dell EMC and Hewlett Packard Enterprise to spread the technology beyond the scope of OpenPOWER and IBM.[4]
On August 1, 2022, OpenCAPI specifications and assets were transferred to the Compute Express Link (CXL) Consortium.[5]
CAPI is implemented as a functional unit inside the CPU, called the Coherent Accelerator Processor Proxy (CAPP), with a corresponding unit on the accelerator called the Power Service Layer (PSL). The CAPP and PSL units act like a cache directory so the attached device and the CPU can share the same coherent memory space, and the accelerator becomes an Accelerator Function Unit (AFU), a peer to other functional units integrated in the CPU.[6][7]
Because the CPU and AFU share the same memory space, low latency and high speeds can be achieved: the CPU does not have to perform memory translations or shuffle data between its main memory and the accelerator's memory space. An application can make use of the accelerator without specific device drivers, as everything is enabled by a general CAPI kernel extension in the host operating system. The CPU and PSL can read and write directly to each other's memories and registers, as demanded by the application.
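The difference between copy-based offload and a shared memory space can be illustrated with a toy model. The sketch below is only an analogy: plain Python buffers stand in for host and accelerator memory, the function names `offload_with_copies` and `offload_shared` are hypothetical, and nothing here calls a real CAPI interface.

```python
# Toy analogy only: Python buffers stand in for host and accelerator memory.
# No real CAPI, PSL, or kernel-extension API is used or implied.

def offload_with_copies(host_buf):
    """Copy-based model: stage data into a separate device buffer and back."""
    bytes_copied = 0
    device_buf = bytearray(host_buf)       # host -> device copy
    bytes_copied += len(device_buf)
    for i in range(len(device_buf)):       # stand-in for accelerator work
        device_buf[i] = (device_buf[i] + 1) % 256
    host_buf[:] = device_buf               # device -> host copy
    bytes_copied += len(device_buf)
    return host_buf, bytes_copied

def offload_shared(host_buf):
    """Shared-memory model: the 'accelerator' works on the host buffer in place."""
    view = memoryview(host_buf)            # a view, not a copy
    for i in range(len(view)):             # same stand-in work, done in place
        view[i] = (view[i] + 1) % 256
    view.release()
    return host_buf, 0                     # no staging copies were needed

data_a = bytearray([1, 2, 255])
data_b = bytearray([1, 2, 255])
result_a, copied_a = offload_with_copies(data_a)
result_b, copied_b = offload_shared(data_b)
print(list(result_a), copied_a)
print(list(result_b), copied_b)
```

Both paths produce the same result, but only the first one moves the data twice. In this analogy the `memoryview` plays the role of the shared coherent address space.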
CAPI is layered on top of PCIe Gen 3, using 16 PCIe lanes, and is an additional functionality for the PCIe slots on CAPI-enabled systems. Usually there are designated CAPI-enabled PCIe slots on such machines. Since there is only one CAPP per POWER8 processor, the number of possible CAPI units is determined by the number of POWER8 processors, regardless of how many PCIe slots there are. In certain POWER8 systems, IBM makes use of dual-chip modules, thus doubling the CAPI capacity per processor socket.
Traditional transactions between a PCIe device and a CPU can take around 20,000 operations, whereas a CAPI attached device will only use around 500, significantly reducing latency, and effectively increasing bandwidth due to decreased operations overhead.[7]
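As a quick check of the scale involved, the two approximate operation counts quoted above can be compared directly. This is plain arithmetic on the article's rounded figures, not a measurement.

```python
# Approximate per-transaction operation counts quoted in the text above.
TRADITIONAL_PCIE_OPS = 20_000
CAPI_OPS = 500

reduction_factor = TRADITIONAL_PCIE_OPS / CAPI_OPS    # how many times fewer operations
fraction_saved = 1 - CAPI_OPS / TRADITIONAL_PCIE_OPS  # share of operations avoided

print(f"{reduction_factor:.0f}x fewer operations, {fraction_saved:.1%} avoided")
```

With these figures, a CAPI-attached device needs roughly one fortieth of the operations of a traditional PCIe transaction.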
The total bandwidth of a CAPI port is determined by the underlying PCIe 3.0 x16 technology, peaking at approximately 16 GB/s, bidirectional.[8]
CAPI-2 is an incremental evolution of the technology, introduced with the IBM POWER9 processor.[8] It runs on top of PCIe Gen 4, which effectively doubles the performance to 32 GB/s. It also introduces some new features, such as support for DMA and atomics from the accelerator.
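The rounded 16 GB/s and 32 GB/s figures follow from the raw PCIe link rates. The sketch below assumes the standard PCIe parameters (8 GT/s per lane for Gen 3, 16 GT/s for Gen 4, 128b/130b line encoding) and computes the raw link bandwidth in one direction, before any protocol overhead. The helper name `pcie_bandwidth_gb_s` is made up for this example.

```python
def pcie_bandwidth_gb_s(transfer_rate_gt_s, lanes, payload_bits=128, encoded_bits=130):
    """Raw one-direction link bandwidth in GB/s after line-encoding overhead."""
    usable_gbit_s = transfer_rate_gt_s * lanes * payload_bits / encoded_bits
    return usable_gbit_s / 8  # bits -> bytes

gen3_x16 = pcie_bandwidth_gb_s(8.0, 16)    # link used by CAPI on POWER8
gen4_x16 = pcie_bandwidth_gb_s(16.0, 16)   # link used by CAPI-2 on POWER9

print(f"PCIe Gen 3 x16: {gen3_x16:.2f} GB/s")
print(f"PCIe Gen 4 x16: {gen4_x16:.2f} GB/s")
```

The results come out slightly below 16 and 32 GB/s respectively, which is where the article's rounded values originate.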
The technology behind OpenCAPI is governed by the OpenCAPI Consortium, founded in October 2016 by AMD, Google, IBM, Mellanox and Micron together with partners Nvidia, Hewlett Packard Enterprise, Dell EMC and Xilinx.[9]
OpenCAPI, formerly New CAPI or CAPI 3.0, is not layered on top of PCIe and will therefore not use PCIe slots. In IBM's CPU POWER9 it will use the Bluelink 25G I/O facility that it shares with NVLink 2.0, peaking at 50 GB/s.[10] OpenCAPI doesn't need the PSL unit (required for CAPI 1 and 2) in the accelerator, as it's not layered on top of PCIe but uses its own transaction protocol.[11]
Planned for a future chip after the general availability of POWER9.[12]
OpenCAPI Memory Interface (OMI) is a serial attached RAM technology based on OpenCAPI, providing a low-latency, high-bandwidth connection for main memory. OMI uses a controller chip on the memory modules, which allows a technology-agnostic approach to what is used on the modules, be it DDR4, DDR5, HBM or storage-class non-volatile RAM. An OMI-based CPU can therefore change RAM type by changing the memory modules.
A serial connection uses less floor space for the interface on the CPU die than a common DDR memory interface, potentially allowing more memory interfaces per die.
OMI is implemented in IBM's Power10 CPU, which has 8 OMI memory controllers on-chip, allowing for 4 TB RAM and 410 GB/s memory bandwidth per processor. Its memory modules, called DDIMMs (Differential Dynamic Memory Module), include an OMI controller and memory buffer, and can address individual memory chips for fault tolerance and redundancy purposes.
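To put the Power10 totals in perspective, they can be divided across the 8 on-chip OMI controllers. The sketch below assumes that bandwidth and capacity are spread evenly over the controllers and that 1 TB is counted as 1024 GB. Neither assumption is stated in the text, so the per-controller values are only indicative.

```python
# Per-processor figures quoted in the text above.
OMI_CONTROLLERS = 8
TOTAL_BANDWIDTH_GB_S = 410
TOTAL_CAPACITY_TB = 4

# Assumption: resources are split evenly; 1 TB is taken as 1024 GB.
bandwidth_per_controller = TOTAL_BANDWIDTH_GB_S / OMI_CONTROLLERS
capacity_per_controller_gb = TOTAL_CAPACITY_TB * 1024 / OMI_CONTROLLERS

print(f"{bandwidth_per_controller} GB/s and "
      f"{capacity_per_controller_gb:.0f} GB per OMI controller")
```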
Microchip Technology manufactures the OMI controller on the DDIMMs. Their SMC 1000 OpenCAPI memory is described as "the next progression in the market adopting serial attached memory."[13]