AMD Instinct™ MI100 microarchitecture
2025-10-20
The following image shows the node-level architecture of a system that comprises two AMD EPYC™ processors and (up to) eight AMD Instinct™ GPUs. The two EPYC processors are connected to each other through AMD Infinity Fabric™, which provides high-bandwidth (up to 18 GT/sec), coherent links such that each processor can access the available node memory as a single shared-memory domain in a non-uniform memory access (NUMA) fashion. In a 2P, or dual-socket, configuration, three AMD Infinity Fabric links are available to connect the processors, and one PCIe Gen 4 x16 link per processor can attach additional I/O devices such as the host adapters for the network fabric.

In a typical node configuration, each processor can host up to four AMD Instinct™ GPUs that are attached using PCIe Gen 4 links at 16 GT/sec, which corresponds to a peak bidirectional link bandwidth of 32 GB/sec. Each hive of four GPUs can participate in a fully connected, coherent AMD Instinct™ fabric that connects the four GPUs using 23 GT/sec AMD Infinity Fabric links that run at a higher frequency than the inter-processor links. This inter-GPU link can be established in certified server systems if the GPUs are mounted in neighboring PCIe slots by installing the AMD Infinity Fabric™ bridge for the AMD Instinct™ GPUs.
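The quoted link rates can be sanity-checked with a back-of-the-envelope calculation. The sketch below uses the raw signaling rate (one bit per transfer per lane) and ignores encoding overhead such as PCIe's 128b/130b scheme, so it is an upper bound rather than an achievable figure:

```python
# Back-of-envelope link bandwidth from transfer rate and lane count.
# Raw signaling rate only; encoding overhead (e.g. 128b/130b) is ignored.
def raw_link_bandwidth_gbs(gt_per_sec: float, lanes: int) -> float:
    """Raw per-direction bandwidth in GB/sec: one bit per transfer per lane."""
    return gt_per_sec * lanes / 8  # 8 bits per byte

print(raw_link_bandwidth_gbs(16, 16))  # PCIe Gen 4 x16 -> 32.0 GB/sec
print(raw_link_bandwidth_gbs(23, 16))  # a 23 GT/sec x16 Infinity Fabric link -> 46.0 GB/sec
```

The x16 lane count for the Infinity Fabric link is an assumption used for illustration; the document only states the 23 GT/sec transfer rate.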
Microarchitecture
The microarchitecture of the AMD Instinct GPUs is based on the AMD CDNA architecture, which targets compute applications such as high-performance computing (HPC) and AI & machine learning (ML) that run on everything from individual servers to the world's largest exascale supercomputers. The overall system architecture is designed for extreme scalability and compute performance.

The above image shows the AMD Instinct GPU with its PCIe Gen 4 x16 link (16 GT/sec, at the bottom) that connects the GPU to (one of) the host processor(s). It also shows the three AMD Infinity Fabric ports that provide high-speed links (23 GT/sec, also at the bottom) to the other GPUs of the local hive.
On the left and right of the floor plan, the High Bandwidth Memory (HBM) attaches via the GPU memory controller. The MI100 generation of the AMD Instinct GPU offers four stacks of HBM generation 2 (HBM2) for a total of 32 GB with a 4,096-bit-wide memory interface. The peak memory bandwidth of the attached HBM2 is 1.2288 TB/sec at a memory clock frequency of 1.2 GHz.
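The peak memory bandwidth follows directly from the bus width and memory clock quoted above, remembering that HBM2 is a double-data-rate memory (two transfers per memory-clock cycle):

```python
# Peak HBM2 bandwidth from bus width and memory clock.
# HBM2 transfers data on both clock edges (double data rate).
def hbm_peak_tbs(bus_width_bits: int, clock_ghz: float) -> float:
    transfers_per_clock = 2  # DDR
    return bus_width_bits * transfers_per_clock * clock_ghz / 8 / 1000  # TB/sec

print(hbm_peak_tbs(4096, 1.2))  # -> 1.2288 TB/sec
```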
The execution units of the GPU are depicted in the above image as Compute Units (CU). There are a total of 120 compute units, physically organized into eight Shader Engines (SE) with fifteen compute units per shader engine. Each compute unit is further subdivided into four SIMD units that process SIMD instructions of 16 data elements per instruction. This enables the CU to process 64 data elements (a so-called "wavefront") at a peak clock frequency of 1.5 GHz. Therefore, the theoretical maximum FP64 peak performance is 11.5 TFLOPS (4 [SIMD units] × 16 [elements per instruction] × 120 [CU] × 1.5 [GHz]).
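The FP64 peak figure above can be reproduced by multiplying out the quoted quantities:

```python
# FP64 peak performance: SIMD units per CU x lanes per SIMD x CUs x clock.
simd_per_cu = 4        # SIMD units per compute unit
lanes_per_simd = 16    # data elements per SIMD instruction
num_cu = 120           # compute units on the MI100
clock_ghz = 1.5        # peak clock frequency

peak_tflops = simd_per_cu * lanes_per_simd * num_cu * clock_ghz / 1000
print(peak_tflops)  # -> 11.52 TFLOPS, quoted as 11.5
```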

The preceding image shows the block diagram of a single CU of an AMD Instinct™ MI100 GPU and summarizes how instructions flow through the execution engines. The CU fetches instructions via a 32 KB instruction cache and moves them forward to execution via a dispatcher. The CU can handle up to ten wavefronts at a time and feed their instructions into the execution unit. The execution unit contains 256 vector general-purpose registers (VGPR) and 800 scalar general-purpose registers (SGPR). The VGPRs and SGPRs are dynamically allocated to the executing wavefronts. A wavefront can access a maximum of 102 scalar registers. Excess scalar-register usage causes register spilling and thus may affect execution performance.
A wavefront can occupy any number of VGPRs from 0 to 256, directly affecting occupancy, that is, the number of concurrently active wavefronts in the CU. For instance, with 119 VGPRs used, only two wavefronts can be active in the CU at the same time. Given the instruction latency of four cycles per SIMD instruction, occupancy should be as high as possible so that the compute unit can improve execution efficiency by scheduling instructions from multiple wavefronts.
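The occupancy limit implied by VGPR usage can be sketched as a simple division of the 256-entry register file by the per-wavefront allocation, capped at the ten wavefronts the CU can track. This is an approximation: real hardware allocates VGPRs in fixed-size granules, so the exact cutoffs can differ slightly.

```python
# Rough VGPR-limited occupancy estimate per 256-entry register file.
# Real hardware rounds allocations up to a granule size; this sketch does not.
def max_active_waves(vgprs_per_wave: int,
                     vgpr_file_size: int = 256,
                     hw_wave_limit: int = 10) -> int:
    if vgprs_per_wave == 0:
        return hw_wave_limit  # no VGPR pressure: only the hardware cap applies
    return min(vgpr_file_size // vgprs_per_wave, hw_wave_limit)

print(max_active_waves(119))  # -> 2, matching the example above
print(max_active_waves(64))   # -> 4
```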
| Computation and Data Type | FLOPS/CLOCK/CU | Peak TFLOPS |
|---|---|---|
| Vector FP64 | 64 | 11.5 |
| Matrix FP32 | 256 | 46.1 |
| Vector FP32 | 128 | 23.1 |
| Matrix FP16 | 1024 | 184.6 |
| Matrix BF16 | 512 | 92.3 |
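The Peak TFLOPS column is the product of FLOPS/clock/CU, the 120 CUs, and the peak engine clock. Note that a clock of 1.502 GHz (an assumption here; the prose rounds it to 1.5 GHz) reproduces every rounded value in the table:

```python
# Reconstruct the Peak TFLOPS column: FLOPS/clock/CU x 120 CUs x peak clock.
# A 1.502 GHz peak engine clock is assumed; it matches the rounded table values.
NUM_CU = 120
PEAK_CLOCK_GHZ = 1.502

flops_per_clock_per_cu = {
    "Vector FP64": 64,
    "Matrix FP32": 256,
    "Vector FP32": 128,
    "Matrix FP16": 1024,
    "Matrix BF16": 512,
}

for name, fpc in flops_per_clock_per_cu.items():
    tflops = fpc * NUM_CU * PEAK_CLOCK_GHZ / 1000
    print(f"{name}: {tflops:.1f} TFLOPS")
```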