Hopper (microarchitecture)

From Wikipedia, the free encyclopedia
GPU microarchitecture designed by Nvidia

Hopper
Launched: September 20, 2022
Designed by: Nvidia
Manufactured by: TSMC
Fabrication process: TSMC N4
Product series: Server/datacenter

Specifications
L1 cache: 256 KB (per SM)
L2 cache: 50 MB
Memory support: HBM3
PCIe support: PCI Express 5.0

Media engine
Encoder(s) supported: NVENC

History
Predecessor: Ampere
Variant: Ada Lovelace (consumer and professional)
Successor: Blackwell

4 Nvidia H100 GPUs

Hopper is a graphics processing unit (GPU) microarchitecture developed by Nvidia. It is designed for datacenters and is used alongside the Lovelace microarchitecture. It is the latest generation of the line of products formerly branded as Nvidia Tesla, now Nvidia Data Centre GPUs.

Named for computer scientist and United States Navy rear admiral Grace Hopper, the Hopper architecture was leaked in November 2019 and officially revealed in March 2022. It improves upon its predecessors, the Turing and Ampere microarchitectures, featuring a new streaming multiprocessor, a faster memory subsystem, and a transformer acceleration engine.

Architecture


The Nvidia Hopper H100 GPU is implemented using the TSMC N4 process with 80 billion transistors. It consists of up to 144 streaming multiprocessors.[1] Due to the increased memory bandwidth provided by the SXM5 socket, the Nvidia Hopper H100 offers better performance when used in an SXM5 configuration than in the typical PCIe socket.[2]

Streaming multiprocessor


The streaming multiprocessors for Hopper improve upon the Turing and Ampere microarchitectures, although the maximum number of concurrent warps per streaming multiprocessor (SM) remains the same as in Ampere, at 64.[3] The Hopper architecture provides a Tensor Memory Accelerator (TMA), which supports bidirectional asynchronous memory transfer between shared memory and global memory.[4] Under TMA, applications may transfer tensors of up to five dimensions. When writing from shared memory to global memory, elementwise reduction and bitwise operators may be used, avoiding registers and SM instructions while enabling users to write warp-specialized code. TMA is exposed through cuda::memcpy_async.[5]
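
To illustrate, the following is a minimal sketch, not taken from the cited sources, of the cuda::memcpy_async interface mentioned above; the kernel name, tile size, and launch geometry are assumptions, and whether a given copy is actually lowered to TMA hardware on sm_90 depends on its size and alignment.

```cuda
#include <cooperative_groups.h>
#include <cuda/barrier>

// Sketch: whole-block asynchronous copy from global to shared memory using
// cuda::memcpy_async with a shared-memory barrier. Assumes 256 threads/block.
__global__ void scale_tiles(const float* in, float* out, float factor) {
    __shared__ float tile[256];
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;

    auto block = cooperative_groups::this_thread_block();
    if (block.thread_rank() == 0) {
        init(&bar, block.size());   // one thread initializes the barrier
    }
    block.sync();

    // Cooperative bulk copy of one tile; completion is tracked by `bar`.
    cuda::memcpy_async(block, tile, in + blockIdx.x * 256,
                       sizeof(float) * 256, bar);
    bar.arrive_and_wait();          // wait until the tile is resident in shared memory

    tile[threadIdx.x] *= factor;    // compute on the staged tile
    block.sync();
    out[blockIdx.x * 256 + threadIdx.x] = tile[threadIdx.x];
}
```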

When parallelizing applications, developers can use thread block clusters. Thread blocks may perform atomics in the shared memory of other thread blocks within their cluster, a capability known as distributed shared memory. Distributed shared memory may be used by an SM simultaneously with the L2 cache; when used to communicate data between SMs, this can exploit the combined bandwidth of distributed shared memory and L2. The maximum portable cluster size is 8, although the Nvidia Hopper H100 can support a cluster size of 16 by setting the cudaFuncAttributeNonPortableClusterSizeAllowed attribute, potentially at the cost of a reduced number of active blocks.[6] With L2 multicasting and distributed shared memory, the required bandwidth for dynamic random-access memory reads and writes is reduced.[7]
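
The sketch below, assuming a CUDA 12 toolchain and an sm_90 target, shows a cluster of two thread blocks communicating through distributed shared memory; the kernel and buffer names are hypothetical.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Sketch: kernel compiled for a fixed cluster of two thread blocks. Each block
// publishes a value in its own shared memory, then reads rank 0's copy through
// distributed shared memory.
__global__ void __cluster_dims__(2, 1, 1) exchange_kernel(int* out) {
    __shared__ int smem[1];
    cg::cluster_group cluster = cg::this_cluster();

    if (threadIdx.x == 0) {
        smem[0] = (int)cluster.block_rank();   // publish this block's rank
    }
    cluster.sync();   // every block in the cluster has written its shared memory

    // Map rank 0's shared memory into this block's address space.
    const int* peer = cluster.map_shared_rank(smem, 0);
    if (threadIdx.x == 0) {
        out[blockIdx.x] = *peer;
    }
    cluster.sync();   // keep shared memory alive until all peers finish reading
}
```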

Hopper features improved single-precision floating-point (FP32) throughput, with twice as many FP32 operations per cycle per SM as its predecessor. Additionally, the Hopper architecture adds support for new instructions, including instructions that accelerate the Smith–Waterman algorithm.[6] Like Ampere, Hopper supports TensorFloat-32 (TF32) arithmetic, and the mapping pattern is identical across the two architectures.[8]

Memory


The Nvidia Hopper H100 supports HBM3 and HBM2e memory up to 80 GB; the HBM3 memory system provides 3 TB/s of bandwidth, an increase of 50% over the Nvidia Ampere A100's 2 TB/s. Across the architecture, the L2 cache capacity and bandwidth were increased.[9]

Hopper allows CUDA compute kernels to use automatic inline compression, including in individual memory allocations, which allows accessing memory at higher bandwidth. This feature does not increase the amount of memory available to the application, because the data (and thus its compressibility) may be changed at any time. The compressor automatically chooses between several compression algorithms.[9]
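
The article does not name the programming interface, but the CUDA driver's virtual memory management API is one way to request a compressible allocation; the following sketch assumes a device that reports generic compression support (queryable via CU_DEVICE_ATTRIBUTE_GENERIC_COMPRESSION_SUPPORTED) and omits error checking.

```cuda
#include <cuda.h>
#include <stdio.h>

// Sketch: request a compressible device allocation with the driver API.
int main() {
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);

    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;
    prop.allocFlags.compressionType = CU_MEM_ALLOCATION_COMP_GENERIC;  // ask for compressible memory

    size_t gran = 0;
    cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);

    CUmemGenericAllocationHandle handle;
    cuMemCreate(&handle, gran, &prop, 0);          // physical allocation

    CUdeviceptr ptr;
    cuMemAddressReserve(&ptr, gran, 0, 0, 0);      // reserve a virtual address range
    cuMemMap(ptr, gran, 0, handle, 0);             // back it with the allocation

    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(ptr, gran, &access, 1);         // enable device read/write

    printf("compressible allocation mapped at 0x%llx\n", (unsigned long long)ptr);

    cuMemUnmap(ptr, gran);
    cuMemRelease(handle);
    cuMemAddressFree(ptr, gran);
    cuCtxDestroy(ctx);
    return 0;
}
```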

The Nvidia Hopper H100 increases the capacity of the combined L1 cache, texture cache, and shared memory to 256 KB. Like its predecessors, it combines the L1 and texture caches into a unified cache designed to act as a coalescing buffer. The attribute cudaFuncAttributePreferredSharedMemoryCarveout may be used to define the carveout of the L1 cache. Hopper also introduces enhancements to NVLink through a new generation with faster overall communication bandwidth.[10]
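
A brief sketch of setting that attribute from the runtime API follows; the kernel is hypothetical, and the carveout value is only a hint to the driver.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel used only to illustrate the attribute.
__global__ void stencil_kernel(float* data) { /* ... */ }

int main() {
    // Prefer the maximum shared-memory carveout of the unified L1/shared
    // storage for this kernel (expressed as a percentage, or via the
    // cudaSharedmemCarveout* enum values).
    cudaFuncSetAttribute(stencil_kernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         cudaSharedmemCarveoutMaxShared);
    stencil_kernel<<<1, 32>>>(nullptr);
    cudaDeviceSynchronize();
    return 0;
}
```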

Memory synchronization domains


Some CUDA applications may experience interference when performing fence or flush operations due to memory ordering. Because the GPU cannot know which writes are guaranteed to be visible and which are visible only by chance timing, it may wait on unnecessary memory operations, slowing down fence and flush operations. For example, when one kernel performs computations in GPU memory while a parallel kernel communicates with a peer, the local kernel will flush its writes, resulting in slower NVLink or PCIe writes. In the Hopper architecture, memory synchronization domains let the GPU narrow the net cast by a fence operation.[11]
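
The article does not show the interface, but CUDA 12 exposes memory synchronization domains as a kernel launch attribute; in this minimal sketch, with hypothetical kernels, the communicating kernel is placed in the remote domain so that its fences do not wait on the compute kernel's purely local writes.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernels used only to illustrate the launch attribute.
__global__ void compute_kernel() { /* local GPU-memory work */ }
__global__ void comm_kernel()    { /* writes destined for NVLink/PCIe peers */ }

int main() {
    cudaLaunchAttribute attr;
    attr.id = cudaLaunchAttributeMemSyncDomain;
    attr.val.memSyncDomain = cudaLaunchMemSyncDomainRemote;

    cudaLaunchConfig_t cfg = {};
    cfg.gridDim  = dim3(1);
    cfg.blockDim = dim3(32);
    cfg.attrs    = &attr;
    cfg.numAttrs = 1;

    compute_kernel<<<1, 32>>>();             // default (local) domain
    cudaLaunchKernelEx(&cfg, comm_kernel);   // remote domain
    cudaDeviceSynchronize();
    return 0;
}
```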

DPX instructions


The Hopper architecture math application programming interface (API) exposes functions in the SM such as __viaddmin_s16x2_relu, which performs the per-halfword operation max(min(a + b, c), 0). In the Smith–Waterman algorithm, __vimax3_s16x2_relu can be used, a three-way maximum followed by a clamp to zero.[12] Similarly, Hopper speeds up implementations of the Needleman–Wunsch algorithm.[13]
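
A minimal sketch of the first intrinsic in use follows; the kernel is hypothetical, and on pre-Hopper architectures the DPX intrinsics are emulated in software rather than hardware-accelerated.

```cuda
#include <cuda_runtime.h>

// Sketch: apply max(min(a+b, c), 0) to two packed signed 16-bit lanes per word,
// the relaxation step that appears in Smith-Waterman-style dynamic programming.
__global__ void dpx_relax(const unsigned int* a, const unsigned int* b,
                          const unsigned int* c, unsigned int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = __viaddmin_s16x2_relu(a[i], b[i], c[i]);  // add, clamp above, ReLU
    }
}
```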

Transformer engine


The Hopper architecture was the first Nvidia architecture to implement the transformer engine.[14] The transformer engine accelerates computations by dynamically reducing them from higher numerical precisions (e.g., FP16) to lower precisions that are faster to perform (e.g., FP8) when the loss in precision is deemed acceptable.[14] The transformer engine is also capable of dynamically allocating bits in the chosen precision to either the mantissa or exponent at runtime to maximize precision.[5]
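
The transformer engine is driven by Nvidia's libraries rather than exposed as a single instruction, but the two FP8 encodings it allocates bits between are available as CUDA types; this sketch (with a hypothetical kernel) shows the precision/range trade-off between them.

```cuda
#include <cuda_fp8.h>   // FP8 types introduced alongside Hopper (CUDA 11.8+)

// Sketch: convert FP32 inputs to the two FP8 encodings.
// E4M3 spends more bits on the mantissa (precision); E5M2 on the exponent (range).
__global__ void fp8_convert(const float* in, __nv_fp8_e4m3* out_e4m3,
                            __nv_fp8_e5m2* out_e5m2, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out_e4m3[i] = __nv_fp8_e4m3(in[i]);  // 4 exponent bits, 3 mantissa bits
        out_e5m2[i] = __nv_fp8_e5m2(in[i]);  // 5 exponent bits, 2 mantissa bits
    }
}
```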

Power efficiency


The SXM5 form factor H100 has a thermal design power (TDP) of 700 watts. Because of its asynchrony, the Hopper architecture can attain high degrees of utilization and thus better performance per watt.[15]

Grace Hopper

Grace Hopper GH200
Designed by: Nvidia
Manufactured by: TSMC
Fabrication process: TSMC 4N
Codename(s): Grace Hopper

Specifications
Compute: GPU: 132 Hopper SMs; CPU: 72 Neoverse V2 cores
Shader clock rate: 1980 MHz
Memory support: GPU: 96 GB HBM3 or 144 GB HBM3e; CPU: 480 GB LPDDR5X

The GH200 combines a Hopper-based H100 GPU with a Grace-based 72-core CPU on a single module. The total power draw of the module is up to 1000 W. The CPU and GPU are connected via NVLink, which provides memory coherence between CPU and GPU memory.[16]

History


In November 2019, a well-known Twitter account posted a tweet revealing that the next architecture after Ampere would be called Hopper, named after computer scientist and United States Navy rear admiral Grace Hopper, one of the first programmers of the Harvard Mark I. The account stated that Hopper would be based on a multi-chip module design, which would result in a yield gain with lower wastage.[17]

During the 2022 Nvidia GTC, Nvidia officially announced Hopper.[18]

In late 2022, due to US regulations that limited the export of chips to the People's Republic of China, Nvidia adapted the H100 chip to the Chinese market as the H800. This model has lower bandwidth than the original H100.[19][20] In late 2023, the US government announced new restrictions on the export of AI chips to China, including the A800 and H800 models.[21]

By 2023, during the AI boom, H100s were in great demand. Larry Ellison of Oracle Corporation said that year that at a dinner with Nvidia CEO Jensen Huang, he and Elon Musk of Tesla, Inc. and xAI "were begging" for H100s, "I guess is the best way to describe it. An hour of sushi and begging".[22]

In January 2024, Raymond James Financial analysts estimated that Nvidia was selling the H100 GPU in the price range of $25,000 to $30,000 each, while on eBay, individual H100s cost over $40,000.[23] As of February 2024, Nvidia was reportedly shipping H100 GPUs to data centers in armored cars.[24]

H100 accelerator and DGX H100


Comparison of accelerators used in DGX:[25][26][27]

Model | Architecture | Socket | FP32 CUDA cores | FP64 cores (excl. tensor) | Mixed INT32/FP32 cores | INT32 cores | Boost clock | Memory clock | Memory bus width | Memory bandwidth | VRAM | Single precision (FP32) | Double precision (FP64) | INT8 (non-tensor) | INT8 dense tensor | INT32 | FP4 dense tensor | FP16 | FP16 dense tensor | bfloat16 dense tensor | TensorFloat-32 (TF32) dense tensor | FP64 dense tensor | Interconnect (NVLink) | GPU | L1 cache | L2 cache | TDP | Die size | Transistor count | Process | Launched
P100 | Pascal | SXM/SXM2 | 3584 | 1792 | N/A | N/A | 1480 MHz | 1.4 Gbit/s HBM2 | 4096-bit | 720 GB/s | 16 GB HBM2 | 10.6 TFLOPS | 5.3 TFLOPS | N/A | N/A | N/A | N/A | 21.2 TFLOPS | N/A | N/A | N/A | N/A | 160 GB/s | GP100 | 1344 KB (24 KB × 56) | 4096 KB | 300 W | 610 mm² | 15.3 B | TSMC 16FF+ | Q2 2016
V100 16GB | Volta | SXM2 | 5120 | 2560 | N/A | 5120 | 1530 MHz | 1.75 Gbit/s HBM2 | 4096-bit | 900 GB/s | 16 GB HBM2 | 15.7 TFLOPS | 7.8 TFLOPS | 62 TOPS | N/A | 15.7 TOPS | N/A | 31.4 TFLOPS | 125 TFLOPS | N/A | N/A | N/A | 300 GB/s | GV100 | 10240 KB (128 KB × 80) | 6144 KB | 300 W | 815 mm² | 21.1 B | TSMC 12FFN | Q3 2017
V100 32GB | Volta | SXM3 | 5120 | 2560 | N/A | 5120 | 1530 MHz | 1.75 Gbit/s HBM2 | 4096-bit | 900 GB/s | 32 GB HBM2 | 15.7 TFLOPS | 7.8 TFLOPS | 62 TOPS | N/A | 15.7 TOPS | N/A | 31.4 TFLOPS | 125 TFLOPS | N/A | N/A | N/A | 300 GB/s | GV100 | 10240 KB (128 KB × 80) | 6144 KB | 350 W | 815 mm² | 21.1 B | TSMC 12FFN | Q3 2017
A100 40GB | Ampere | SXM4 | 6912 | 3456 | 6912 | N/A | 1410 MHz | 2.4 Gbit/s HBM2 | 5120-bit | 1.52 TB/s | 40 GB HBM2 | 19.5 TFLOPS | 9.7 TFLOPS | N/A | 624 TOPS | 19.5 TOPS | N/A | 78 TFLOPS | 312 TFLOPS | 312 TFLOPS | 156 TFLOPS | 19.5 TFLOPS | 600 GB/s | GA100 | 20736 KB (192 KB × 108) | 40960 KB | 400 W | 826 mm² | 54.2 B | TSMC N7 | Q1 2020
A100 80GB | Ampere | SXM4 | 6912 | 3456 | 6912 | N/A | 1410 MHz | 3.2 Gbit/s HBM2e | 5120-bit | 1.52 TB/s | 80 GB HBM2e | 19.5 TFLOPS | 9.7 TFLOPS | N/A | 624 TOPS | 19.5 TOPS | N/A | 78 TFLOPS | 312 TFLOPS | 312 TFLOPS | 156 TFLOPS | 19.5 TFLOPS | 600 GB/s | GA100 | 20736 KB (192 KB × 108) | 40960 KB | 400 W | 826 mm² | 54.2 B | TSMC N7 | Q1 2020
H100 | Hopper | SXM5 | 16896 | 4608 | 16896 | N/A | 1980 MHz | 5.2 Gbit/s HBM3 | 5120-bit | 3.35 TB/s | 80 GB HBM3 | 67 TFLOPS | 34 TFLOPS | N/A | 1.98 POPS | N/A | N/A | N/A | 990 TFLOPS | 990 TFLOPS | 495 TFLOPS | 67 TFLOPS | 900 GB/s | GH100 | 25344 KB (192 KB × 132) | 51200 KB | 700 W | 814 mm² | 80 B | TSMC 4N | Q3 2022
H200 | Hopper | SXM5 | 16896 | 4608 | 16896 | N/A | 1980 MHz | 6.3 Gbit/s HBM3e | 6144-bit | 4.8 TB/s | 141 GB HBM3e | 67 TFLOPS | 34 TFLOPS | N/A | 1.98 POPS | N/A | N/A | N/A | 990 TFLOPS | 990 TFLOPS | 495 TFLOPS | 67 TFLOPS | 900 GB/s | GH100 | 25344 KB (192 KB × 132) | 51200 KB | 1000 W | 814 mm² | 80 B | TSMC 4N | Q3 2023
B100 | Blackwell | SXM6 | N/A | N/A | N/A | N/A | N/A | 8 Gbit/s HBM3e | 8192-bit | 8 TB/s | 192 GB HBM3e | N/A | N/A | N/A | 3.5 POPS | N/A | 7 PFLOPS | N/A | 1.98 PFLOPS | 1.98 PFLOPS | 989 TFLOPS | 30 TFLOPS | 1.8 TB/s | GB100 | N/A | N/A | 700 W | N/A | 208 B | TSMC 4NP | Q4 2024 (expected)
B200 | Blackwell | SXM6 | N/A | N/A | N/A | N/A | N/A | 8 Gbit/s HBM3e | 8192-bit | 8 TB/s | 192 GB HBM3e | N/A | N/A | N/A | 4.5 POPS | N/A | 9 PFLOPS | N/A | 2.25 PFLOPS | 2.25 PFLOPS | 1.2 PFLOPS | 40 TFLOPS | 1.8 TB/s | GB100 | N/A | N/A | 1000 W | N/A | 208 B | TSMC 4NP | Q4 2024 (expected)

References


Citations

  1. ^ Elster & Haugdahl 2022, p. 4.
  2. ^ Nvidia 2023c, p. 20.
  3. ^ Nvidia 2023b, p. 9.
  4. ^ Fujita et al. 2023, p. 6.
  5. ^ a b "Nvidia's Next GPU Shows That Transformers Are Transforming AI". IEEE Spectrum. Retrieved October 23, 2024.
  6. ^ a b Nvidia 2023b, p. 10.
  7. ^ Vishal Mehta (September 2022). CUDA Programming Model for Hopper Architecture. Santa Clara: Nvidia. Retrieved May 29, 2023.
  8. ^ Fujita et al. 2023, p. 4.
  9. ^ a b Nvidia 2023b, p. 11.
  10. ^ Nvidia 2023b, p. 12.
  11. ^ Nvidia 2023a, p. 44.
  12. ^ Tirumala, Ajay; Eaton, Joe; Tyrlik, Matt (December 8, 2022). "Boosting Dynamic Programming Performance Using NVIDIA Hopper GPU DPX Instructions". Nvidia. Retrieved May 29, 2023.
  13. ^ Harris, Dion (March 22, 2022). "NVIDIA Hopper GPU Architecture Accelerates Dynamic Programming Up to 40x Using New DPX Instructions". Nvidia. Retrieved May 29, 2023.
  14. ^ a b Salvator, Dave (March 22, 2022). "H100 Transformer Engine Supercharges AI Training, Delivering Up to 6x Higher Performance Without Losing Accuracy". Nvidia. Retrieved May 29, 2023.
  15. ^ Elster & Haugdahl 2022, p. 8.
  16. ^ "NVIDIA: Grace Hopper Has Entered Full Production & Announcing DGX GH200 AI Supercomputer". AnandTech. May 29, 2023.
  17. ^ Pirzada, Usman (November 16, 2019). "NVIDIA Next Generation Hopper GPU Leaked – Based On MCM Design, Launching After Ampere". Wccftech. Retrieved May 29, 2023.
  18. ^ Vincent, James (March 22, 2022). "Nvidia reveals H100 GPU for AI and teases 'world's fastest AI supercomputer'". The Verge. Retrieved May 29, 2023.
  19. ^ "Nvidia tweaks flagship H100 chip for export to China as H800". Reuters. Archived from the original on November 22, 2023. Retrieved January 28, 2025.
  20. ^ "NVIDIA Prepares H800 Adaptation of H100 GPU for the Chinese Market". TechPowerUp. Archived from the original on September 2, 2023. Retrieved January 28, 2025.
  21. ^ Leswing, Kif (October 17, 2023). "U.S. curbs export of more AI chips, including Nvidia H800, to China". CNBC. Retrieved January 28, 2025.
  22. ^ Fitch, Asa (February 26, 2024). "Nvidia's Stunning Ascent Has Also Made It a Giant Target". The Wall Street Journal. Retrieved February 27, 2024.
  23. ^ Vanian, Jonathan (January 18, 2024). "Mark Zuckerberg indicates Meta is spending billions of dollars on Nvidia AI chips". CNBC. Retrieved June 6, 2024.
  24. ^ Bousquette, Isabelle; Lin, Belle (February 14, 2024). "Armored Cars and Trillion Dollar Price Tags: How Some Tech Leaders Want to Solve the Chip Shortage". The Wall Street Journal. Retrieved May 30, 2024.
  25. ^ Smith, Ryan (March 22, 2022). "NVIDIA Hopper GPU Architecture and H100 Accelerator Announced: Working Smarter and Harder". AnandTech.
  26. ^ Smith, Ryan (May 14, 2020). "NVIDIA Ampere Unleashed: NVIDIA Announces New GPU Architecture, A100 GPU, and Accelerator". AnandTech.
  27. ^ "NVIDIA Tesla V100 tested: near unbelievable GPU power". TweakTown. September 17, 2017.

Works cited


Retrieved from "https://en.wikipedia.org/w/index.php?title=Hopper_(microarchitecture)&oldid=1272487592"