Hopper (microarchitecture)

From Wikipedia, the free encyclopedia
GPU microarchitecture designed by Nvidia

Hopper
Launched: September 20, 2022
Designed by: Nvidia
Manufactured by: TSMC
Fabrication process: TSMC N4
Product series: Server/datacenter

Specifications
L1 cache: 256 KB (per SM)
L2 cache: 50 MB
Memory support: HBM3
PCIe support: PCI Express 5.0

Media engine
Encoder(s) supported: NVENC

History
Predecessor: Ampere
Variant: Ada Lovelace (consumer and professional)
Successor: Blackwell

4 Nvidia H100 GPUs

Hopper is a graphics processing unit (GPU) microarchitecture developed by Nvidia. It is designed for datacenters and is used alongside the Lovelace microarchitecture. It is the latest generation of the line of products formerly branded as Nvidia Tesla, now Nvidia Data Centre GPUs.

Named for computer scientist and United States Navy rear admiral Grace Hopper, the Hopper architecture was leaked in November 2019 and officially revealed in March 2022. It improves upon its predecessors, the Turing and Ampere microarchitectures, featuring a new streaming multiprocessor, a faster memory subsystem, and a transformer acceleration engine.

Architecture


The Nvidia Hopper H100 GPU is implemented using the TSMC N4 process with 80 billion transistors. It consists of up to 144 streaming multiprocessors.[1] Due to the increased memory bandwidth provided by the SXM5 socket, the Nvidia Hopper H100 offers better performance when used in an SXM5 configuration than in the typical PCIe socket.[2]

Streaming multiprocessor


The streaming multiprocessors for Hopper improve upon the Turing and Ampere microarchitectures, although the maximum number of concurrent warps per streaming multiprocessor (SM) remains the same as in Ampere, at 64.[3] The Hopper architecture provides a Tensor Memory Accelerator (TMA), which supports bidirectional asynchronous memory transfer between shared memory and global memory.[4] Under TMA, applications may transfer tensors of up to five dimensions. When writing from shared memory to global memory, elementwise reduction and bitwise operators may be used, avoiding registers and SM instructions while enabling users to write warp-specialized code. TMA is exposed through cuda::memcpy_async.[5]
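
To illustrate, the following is a minimal sketch, not taken from the cited sources, of the cuda::memcpy_async interface mentioned above; the kernel name, tile size, and launch geometry are assumptions, and whether a given copy is actually lowered to TMA hardware on sm_90 depends on its size and alignment.

```cuda
#include <cooperative_groups.h>
#include <cuda/barrier>

// Sketch: whole-block asynchronous copy from global to shared memory using
// cuda::memcpy_async with a shared-memory barrier. Assumes 256 threads/block.
__global__ void scale_tiles(const float* in, float* out, float factor) {
    __shared__ float tile[256];
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;

    auto block = cooperative_groups::this_thread_block();
    if (block.thread_rank() == 0) {
        init(&bar, block.size());   // one thread initializes the barrier
    }
    block.sync();

    // Cooperative bulk copy of one tile; completion is tracked by `bar`.
    cuda::memcpy_async(block, tile, in + blockIdx.x * 256,
                       sizeof(float) * 256, bar);
    bar.arrive_and_wait();          // wait until the tile is resident in shared memory

    tile[threadIdx.x] *= factor;    // compute on the staged tile
    block.sync();
    out[blockIdx.x * 256 + threadIdx.x] = tile[threadIdx.x];
}
```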

When parallelizing applications, developers can use thread block clusters. Thread blocks may perform atomics in the shared memory of other thread blocks within their cluster, a capability known as distributed shared memory. Distributed shared memory may be used by an SM simultaneously with the L2 cache; when used to communicate data between SMs, this can exploit the combined bandwidth of distributed shared memory and L2. The maximum portable cluster size is 8, although the Nvidia Hopper H100 can support a cluster size of 16 by setting the cudaFuncAttributeNonPortableClusterSizeAllowed attribute, potentially at the cost of a reduced number of active blocks.[6] With L2 multicasting and distributed shared memory, the required bandwidth for dynamic random-access memory reads and writes is reduced.[7]
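
The sketch below, assuming a CUDA 12 toolchain and an sm_90 target, shows a cluster of two thread blocks communicating through distributed shared memory; the kernel and buffer names are hypothetical.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Sketch: kernel compiled for a fixed cluster of two thread blocks. Each block
// publishes a value in its own shared memory, then reads rank 0's copy through
// distributed shared memory.
__global__ void __cluster_dims__(2, 1, 1) exchange_kernel(int* out) {
    __shared__ int smem[1];
    cg::cluster_group cluster = cg::this_cluster();

    if (threadIdx.x == 0) {
        smem[0] = (int)cluster.block_rank();   // publish this block's rank
    }
    cluster.sync();   // every block in the cluster has written its shared memory

    // Map rank 0's shared memory into this block's address space.
    const int* peer = cluster.map_shared_rank(smem, 0);
    if (threadIdx.x == 0) {
        out[blockIdx.x] = *peer;
    }
    cluster.sync();   // keep shared memory alive until all peers finish reading
}
```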

Hopper features improved single-precision floating-point (FP32) throughput, with twice as many FP32 operations per cycle per SM as its predecessor. Additionally, the Hopper architecture adds support for new instructions, including instructions that accelerate the Smith–Waterman algorithm.[6] Like Ampere, Hopper supports TensorFloat-32 (TF32) arithmetic, and the mapping pattern is identical across the two architectures.[8]

Memory


The Nvidia Hopper H100 supports HBM3 and HBM2e memory up to 80 GB; the HBM3 memory system provides 3 TB/s of bandwidth, an increase of 50% over the Nvidia Ampere A100's 2 TB/s. Across the architecture, the L2 cache capacity and bandwidth were increased.[9]

Hopper allows CUDA compute kernels to use automatic inline compression, including in individual memory allocations, which allows accessing memory at higher bandwidth. This feature does not increase the amount of memory available to the application, because the data (and thus its compressibility) may be changed at any time. The compressor automatically chooses between several compression algorithms.[9]
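
The article does not name the programming interface, but the CUDA driver's virtual memory management API is one way to request a compressible allocation; the following sketch assumes a device that reports generic compression support (queryable via CU_DEVICE_ATTRIBUTE_GENERIC_COMPRESSION_SUPPORTED) and omits error checking.

```cuda
#include <cuda.h>
#include <stdio.h>

// Sketch: request a compressible device allocation with the driver API.
int main() {
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);

    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;
    prop.allocFlags.compressionType = CU_MEM_ALLOCATION_COMP_GENERIC;  // ask for compressible memory

    size_t gran = 0;
    cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);

    CUmemGenericAllocationHandle handle;
    cuMemCreate(&handle, gran, &prop, 0);          // physical allocation

    CUdeviceptr ptr;
    cuMemAddressReserve(&ptr, gran, 0, 0, 0);      // reserve a virtual address range
    cuMemMap(ptr, gran, 0, handle, 0);             // back it with the allocation

    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(ptr, gran, &access, 1);         // enable device read/write

    printf("compressible allocation mapped at 0x%llx\n", (unsigned long long)ptr);

    cuMemUnmap(ptr, gran);
    cuMemRelease(handle);
    cuMemAddressFree(ptr, gran);
    cuCtxDestroy(ctx);
    return 0;
}
```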

The Nvidia Hopper H100 increases the capacity of the combined L1 cache, texture cache, and shared memory to 256 KB. Like its predecessors, it combines the L1 and texture caches into a unified cache designed to act as a coalescing buffer. The attribute cudaFuncAttributePreferredSharedMemoryCarveout may be used to define the carveout of the L1 cache. Hopper also introduces enhancements to NVLink through a new generation with faster overall communication bandwidth.[10]
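
A brief sketch of setting that attribute from the runtime API follows; the kernel is hypothetical, and the carveout value is only a hint to the driver.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel used only to illustrate the attribute.
__global__ void stencil_kernel(float* data) { /* ... */ }

int main() {
    // Prefer the maximum shared-memory carveout of the unified L1/shared
    // storage for this kernel (expressed as a percentage, or via the
    // cudaSharedmemCarveout* enum values).
    cudaFuncSetAttribute(stencil_kernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         cudaSharedmemCarveoutMaxShared);
    stencil_kernel<<<1, 32>>>(nullptr);
    cudaDeviceSynchronize();
    return 0;
}
```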

Memory synchronization domains


Some CUDA applications may experience interference when performing fence or flush operations due to memory ordering. Because the GPU cannot know which writes are guaranteed to be visible and which are visible only by chance timing, it may wait on unnecessary memory operations, slowing down fence and flush operations. For example, when one kernel performs computations in GPU memory while a parallel kernel communicates with a peer, the local kernel will flush its writes, resulting in slower NVLink or PCIe writes. In the Hopper architecture, memory synchronization domains let the GPU narrow the net cast by a fence operation.[11]
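
The article does not show the interface, but CUDA 12 exposes memory synchronization domains as a kernel launch attribute; in this minimal sketch, with hypothetical kernels, the communicating kernel is placed in the remote domain so that its fences do not wait on the compute kernel's purely local writes.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernels used only to illustrate the launch attribute.
__global__ void compute_kernel() { /* local GPU-memory work */ }
__global__ void comm_kernel()    { /* writes destined for NVLink/PCIe peers */ }

int main() {
    cudaLaunchAttribute attr;
    attr.id = cudaLaunchAttributeMemSyncDomain;
    attr.val.memSyncDomain = cudaLaunchMemSyncDomainRemote;

    cudaLaunchConfig_t cfg = {};
    cfg.gridDim  = dim3(1);
    cfg.blockDim = dim3(32);
    cfg.attrs    = &attr;
    cfg.numAttrs = 1;

    compute_kernel<<<1, 32>>>();             // default (local) domain
    cudaLaunchKernelEx(&cfg, comm_kernel);   // remote domain
    cudaDeviceSynchronize();
    return 0;
}
```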

DPX instructions


The Hopper architecture math application programming interface (API) exposes functions in the SM such as __viaddmin_s16x2_relu, which performs the per-halfword operation max(min(a + b, c), 0). In the Smith–Waterman algorithm, __vimax3_s16x2_relu can be used, a three-way maximum followed by a clamp to zero.[12] Similarly, Hopper speeds up implementations of the Needleman–Wunsch algorithm.[13]
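
A minimal sketch of the first intrinsic in use follows; the kernel is hypothetical, and on pre-Hopper architectures the DPX intrinsics are emulated in software rather than hardware-accelerated.

```cuda
#include <cuda_runtime.h>

// Sketch: apply max(min(a+b, c), 0) to two packed signed 16-bit lanes per word,
// the relaxation step that appears in Smith-Waterman-style dynamic programming.
__global__ void dpx_relax(const unsigned int* a, const unsigned int* b,
                          const unsigned int* c, unsigned int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = __viaddmin_s16x2_relu(a[i], b[i], c[i]);  // add, clamp above, ReLU
    }
}
```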

Transformer engine


The Hopper architecture was the first Nvidia architecture to implement the transformer engine.[14] The transformer engine accelerates computations by dynamically reducing them from higher numerical precisions (e.g., FP16) to lower precisions that are faster to perform (e.g., FP8) when the loss in precision is deemed acceptable.[14] The transformer engine is also capable of dynamically allocating bits in the chosen precision to either the mantissa or exponent at runtime to maximize precision.[5]
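
The transformer engine is driven by Nvidia's libraries rather than exposed as a single instruction, but the two FP8 encodings it allocates bits between are available as CUDA types; this sketch (with a hypothetical kernel) shows the precision/range trade-off between them.

```cuda
#include <cuda_fp8.h>   // FP8 types introduced alongside Hopper (CUDA 11.8+)

// Sketch: convert FP32 inputs to the two FP8 encodings.
// E4M3 spends more bits on the mantissa (precision); E5M2 on the exponent (range).
__global__ void fp8_convert(const float* in, __nv_fp8_e4m3* out_e4m3,
                            __nv_fp8_e5m2* out_e5m2, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out_e4m3[i] = __nv_fp8_e4m3(in[i]);  // 4 exponent bits, 3 mantissa bits
        out_e5m2[i] = __nv_fp8_e5m2(in[i]);  // 5 exponent bits, 2 mantissa bits
    }
}
```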

Power efficiency


The SXM5 form factor H100 has a thermal design power (TDP) of 700 watts. Because of its asynchrony, the Hopper architecture can attain high degrees of utilization and thus better performance per watt.[15]

Grace Hopper

Grace Hopper GH200
Designed by: Nvidia
Manufactured by: TSMC
Fabrication process: TSMC 4N
Codename(s): Grace Hopper

Specifications
Compute: GPU: 132 Hopper SMs; CPU: 72 Neoverse V2 cores
Shader clock rate: 1980 MHz
Memory support: GPU: 96 GB HBM3 or 144 GB HBM3e; CPU: 480 GB LPDDR5X

The GH200 combines a Hopper-based H100 GPU with a Grace-based 72-core CPU on a single module. The total power draw of the module is up to 1000 W. The CPU and GPU are connected via NVLink, which provides memory coherence between CPU and GPU memory.[16]

History


In November 2019, a well-known Twitter account posted a tweet revealing that the next architecture after Ampere would be called Hopper, named after computer scientist and United States Navy rear admiral Grace Hopper, one of the first programmers of the Harvard Mark I. The account stated that Hopper would be based on a multi-chip module design, which would result in a yield gain with lower wastage.[17]

During the 2022 Nvidia GTC, Nvidia officially announced Hopper.[18]

In late 2022, due to US regulations that limited the export of chips to the People's Republic of China, Nvidia adapted the H100 chip to the Chinese market as the H800. This model has lower bandwidth than the original H100.[19][20] In late 2023, the US government announced new restrictions on the export of AI chips to China, including the A800 and H800 models.[21]

By 2023, during the AI boom, H100s were in great demand. Larry Ellison of Oracle Corporation said that year that at a dinner with Nvidia CEO Jensen Huang, he and Elon Musk of Tesla, Inc. and xAI "were begging" for H100s, "I guess is the best way to describe it. An hour of sushi and begging".[22]

In January 2024, Raymond James Financial analysts estimated that Nvidia was selling the H100 GPU in the price range of $25,000 to $30,000 each, while on eBay, individual H100s cost over $40,000.[23] As of February 2024, Nvidia was reportedly shipping H100 GPUs to data centers in armored cars.[24]

H100 accelerator and DGX H100


Comparison of accelerators used in DGX:[25][26][27]

Model | Architecture | Socket | FP32 CUDA cores | FP64 cores (excl. tensor) | Mixed INT32/FP32 cores | INT32 cores | Boost clock | Memory clock | Memory bus width | Memory bandwidth | VRAM | Single precision (FP32) | Double precision (FP64) | INT8 (non-tensor) | INT8 dense tensor | INT32 | FP4 dense tensor | FP16 | FP16 dense tensor | bfloat16 dense tensor | TensorFloat-32 (TF32) dense tensor | FP64 dense tensor | Interconnect (NVLink) | GPU | L1 cache | L2 cache | TDP | Die size | Transistor count | Process | Launched
P100 | Pascal | SXM/SXM2 | 3584 | 1792 | N/A | N/A | 1480 MHz | 1.4 Gbit/s HBM2 | 4096-bit | 720 GB/s | 16 GB HBM2 | 10.6 TFLOPS | 5.3 TFLOPS | N/A | N/A | N/A | N/A | 21.2 TFLOPS | N/A | N/A | N/A | N/A | 160 GB/s | GP100 | 1344 KB (24 KB × 56) | 4096 KB | 300 W | 610 mm² | 15.3 B | TSMC 16FF+ | Q2 2016
V100 16GB | Volta | SXM2 | 5120 | 2560 | N/A | 5120 | 1530 MHz | 1.75 Gbit/s HBM2 | 4096-bit | 900 GB/s | 16 GB HBM2 | 15.7 TFLOPS | 7.8 TFLOPS | 62 TOPS | N/A | 15.7 TOPS | N/A | 31.4 TFLOPS | 125 TFLOPS | N/A | N/A | N/A | 300 GB/s | GV100 | 10240 KB (128 KB × 80) | 6144 KB | 300 W | 815 mm² | 21.1 B | TSMC 12FFN | Q3 2017
V100 32GB | Volta | SXM3 | 5120 | 2560 | N/A | 5120 | 1530 MHz | 1.75 Gbit/s HBM2 | 4096-bit | 900 GB/s | 32 GB HBM2 | 15.7 TFLOPS | 7.8 TFLOPS | 62 TOPS | N/A | 15.7 TOPS | N/A | 31.4 TFLOPS | 125 TFLOPS | N/A | N/A | N/A | 300 GB/s | GV100 | 10240 KB (128 KB × 80) | 6144 KB | 350 W | 815 mm² | 21.1 B | TSMC 12FFN | Q3 2017
A100 40GB | Ampere | SXM4 | 6912 | 3456 | 6912 | N/A | 1410 MHz | 2.4 Gbit/s HBM2 | 5120-bit | 1.52 TB/s | 40 GB HBM2 | 19.5 TFLOPS | 9.7 TFLOPS | N/A | 624 TOPS | 19.5 TOPS | N/A | 78 TFLOPS | 312 TFLOPS | 312 TFLOPS | 156 TFLOPS | 19.5 TFLOPS | 600 GB/s | GA100 | 20736 KB (192 KB × 108) | 40960 KB | 400 W | 826 mm² | 54.2 B | TSMC N7 | Q1 2020
A100 80GB | Ampere | SXM4 | 6912 | 3456 | 6912 | N/A | 1410 MHz | 3.2 Gbit/s HBM2e | 5120-bit | 1.52 TB/s | 80 GB HBM2e | 19.5 TFLOPS | 9.7 TFLOPS | N/A | 624 TOPS | 19.5 TOPS | N/A | 78 TFLOPS | 312 TFLOPS | 312 TFLOPS | 156 TFLOPS | 19.5 TFLOPS | 600 GB/s | GA100 | 20736 KB (192 KB × 108) | 40960 KB | 400 W | 826 mm² | 54.2 B | TSMC N7 | Q1 2020
H100 | Hopper | SXM5 | 16896 | 4608 | 16896 | N/A | 1980 MHz | 5.2 Gbit/s HBM3 | 5120-bit | 3.35 TB/s | 80 GB HBM3 | 67 TFLOPS | 34 TFLOPS | N/A | 1.98 POPS | N/A | N/A | N/A | 990 TFLOPS | 990 TFLOPS | 495 TFLOPS | 67 TFLOPS | 900 GB/s | GH100 | 25344 KB (192 KB × 132) | 51200 KB | 700 W | 814 mm² | 80 B | TSMC 4N | Q3 2022
H200 | Hopper | SXM5 | 16896 | 4608 | 16896 | N/A | 1980 MHz | 6.3 Gbit/s HBM3e | 6144-bit | 4.8 TB/s | 141 GB HBM3e | 67 TFLOPS | 34 TFLOPS | N/A | 1.98 POPS | N/A | N/A | N/A | 990 TFLOPS | 990 TFLOPS | 495 TFLOPS | 67 TFLOPS | 900 GB/s | GH100 | 25344 KB (192 KB × 132) | 51200 KB | 1000 W | 814 mm² | 80 B | TSMC 4N | Q3 2023
B100 | Blackwell | SXM6 | N/A | N/A | N/A | N/A | N/A | 8 Gbit/s HBM3e | 8192-bit | 8 TB/s | 192 GB HBM3e | N/A | N/A | N/A | 3.5 POPS | N/A | 7 PFLOPS | N/A | 1.98 PFLOPS | 1.98 PFLOPS | 989 TFLOPS | 30 TFLOPS | 1.8 TB/s | GB100 | N/A | N/A | 700 W | N/A | 208 B | TSMC 4NP | Q4 2024 (expected)
B200 | Blackwell | SXM6 | N/A | N/A | N/A | N/A | N/A | 8 Gbit/s HBM3e | 8192-bit | 8 TB/s | 192 GB HBM3e | N/A | N/A | N/A | 4.5 POPS | N/A | 9 PFLOPS | N/A | 2.25 PFLOPS | 2.25 PFLOPS | 1.2 PFLOPS | 40 TFLOPS | 1.8 TB/s | GB100 | N/A | N/A | 1000 W | N/A | 208 B | TSMC 4NP | Q4 2024 (expected)

References


Citations

  1. ^ Elster & Haugdahl 2022, p. 4.
  2. ^ Nvidia 2023c, p. 20.
  3. ^ Nvidia 2023b, p. 9.
  4. ^ Fujita et al. 2023, p. 6.
  5. ^ a b "Nvidia's Next GPU Shows That Transformers Are Transforming AI". IEEE Spectrum. Retrieved October 23, 2024.
  6. ^ a b Nvidia 2023b, p. 10.
  7. ^ Vishal Mehta (September 2022). CUDA Programming Model for Hopper Architecture. Santa Clara: Nvidia. Retrieved May 29, 2023.
  8. ^ Fujita et al. 2023, p. 4.
  9. ^ a b Nvidia 2023b, p. 11.
  10. ^ Nvidia 2023b, p. 12.
  11. ^ Nvidia 2023a, p. 44.
  12. ^ Tirumala, Ajay; Eaton, Joe; Tyrlik, Matt (December 8, 2022). "Boosting Dynamic Programming Performance Using NVIDIA Hopper GPU DPX Instructions". Nvidia. Retrieved May 29, 2023.
  13. ^ Harris, Dion (March 22, 2022). "NVIDIA Hopper GPU Architecture Accelerates Dynamic Programming Up to 40x Using New DPX Instructions". Nvidia. Retrieved May 29, 2023.
  14. ^ a b Salvator, Dave (March 22, 2022). "H100 Transformer Engine Supercharges AI Training, Delivering Up to 6x Higher Performance Without Losing Accuracy". Nvidia. Retrieved May 29, 2023.
  15. ^ Elster & Haugdahl 2022, p. 8.
  16. ^ "NVIDIA: Grace Hopper Has Entered Full Production & Announcing DGX GH200 AI Supercomputer". AnandTech. May 29, 2023.
  17. ^ Pirzada, Usman (November 16, 2019). "NVIDIA Next Generation Hopper GPU Leaked – Based On MCM Design, Launching After Ampere". Wccftech. Retrieved May 29, 2023.
  18. ^ Vincent, James (March 22, 2022). "Nvidia reveals H100 GPU for AI and teases 'world's fastest AI supercomputer'". The Verge. Retrieved May 29, 2023.
  19. ^ "Nvidia tweaks flagship H100 chip for export to China as H800". Reuters. Archived from the original on November 22, 2023. Retrieved January 28, 2025.
  20. ^ "NVIDIA Prepares H800 Adaptation of H100 GPU for the Chinese Market". TechPowerUp. Archived from the original on September 2, 2023. Retrieved January 28, 2025.
  21. ^ Leswing, Kif (October 17, 2023). "U.S. curbs export of more AI chips, including Nvidia H800, to China". CNBC. Retrieved January 28, 2025.
  22. ^ Fitch, Asa (February 26, 2024). "Nvidia's Stunning Ascent Has Also Made It a Giant Target". The Wall Street Journal. Retrieved February 27, 2024.
  23. ^ Vanian, Jonathan (January 18, 2024). "Mark Zuckerberg indicates Meta is spending billions of dollars on Nvidia AI chips". CNBC. Retrieved June 6, 2024.
  24. ^ Bousquette, Isabelle; Lin, Belle (February 14, 2024). "Armored Cars and Trillion Dollar Price Tags: How Some Tech Leaders Want to Solve the Chip Shortage". The Wall Street Journal. Retrieved May 30, 2024.
  25. ^ Smith, Ryan (March 22, 2022). "NVIDIA Hopper GPU Architecture and H100 Accelerator Announced: Working Smarter and Harder". AnandTech.
  26. ^ Smith, Ryan (May 14, 2020). "NVIDIA Ampere Unleashed: NVIDIA Announces New GPU Architecture, A100 GPU, and Accelerator". AnandTech.
  27. ^ "NVIDIA Tesla V100 tested: near unbelievable GPU power". TweakTown. September 17, 2017.

Works cited


Retrieved from "https://en.wikipedia.org/w/index.php?title=Hopper_(microarchitecture)&oldid=1272487592"