
Commit 18d1941

zbpatel authored and dc3671 committed

[doc] Update perf_overview.md for release 0.21 (NVIDIA#6270)

Signed-off-by: zpatel <22306219+zbpatel@users.noreply.github.com>

1 parent 03430ed · commit 18d1941

File tree: 1 file changed, +104 −85 lines

docs/source/performance/perf-overview.md

Lines changed: 104 additions & 85 deletions
@@ -12,6 +12,8 @@ Tuning batch sizes, parallelism configurations, and other options may lead to im
 
 For DeepSeek R1 performance, please check out our [performance guide](../blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md)
 
+For more information on benchmarking with `trtllm-bench` see this NVIDIA [blog post](https://developer.nvidia.com/blog/llm-inference-benchmarking-performance-tuning-with-tensorrt-llm/).
+
 ## Throughput Measurements
 
 The below table shows performance data where a local inference client is fed requests at an infinite rate (no delay between messages),
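The throughput figures in the tables below are aggregate tokens-per-second numbers for a given input/output sequence length (ISL/OSL). As an illustration only (the helper below is ours, not part of the document or of `trtllm-bench`, and the inputs are hypothetical), one common definition of such a figure is total generated tokens divided by wall-clock time:

```python
# Illustrative only: how an aggregate output-token throughput figure like
# those in the tables can be computed. All values here are hypothetical.
def output_token_throughput(num_requests: int, osl: int, elapsed_s: float) -> float:
    """Total generated tokens divided by wall-clock benchmark time."""
    return (num_requests * osl) / elapsed_s

# e.g. 1024 requests, each generating OSL=2048 tokens, finishing in 300 s
print(round(output_token_throughput(1024, 2048, 300.0), 2))
```

Note that `trtllm-bench` reports its own measured statistics at the end of a run; this sketch only shows the shape of the metric, not the tool's exact accounting.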
@@ -21,50 +23,64 @@ The performance numbers below were collected using the steps described in this d
 
 Testing was performed on models with weights quantized using [ModelOpt](https://nvidia.github.io/TensorRT-Model-Optimizer/#) and published by NVIDIA on the [Model Optimizer HuggingFace Collection](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4).
 
-### FP4 Models:
-```
+### Hardware
+The following GPU variants were used for testing:
+- H100 SXM 80GB (DGX H100)
+- H200 SXM 141GB (DGX H200)
+- GH200 96GB HBM3 (480GB LPDDR5X)
+- B200 180GB (DGX B200)
+- GB200 192GB (GB200 NVL72)
+
+Other hardware variants may have different TDP, memory bandwidth, core count, or other features leading to performance differences on these workloads.
+
+### FP4 Models
+
+```text
 nvidia/Llama-3.3-70B-Instruct-FP4
 nvidia/Llama-3.1-405B-Instruct-FP4
 ```
 
 #### Llama 3.3 70B FP4
 
-| | GPU | B200 | | | |
-|:------------------------|:--------|:----------|:----------|:----------|:----------|
-| | TP Size | 1 | 2 | 4 | 8 |
-| ISL, OSL | | | | | |
-| | | | | | |
-| 128, 128 | | 10,994.48 | 17,542.11 | 24,667.31 | 27,272.27 |
-| 128, 2048 | | 9,580.46 | 15,432.35 | 23,568.12 | 31,174.31 |
-| 128, 4096 | | 6,418.39 | 9,841.53 | 17,808.76 | 25,229.25 |
-| 500, 2000 | | 7,343.32 | 11,850.57 | 20,709.67 | 28,038.78 |
-| 1000, 1000 | | 6,752.53 | 10,815.88 | 16,413.04 | 20,060.66 |
-| 1000, 2000 | | 6,670.07 | 9,830.73 | 15,597.49 | 20,672.37 |
-| 1024, 2048 | | 6,636.75 | 9,807.13 | 15,519.23 | 20,617.28 |
-| 2048, 128 | | 1,342.17 | 1,989.41 | 3,033.14 | 4,035.64 |
-| 5000, 500 | | 1,429.67 | 2,419.67 | 3,686.84 | 5,182.96 |
-| 20000, 2000 | | 629.77 | 1,177.01 | 2,120.66 | 3,429.03 |
+| | GPU: | B200 | GB200 |
+|:-----------------------------|:---|:----------|:--------------|
+| | TP Size | 1 | 1 |
+| ISL, OSL | | | |
+| | | | |
+| 128, 128 | | 10,613.84 | 11,100.97 |
+| 128, 2048 | | 9,445.51 | 10,276.05 |
+| 128, 4096 | | 6,276.85 | 7,351.12 |
+| 500, 2000 | | 6,983.27 | 8,194.30 |
+| 1000, 1000 | | 6,434.29 | 7,401.80 |
+| 1000, 2000 | | 6,725.03 | 6,478.72 |
+| 1024, 2048 | | 6,546.61 | 7,922.88 |
+| 2048, 128 | | 1,330.35 | 1,418.47 |
+| 2048, 2048 | | 4,528.48 | 5,326.77 |
+| 5000, 500 | | 1,427.44 | 1,502.44 |
+| 20000, 2000 | | 636.36 | 732.43 |
 
 #### Llama 3.1 405B FP4
 
-| | GPU | B200 | |
-|:------------------------|:-------|:---------|:----------|
-| | TP Size | 4 | 8 |
-| ISL, OSL | | | |
-| | | | |
-| 128, 128 | | 6,163.81 | 9,002.90 |
-| 128, 2048 | | 7,081.21 | 10,288.28 |
-| 128, 4096 | | 6,028.37 | 8,713.77 |
-| 500, 2000 | | 5,858.75 | 9,125.86 |
-| 1000, 1000 | | 4,848.00 | 7,582.97 |
-| 1000, 2000 | | 5,375.25 | 7,626.28 |
-| 1024, 2048 | | 5,345.70 | 7,464.03 |
-| 2048, 128 | | 693.55 | 1,086.56 |
-| 5000, 500 | | 947.49 | 1,532.45 |
-| 20000, 2000 | | 641.11 | 1,097.84 |
-
-### FP8 Models:
-```
+| | GPU: | B200 | GB200 |
+|:-----------------------------|:---|:---------|:--------------|
+| | TP Size | 4 | 4 |
+| ISL, OSL | | | |
+| | | | |
+| 128, 128 | | 6,218.89 | 6,598.97 |
+| 128, 2048 | | 7,178.10 | 7,497.40 |
+| 128, 4096 | | 5,890.89 | 5,898.19 |
+| 500, 2000 | | 5,844.37 | 6,198.33 |
+| 1000, 1000 | | 4,958.53 | 5,243.35 |
+| 1000, 2000 | | 4,874.16 | 4,905.51 |
+| 1024, 2048 | | 4,833.19 | 4,686.38 |
+| 2048, 128 | | 737.95 | 761.58 |
+| 2048, 2048 | | 4,024.02 | 4,326.56 |
+| 5000, 500 | | 1,032.40 | 1,078.87 |
+| 20000, 2000 | | 667.39 | 649.95 |
+
+### FP8 Models
+
+```text
 nvidia/Llama-3.1-8B-Instruct-FP8
 nvidia/Llama-3.3-70B-Instruct-FP8
 nvidia/Llama-3.1-405B-Instruct-FP8
@@ -73,61 +89,65 @@ nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8
 
 #### Llama 3.1 8B FP8
 
-| | GPU | H200 141GB HBM3 | H100 80GB HBM3 |
-|:-----------------------------|:---|:------------------|:-----------------|
-| | TP Size | 1 | 1 |
-| ISL, OSL | | | |
-| | | | |
-| 128, 128 | | 27,970.14 | 27,688.36 |
-| 128, 2048 | | 23,326.38 | 21,841.15 |
-| 128, 4096 | | 17,508.51 | 13,730.89 |
-| 500, 2000 | | 21,390.41 | 17,833.34 |
-| 1000, 1000 | | 17,366.89 | 15,270.62 |
-| 1000, 2000 | | 16,831.31 | 13,798.08 |
-| 1024, 2048 | | 16,737.03 | 13,385.50 |
-| 2048, 128 | | 3,488.03 | 3,414.67 |
-| 5000, 500 | | 3,813.69 | 3,394.54 |
-| 20000, 2000 | | 1,696.66 | 1,345.42 |
+| | GPU: | GH200 | H100 | H200 |
+|:-----------------------------|:---|:--------------|:-----------------|:------------------|
+| | TP Size | 1 | 1 | 1 |
+| ISL, OSL | | | | |
+| | | | | |
+| 128, 128 | | 27,304.25 | 26,401.48 | 27,027.80 |
+| 128, 2048 | | 24,045.60 | 21,413.21 | 23,102.25 |
+| 128, 4096 | | 15,409.85 | 13,541.54 | 17,396.83 |
+| 500, 2000 | | 20,123.88 | 17,571.01 | 19,759.16 |
+| 1000, 1000 | | 16,352.99 | 14,991.62 | 17,162.49 |
+| 1000, 2000 | | 15,705.82 | 13,505.23 | 16,227.11 |
+| 1024, 2048 | | 16,102.52 | 13,165.91 | 16,057.66 |
+| 2048, 128 | | 3,573.85 | 3,275.55 | 3,390.69 |
+| 2048, 2048 | | 10,767.05 | 9,462.43 | 11,822.14 |
+| 5000, 500 | | 3,584.74 | 3,276.47 | 3,758.08 |
+| 20000, 2000 | | 1,393.31 | 1,340.69 | 1,705.68 |
 
 #### Llama 3.3 70B FP8
 
-| | GPU | H200 141GB HBM3 | | | | H100 80GB HBM3 | | | |
-|:-----------------------------|:---|:------------------|:---------|:----------|:----------|:-----------------|:---------|:----------|:----------|
-| | TP Size | 1 | 2 | 4 | 8 | 1 | 2 | 4 | 8 |
-| ISL, OSL | | | | | | | | | |
-| | | | | | | | | | |
-| 128, 128 | | 3,605.47 | 6,427.69 | 10,407.42 | 15,434.37 | 3,128.33 | 6,216.91 | | |
-| 128, 2048 | | 4,315.80 | 8,464.03 | 13,508.59 | 20,759.72 | 756.42 | 5,782.57 | 11,464.94 | 17,424.32 |
-| 128, 4096 | | 2,701.17 | 5,573.55 | 11,458.56 | 16,668.75 | | 3,868.37 | 8,206.39 | 12,624.61 |
-| 500, 2000 | | 3,478.76 | 6,740.06 | 12,200.18 | | | 4,684.06 | 9,903.53 | 14,553.93 |
-| 1000, 1000 | | 2,744.32 | 5,119.72 | 8,685.44 | 12,744.51 | 742.14 | 4,247.19 | 7,435.65 | 11,018.81 |
-| 1000, 2000 | | 2,896.44 | 5,847.26 | 9,031.21 | 13,141.17 | 533.74 | 3,866.53 | 7,611.12 | 11,139.22 |
-| 1024, 2048 | | 2,874.18 | 5,568.61 | 8,946.71 | 13,082.62 | 530.16 | 3,796.68 | 7,575.24 | 11,004.31 |
-| 2048, 128 | | 435.90 | 772.67 | 1,264.76 | | | 736.89 | 1,213.33 | 1,839.22 |
-| 2048, 2048 | | | | | 10,412.85 | | | | |
-| 5000, 500 | | 545.96 | 997.15 | 1,698.22 | 2,655.28 | 204.94 | 862.91 | 1,552.68 | 2,369.84 |
-| 20000, 2000 | | 276.66 | 620.33 | 1,161.29 | 1,985.85 | | 416.13 | 903.66 | 1,554.10 |
+| | GPU: | H100 | H200 |
+|:-----------------------------|:---|:-----------------|:------------------|
+| | TP Size | 2 | 2 |
+| ISL, OSL | | | |
+| | | | |
+| 128, 128 | | 6,092.28 | 6,327.98 |
+| 128, 2048 | | 5,892.94 | 7,467.36 |
+| 128, 4096 | | 3,828.46 | 5,526.42 |
+| 500, 2000 | | 4,654.74 | 6,639.15 |
+| 1000, 1000 | | 4,181.06 | 4,773.33 |
+| 1000, 2000 | | 3,708.93 | 5,790.36 |
+| 1024, 2048 | | 3,785.04 | 5,480.44 |
+| 2048, 128 | | 723.40 | 747.55 |
+| 2048, 2048 | | 2,785.53 | 3,775.80 |
+| 5000, 500 | | 865.55 | 978.28 |
+| 20000, 2000 | | 411.85 | 609.42 |
 
 #### Llama 3.1 405B FP8
-
-| | GPU | H200 141GB HBM3 | H100 80GB HBM3 |
-|:-----------------------------|:---|:------------------|:-----------------|
-| | TP Size | 8 | 8 |
-| ISL, OSL | | | |
-| | | | |
-| 128, 2048 | | 5,567.87 | |
-| 128, 4096 | | 5,136.85 | |
-| 500, 2000 | | 4,787.61 | 3,673.91 |
-| 1000, 1000 | | 3,286.30 | 3,012.22 |
-| 1000, 2000 | | 3,636.76 | 3,262.20 |
-| 1024, 2048 | | 3,618.66 | 3,109.70 |
-| 2048, 128 | | 443.10 | 449.02 |
-| 5000, 500 | | 645.46 | |
-| 20000, 2000 | | | 372.12 |
+| | GPU: | H100 | H200 |
+|:-----------------------------|:---|:-----------------|:------------------|
+| | TP Size | 8 | 8 |
+| Runtime Input/Output Lengths | | | |
+| | | | |
+| 128, 128 | | | 3,705.18 |
+| 128, 2048 | | 4,517.39 | 4,715.13 |
+| 128, 4096 | | 2,910.31 | 4,475.91 |
+| 500, 2000 | | 3,664.62 | 4,804.10 |
+| 1000, 1000 | | 2,955.50 | 3,208.25 |
+| 1000, 2000 | | 2,884.69 | 3,630.29 |
+| 1024, 2048 | | 3,237.41 | 3,609.50 |
+| 2048, 128 | | 433.47 | 441.35 |
+| 2048, 2048 | | 2,216.55 | 2,840.86 |
+| 5000, 500 | | 579.05 | 645.26 |
+| 20000, 2000 | | 363.27 | 509.87 |
 
 #### Llama 4 Maverick FP8
 
-| | GPU | H200 141GB HBM3 | H100 80GB HBM3 |
+Note: Performance for Llama 4 on sequence lengths less than 8,192 tokens is affected by an issue introduced in v0.21. To reproduce the Llama 4 performance noted here, please use v0.20.
+
+| | GPU | H200 | H100 |
 |:-----------------------------|:---|:------------------|:-----------------|
 | | TP Size | 8 | 8 |
 | ISL, OSL | | | |
@@ -140,7 +160,6 @@ nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8
 | 2048, 128 | | 4,364.06 | 3,832.38 |
 | 2048, 2048 | | 12,800.89 | |
 | 5000, 500 | | 5,128.60 | |
-| 20000, 2000 | | 1,764.27 | 1,400.79 |
 
 ## Reproducing Benchmarked Results

@@ -216,7 +235,7 @@ a model name (HuggingFace reference or path to a local model), a [generated data
 trtllm-bench --model $model_name throughput --dataset $dataset_file --backend pytorch --extra_llm_api_options $llm_options
 ```
 
-The data collected for the v0.20 benchmarks was run with the following file:
+The data collected for the v0.21 benchmarks was run with the following file:
 
 `llm_options.yml`
 ```yaml
@@ -240,7 +259,7 @@ cuda_graph_config:
   - 8192
 ```
 
-In a majority of cases, we also use a higher KV cache percentage by setting `--kv_cache_free_gpu_mem_fraction 0.95` in the benchmark command. This allows us to obtain better performance than the default setting of `0.90`. We fall back to `0.90` if we hit an out of memory issue.
+In many cases, we also use a higher KV cache percentage by setting `--kv_cache_free_gpu_mem_fraction 0.95` in the benchmark command. This allows us to obtain better performance than the default setting of `0.90`. We fall back to `0.90` or lower if out-of-memory errors are encountered.
 
 The results will be printed to the terminal upon benchmark completion. For example,
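Putting the pieces of this section together, a full invocation pairs the documented `trtllm-bench` command with the documented KV-cache flag. This is a sketch only: the model, dataset, and options-file values below are placeholders, not the ones used for these benchmarks, and the command is printed rather than executed since an actual run requires the benchmarked GPUs.

```shell
# Sketch of a complete benchmark invocation; all values are placeholders.
model_name="nvidia/Llama-3.3-70B-Instruct-FP8"  # any model from the tables
dataset_file="dataset.json"                     # a generated dataset file
llm_options="llm_options.yml"                   # the options file shown above

# Documented command plus the documented --kv_cache_free_gpu_mem_fraction flag.
cmd="trtllm-bench --model $model_name throughput --dataset $dataset_file --backend pytorch --extra_llm_api_options $llm_options --kv_cache_free_gpu_mem_fraction 0.95"
echo "$cmd"
```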
