
Commit 18d1941

zbpatel authored and dc3671 committed

[doc] Update perf_overview.md for release 0.21 (NVIDIA#6270)

Signed-off-by: zpatel <22306219+zbpatel@users.noreply.github.com>

1 parent 03430ed · commit 18d1941

File tree: 1 file changed, +104 −85 lines

docs/source/performance/perf-overview.md

Lines changed: 104 additions & 85 deletions
@@ -12,6 +12,8 @@ Tuning batch sizes, parallelism configurations, and other options may lead to im
 
 For DeepSeek R1 performance, please check out our [performance guide](../blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md)
 
+For more information on benchmarking with `trtllm-bench` see this NVIDIA [blog post](https://developer.nvidia.com/blog/llm-inference-benchmarking-performance-tuning-with-tensorrt-llm/).
+
 ## Throughput Measurements
 
 The below table shows performance data where a local inference client is fed requests at an infinite rate (no delay between messages),
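The throughput figures in the tables below are aggregate tokens-per-second numbers for a given input/output sequence length (ISL/OSL). As an illustration only (the helper below is ours, not part of the document or of `trtllm-bench`, and the inputs are hypothetical), one common definition of such a figure is total generated tokens divided by wall-clock time:

```python
# Illustrative only: how an aggregate output-token throughput figure like
# those in the tables can be computed. All values here are hypothetical.
def output_token_throughput(num_requests: int, osl: int, elapsed_s: float) -> float:
    """Total generated tokens divided by wall-clock benchmark time."""
    return (num_requests * osl) / elapsed_s

# e.g. 1024 requests, each generating OSL=2048 tokens, finishing in 300 s
print(round(output_token_throughput(1024, 2048, 300.0), 2))
```

Note that `trtllm-bench` reports its own measured statistics at the end of a run; this sketch only shows the shape of the metric, not the tool's exact accounting.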
@@ -21,50 +23,64 @@ The performance numbers below were collected using the steps described in this d
 
 Testing was performed on models with weights quantized using [ModelOpt](https://nvidia.github.io/TensorRT-Model-Optimizer/#) and published by NVIDIA on the [Model Optimizer HuggingFace Collection](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4).
 
-### FP4 Models:
-```
+### Hardware
+The following GPU variants were used for testing:
+- H100 SXM 80GB (DGX H100)
+- H200 SXM 141GB (DGX H200)
+- GH200 96GB HBM3 (480GB LPDDR5X)
+- B200 180GB (DGX B200)
+- GB200 192GB (GB200 NVL72)
+
+Other hardware variants may have different TDP, memory bandwidth, core count, or other features leading to performance differences on these workloads.
+
+### FP4 Models
+
+```text
 nvidia/Llama-3.3-70B-Instruct-FP4
 nvidia/Llama-3.1-405B-Instruct-FP4
 ```
 
 #### Llama 3.3 70B FP4
 
-| | GPU | B200 | | | |
-|:------------------------|:--------|:----------|:----------|:----------|:----------|
-| | TP Size | 1 | 2 | 4 | 8 |
-| ISL, OSL | | | | | |
-| | | | | | |
-| 128, 128 | | 10,994.48 | 17,542.11 | 24,667.31 | 27,272.27 |
-| 128, 2048 | | 9,580.46 | 15,432.35 | 23,568.12 | 31,174.31 |
-| 128, 4096 | | 6,418.39 | 9,841.53 | 17,808.76 | 25,229.25 |
-| 500, 2000 | | 7,343.32 | 11,850.57 | 20,709.67 | 28,038.78 |
-| 1000, 1000 | | 6,752.53 | 10,815.88 | 16,413.04 | 20,060.66 |
-| 1000, 2000 | | 6,670.07 | 9,830.73 | 15,597.49 | 20,672.37 |
-| 1024, 2048 | | 6,636.75 | 9,807.13 | 15,519.23 | 20,617.28 |
-| 2048, 128 | | 1,342.17 | 1,989.41 | 3,033.14 | 4,035.64 |
-| 5000, 500 | | 1,429.67 | 2,419.67 | 3,686.84 | 5,182.96 |
-| 20000, 2000 | | 629.77 | 1,177.01 | 2,120.66 | 3,429.03 |
+| | GPU: | B200 | GB200 |
+|:-----------------------------|:---|:----------|:--------------|
+| | TP Size | 1 | 1 |
+| ISL, OSL | | | |
+| | | | |
+| 128, 128 | | 10,613.84 | 11,100.97 |
+| 128, 2048 | | 9,445.51 | 10,276.05 |
+| 128, 4096 | | 6,276.85 | 7,351.12 |
+| 500, 2000 | | 6,983.27 | 8,194.30 |
+| 1000, 1000 | | 6,434.29 | 7,401.80 |
+| 1000, 2000 | | 6,725.03 | 6,478.72 |
+| 1024, 2048 | | 6,546.61 | 7,922.88 |
+| 2048, 128 | | 1,330.35 | 1,418.47 |
+| 2048, 2048 | | 4,528.48 | 5,326.77 |
+| 5000, 500 | | 1,427.44 | 1,502.44 |
+| 20000, 2000 | | 636.36 | 732.43 |
 
 #### Llama 3.1 405B FP4
 
-| | GPU | B200 | |
-|:------------------------|:-------|:---------|:----------|
-| | TP Size | 4 | 8 |
-| ISL, OSL | | | |
-| | | | |
-| 128, 128 | | 6,163.81 | 9,002.90 |
-| 128, 2048 | | 7,081.21 | 10,288.28 |
-| 128, 4096 | | 6,028.37 | 8,713.77 |
-| 500, 2000 | | 5,858.75 | 9,125.86 |
-| 1000, 1000 | | 4,848.00 | 7,582.97 |
-| 1000, 2000 | | 5,375.25 | 7,626.28 |
-| 1024, 2048 | | 5,345.70 | 7,464.03 |
-| 2048, 128 | | 693.55 | 1,086.56 |
-| 5000, 500 | | 947.49 | 1,532.45 |
-| 20000, 2000 | | 641.11 | 1,097.84 |
-
-### FP8 Models:
-```
+| | GPU: | B200 | GB200 |
+|:-----------------------------|:---|:---------|:--------------|
+| | TP Size | 4 | 4 |
+| ISL, OSL | | | |
+| | | | |
+| 128, 128 | | 6,218.89 | 6,598.97 |
+| 128, 2048 | | 7,178.10 | 7,497.40 |
+| 128, 4096 | | 5,890.89 | 5,898.19 |
+| 500, 2000 | | 5,844.37 | 6,198.33 |
+| 1000, 1000 | | 4,958.53 | 5,243.35 |
+| 1000, 2000 | | 4,874.16 | 4,905.51 |
+| 1024, 2048 | | 4,833.19 | 4,686.38 |
+| 2048, 128 | | 737.95 | 761.58 |
+| 2048, 2048 | | 4,024.02 | 4,326.56 |
+| 5000, 500 | | 1,032.40 | 1,078.87 |
+| 20000, 2000 | | 667.39 | 649.95 |
+
+### FP8 Models
+
+```text
 nvidia/Llama-3.1-8B-Instruct-FP8
 nvidia/Llama-3.3-70B-Instruct-FP8
 nvidia/Llama-3.1-405B-Instruct-FP8
@@ -73,61 +89,65 @@ nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8
 
 #### Llama 3.1 8B FP8
 
-| | GPU | H200 141GB HBM3 | H100 80GB HBM3 |
-|:-----------------------------|:---|:------------------|:-----------------|
-| | TP Size | 1 | 1 |
-| ISL, OSL | | | |
-| | | | |
-| 128, 128 | | 27,970.14 | 27,688.36 |
-| 128, 2048 | | 23,326.38 | 21,841.15 |
-| 128, 4096 | | 17,508.51 | 13,730.89 |
-| 500, 2000 | | 21,390.41 | 17,833.34 |
-| 1000, 1000 | | 17,366.89 | 15,270.62 |
-| 1000, 2000 | | 16,831.31 | 13,798.08 |
-| 1024, 2048 | | 16,737.03 | 13,385.50 |
-| 2048, 128 | | 3,488.03 | 3,414.67 |
-| 5000, 500 | | 3,813.69 | 3,394.54 |
-| 20000, 2000 | | 1,696.66 | 1,345.42 |
+| | GPU: | GH200 | H100 | H200 |
+|:-----------------------------|:---|:--------------|:-----------------|:------------------|
+| | TP Size | 1 | 1 | 1 |
+| ISL, OSL | | | | |
+| | | | | |
+| 128, 128 | | 27,304.25 | 26,401.48 | 27,027.80 |
+| 128, 2048 | | 24,045.60 | 21,413.21 | 23,102.25 |
+| 128, 4096 | | 15,409.85 | 13,541.54 | 17,396.83 |
+| 500, 2000 | | 20,123.88 | 17,571.01 | 19,759.16 |
+| 1000, 1000 | | 16,352.99 | 14,991.62 | 17,162.49 |
+| 1000, 2000 | | 15,705.82 | 13,505.23 | 16,227.11 |
+| 1024, 2048 | | 16,102.52 | 13,165.91 | 16,057.66 |
+| 2048, 128 | | 3,573.85 | 3,275.55 | 3,390.69 |
+| 2048, 2048 | | 10,767.05 | 9,462.43 | 11,822.14 |
+| 5000, 500 | | 3,584.74 | 3,276.47 | 3,758.08 |
+| 20000, 2000 | | 1,393.31 | 1,340.69 | 1,705.68 |
 
 #### Llama 3.3 70B FP8
 
-| | GPU | H200 141GB HBM3 | | | | H100 80GB HBM3 | | | |
-|:-----------------------------|:---|:------------------|:---------|:----------|:----------|:-----------------|:---------|:----------|:----------|
-| | TP Size | 1 | 2 | 4 | 8 | 1 | 2 | 4 | 8 |
-| ISL, OSL | | | | | | | | | |
-| | | | | | | | | | |
-| 128, 128 | | 3,605.47 | 6,427.69 | 10,407.42 | 15,434.37 | 3,128.33 | 6,216.91 | | |
-| 128, 2048 | | 4,315.80 | 8,464.03 | 13,508.59 | 20,759.72 | 756.42 | 5,782.57 | 11,464.94 | 17,424.32 |
-| 128, 4096 | | 2,701.17 | 5,573.55 | 11,458.56 | 16,668.75 | | 3,868.37 | 8,206.39 | 12,624.61 |
-| 500, 2000 | | 3,478.76 | 6,740.06 | 12,200.18 | | | 4,684.06 | 9,903.53 | 14,553.93 |
-| 1000, 1000 | | 2,744.32 | 5,119.72 | 8,685.44 | 12,744.51 | 742.14 | 4,247.19 | 7,435.65 | 11,018.81 |
-| 1000, 2000 | | 2,896.44 | 5,847.26 | 9,031.21 | 13,141.17 | 533.74 | 3,866.53 | 7,611.12 | 11,139.22 |
-| 1024, 2048 | | 2,874.18 | 5,568.61 | 8,946.71 | 13,082.62 | 530.16 | 3,796.68 | 7,575.24 | 11,004.31 |
-| 2048, 128 | | 435.90 | 772.67 | 1,264.76 | | | 736.89 | 1,213.33 | 1,839.22 |
-| 2048, 2048 | | | | | 10,412.85 | | | | |
-| 5000, 500 | | 545.96 | 997.15 | 1,698.22 | 2,655.28 | 204.94 | 862.91 | 1,552.68 | 2,369.84 |
-| 20000, 2000 | | 276.66 | 620.33 | 1,161.29 | 1,985.85 | | 416.13 | 903.66 | 1,554.10 |
+| | GPU: | H100 | H200 |
+|:-----------------------------|:---|:-----------------|:------------------|
+| | TP Size | 2 | 2 |
+| ISL, OSL | | | |
+| | | | |
+| 128, 128 | | 6,092.28 | 6,327.98 |
+| 128, 2048 | | 5,892.94 | 7,467.36 |
+| 128, 4096 | | 3,828.46 | 5,526.42 |
+| 500, 2000 | | 4,654.74 | 6,639.15 |
+| 1000, 1000 | | 4,181.06 | 4,773.33 |
+| 1000, 2000 | | 3,708.93 | 5,790.36 |
+| 1024, 2048 | | 3,785.04 | 5,480.44 |
+| 2048, 128 | | 723.40 | 747.55 |
+| 2048, 2048 | | 2,785.53 | 3,775.80 |
+| 5000, 500 | | 865.55 | 978.28 |
+| 20000, 2000 | | 411.85 | 609.42 |
 
 #### Llama 3.1 405B FP8
-
-| | GPU | H200 141GB HBM3 | H100 80GB HBM3 |
-|:-----------------------------|:---|:------------------|:-----------------|
-| | TP Size | 8 | 8 |
-| ISL, OSL | | | |
-| | | | |
-| 128, 2048 | | 5,567.87 | |
-| 128, 4096 | | 5,136.85 | |
-| 500, 2000 | | 4,787.61 | 3,673.91 |
-| 1000, 1000 | | 3,286.30 | 3,012.22 |
-| 1000, 2000 | | 3,636.76 | 3,262.20 |
-| 1024, 2048 | | 3,618.66 | 3,109.70 |
-| 2048, 128 | | 443.10 | 449.02 |
-| 5000, 500 | | 645.46 | |
-| 20000, 2000 | | | 372.12 |
+| | GPU: | H100 | H200 |
+|:-----------------------------|:---|:-----------------|:------------------|
+| | TP Size | 8 | 8 |
+| Runtime Input/Output Lengths | | | |
+| | | | |
+| 128, 128 | | | 3,705.18 |
+| 128, 2048 | | 4,517.39 | 4,715.13 |
+| 128, 4096 | | 2,910.31 | 4,475.91 |
+| 500, 2000 | | 3,664.62 | 4,804.10 |
+| 1000, 1000 | | 2,955.50 | 3,208.25 |
+| 1000, 2000 | | 2,884.69 | 3,630.29 |
+| 1024, 2048 | | 3,237.41 | 3,609.50 |
+| 2048, 128 | | 433.47 | 441.35 |
+| 2048, 2048 | | 2,216.55 | 2,840.86 |
+| 5000, 500 | | 579.05 | 645.26 |
+| 20000, 2000 | | 363.27 | 509.87 |
 
 #### Llama 4 Maverick FP8
 
-| | GPU | H200 141GB HBM3 | H100 80GB HBM3 |
+Note: Performance for Llama 4 on sequence lengths less than 8,192 tokens is affected by an issue introduced in v0.21. To reproduce the Llama 4 performance noted here, please use v0.20.
+
+| | GPU | H200 | H100 |
 |:-----------------------------|:---|:------------------|:-----------------|
 | | TP Size | 8 | 8 |
 | ISL, OSL | | | |
@@ -140,7 +160,6 @@ nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8
 | 2048, 128 | | 4,364.06 | 3,832.38 |
 | 2048, 2048 | | 12,800.89 | |
 | 5000, 500 | | 5,128.60 | |
-| 20000, 2000 | | 1,764.27 | 1,400.79 |
 
 ## Reproducing Benchmarked Results

@@ -216,7 +235,7 @@ a model name (HuggingFace reference or path to a local model), a [generated data
 trtllm-bench --model $model_name throughput --dataset $dataset_file --backend pytorch --extra_llm_api_options $llm_options
 ```
 
-The data collected for the v0.20 benchmarks was run with the following file:
+The data collected for the v0.21 benchmarks was run with the following file:
 
 `llm_options.yml`
 ```yaml
@@ -240,7 +259,7 @@ cuda_graph_config:
   - 8192
 ```
 
-In a majority of cases, we also use a higher KV cache percentage by setting `--kv_cache_free_gpu_mem_fraction 0.95` in the benchmark command. This allows us to obtain better performance than the default setting of `0.90`. We fall back to `0.90` if we hit an out of memory issue.
+In many cases, we also use a higher KV cache percentage by setting `--kv_cache_free_gpu_mem_fraction 0.95` in the benchmark command. This allows us to obtain better performance than the default setting of `0.90`. We fall back to `0.90` or lower if out-of-memory errors are encountered.
 
 The results will be printed to the terminal upon benchmark completion. For example,
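Putting the pieces of this section together, a full invocation pairs the documented `trtllm-bench` command with the documented KV-cache flag. This is a sketch only: the model, dataset, and options-file values below are placeholders, not the ones used for these benchmarks, and the command is printed rather than executed since an actual run requires the benchmarked GPUs.

```shell
# Sketch of a complete benchmark invocation; all values are placeholders.
model_name="nvidia/Llama-3.3-70B-Instruct-FP8"  # any model from the tables
dataset_file="dataset.json"                     # a generated dataset file
llm_options="llm_options.yml"                   # the options file shown above

# Documented command plus the documented --kv_cache_free_gpu_mem_fraction flag.
cmd="trtllm-bench --model $model_name throughput --dataset $dataset_file --backend pytorch --extra_llm_api_options $llm_options --kv_cache_free_gpu_mem_fraction 0.95"
echo "$cmd"
```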
