
Commit d69b27e

doc: remove the outdated features which marked as Experimental

Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>

1 parent c7ffadf · commit d69b27e

File tree

4 files changed: +2 −17 lines changed


docs/source/advanced/gpt-attention.md

Lines changed: 0 additions & 2 deletions

````diff
@@ -112,8 +112,6 @@ printed.
 #### XQA Optimization
 
 Another optimization for MQA/GQA in generation phase called XQA optimization.
-It is still experimental feature and support limited configurations. LLAMA2 70B
-is one model that it supports.
 
 Support matrix of the XQA optimization:
 - FP16 / BF16 compute data type.
````

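The hunk above drops the experimental caveat from the XQA section, which accelerates MQA/GQA during the generation phase. As background, here is a minimal NumPy sketch (illustrative only, not TensorRT-LLM code) of one grouped-query-attention decode step, in which groups of query heads share a single KV head; MQA is the special case of one shared KV head:

```python
# Illustrative GQA decode step (not TensorRT-LLM code).
# num_q_heads query heads share num_kv_heads cached K/V heads;
# MQA is the special case num_kv_heads == 1.
import numpy as np

def gqa_single_step(q, k_cache, v_cache, num_kv_heads):
    """One generation-phase attention step.
    q:       (num_q_heads, head_dim) query for the newly decoded token
    k_cache: (num_kv_heads, seq_len, head_dim) cached keys
    v_cache: (num_kv_heads, seq_len, head_dim) cached values
    """
    num_q_heads, head_dim = q.shape
    group = num_q_heads // num_kv_heads       # query heads per shared KV head
    out = np.empty_like(q)
    for h in range(num_q_heads):
        kv = h // group                       # map query head -> shared KV head
        scores = k_cache[kv] @ q[h] / np.sqrt(head_dim)   # (seq_len,)
        w = np.exp(scores - scores.max())                 # stable softmax
        w /= w.sum()
        out[h] = w @ v_cache[kv]                          # (head_dim,)
    return out

q = np.random.randn(8, 16)
k = np.random.randn(2, 5, 16)   # 8 query heads share 2 KV heads (GQA)
v = np.random.randn(2, 5, 16)
print(gqa_single_step(q, k, v, num_kv_heads=2).shape)  # (8, 16)
```

The point kernels like XQA exploit is visible here: every query head in a group reads the same `k_cache[kv]` / `v_cache[kv]`, so KV-cache loads can be amortized across the group.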
docs/source/advanced/speculative-decoding.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -167,7 +167,7 @@ TensorRT-LLM implements the ReDrafter model such that logits prediction, beam se
 
 The EAGLE approach enhances the single-model Medusa method by predicting and verifying tokens using the same model. Similarly to ReDrafter, it predicts draft tokens using a recurrent predictor where each draft token depends on the previous one. However, unlike ReDrafter, it uses a single-layer transformer model to predict draft tokens from previous hidden states and decoded tokens. In the EAGLE-1 decoding tree needs to be known during the decoding. In the EAGLE-2 this tree is assembled during the execution by searching for the most probable hypothesis along the beam.
 
-Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft tokens acceptance and draft token generation are performed inside of the TensorRT engine. EAGLE-1 and EAGLE-2 are both supported, while EAGLE-2 is currently in the experimental stage. Please, visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
+Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft tokens acceptance and draft token generation are performed inside of the TensorRT engine (EAGLE-1 and EAGLE-2 are both supported). Please, visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
 
 ## Lookahead Decoding
 
````

docs/source/performance/perf-benchmarking.md

Lines changed: 0 additions & 9 deletions

````diff
@@ -236,15 +236,6 @@ The following command builds an FP8 quantized engine by specifying the engine tu
 trtllm-bench --model meta-llama/Llama-3.1-8B build --quantization FP8 --max_seq_len 4096 --max_batch_size 1024 --max_num_tokens 2048
 ```
 
-[Experimental] Build engine with target ISL/OSL for optimization:
-In this experimental mode, you can provide hints to `trtllm-bench`'s tuning heuristic to optimize the engine on specific ISL and OSL targets.
-Generally, the target ISL and OSL aligns with the average ISL and OSL of the dataset, but you can experiment with different values to optimize the engine using this mode.
-The following command builds an FP8 quantized engine and optimizes for ISL:OSL targets of 128:128.
-
-```shell
-trtllm-bench --model meta-llama/Llama-3.1-8B build --quantization FP8 --max_seq_len 4096 --target_isl 128 --target_osl 128
-```
-
 
 #### Parallelism Mapping Support
 The `trtllm-bench build` subcommand supports combinations of tensor-parallel (TP) and pipeline-parallel (PP) mappings as long as the world size (`tp_size x pp_size`) `<=` `8`. The parallelism mapping in build subcommand is controlled by `--tp_size` and `--pp_size` options. The following command builds an engine with TP2-PP2 mapping.
````
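The TP2-PP2 command the final context line refers to lies outside this hunk and was not captured. A sketch of what it plausibly looks like, assuming the `--tp_size`/`--pp_size` options named in that paragraph combine with the FP8 build options shown earlier (unverified against any particular trtllm-bench release):

```shell
# Hypothetical reconstruction: TP2-PP2 (world size 4 <= 8) FP8 engine build,
# combining the flags named in the surrounding documentation.
trtllm-bench --model meta-llama/Llama-3.1-8B build \
    --quantization FP8 --max_seq_len 4096 \
    --tp_size 2 --pp_size 2
```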

docs/source/torch.md

Lines changed: 1 addition & 5 deletions

````diff
@@ -1,11 +1,7 @@
 # PyTorch Backend
 
-```{note}
-Note:
-This feature is currently experimental, and the related API is subjected to change in future versions.
-```
 
-To enhance the usability of the system and improve developer efficiency, TensorRT-LLM launches a new experimental backend based on PyTorch.
+To enhance the usability of the system and improve developer efficiency, TensorRT-LLM launches a new backend based on PyTorch.
 
 The PyTorch backend of TensorRT-LLM is available in version 0.17 and later. You can try it via importing `tensorrt_llm._torch`.
 
````
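The last context line above says the backend is reached by importing `tensorrt_llm._torch`; only that module path is stated by the doc. A minimal sketch, assuming a TensorRT-LLM >= 0.17 installation is present:

```python
# Requires an installed TensorRT-LLM >= 0.17; not runnable without it.
# Only the module path below is taken from the documentation text.
import tensorrt_llm._torch  # exposes the PyTorch-based backend
```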

0 commit comments
