[None][doc]: remove the outdated features which marked as Experimental #5995


Conversation

@nv-guomingz
Collaborator

@nv-guomingz commented on Jul 14, 2025
edited by the coderabbitai bot

Clean up the documentation by removing the experimental label:

  • PyTorch Backend (Experimental --> Beta)
  • Disagg-serving (Experimental --> Prototype)
  • AutoDeploy (Experimental --> Prototype)
  • Use tensorrtllm_backend for triton inference server (Experimental --> Prototype)

Summary by CodeRabbit

  • Documentation
    • Removed references to features and techniques being "experimental" or subject to change across multiple documentation pages and READMEs.
    • Clarified default behavior and support contexts for specific features in the documentation.
    • Updated explanations and recommendations for FP8 GEMV/GEMM plugin usage, providing more detail and clearer guidance.
    • Simplified or removed descriptions of deprecated or experimental build modes and configuration options.
    • Updated feature status descriptions from "experimental" to "prototype" or "beta" in various documentation and example READMEs.

@nv-guomingz force-pushed the user/guomingz/clean_experimental branch from 107dbb3 to a808cc8 on July 14, 2025 08:39
@nv-guomingz requested a review from a team as a code owner on July 14, 2025 08:39
@nv-guomingz force-pushed the user/guomingz/clean_experimental branch from a808cc8 to d69b27e on July 14, 2025 08:48
@nv-guomingz force-pushed the user/guomingz/clean_experimental branch from d69b27e to 909bcb1 on July 14, 2025 08:56
@nv-guomingz force-pushed the user/guomingz/clean_experimental branch from 909bcb1 to e3f1e8c on July 14, 2025 11:37
@nv-guomingz force-pushed the user/guomingz/clean_experimental branch 2 times, most recently from cc18db1 to c6a80d1 on July 14, 2025 14:08
@nv-guomingz force-pushed the user/guomingz/clean_experimental branch from c6a80d1 to ea7d44c on July 28, 2025 17:02
@coderabbitai
Contributor

coderabbitai bot commented on Jul 28, 2025
edited

📝 Walkthrough

Walkthrough

This update modifies documentation files to remove or reword references to "experimental" status for several features, clarify default behaviors, and update technical explanations. No changes to code or public interfaces are present; all modifications are limited to documentation content and README files.

Changes

Cohort / File(s) | Change Summary
Experimental Status Removal (General)
docs/source/advanced/gpt-attention.md, docs/source/torch.md, examples/eagle/README.md, docs/source/reference/precision.md, README.md, docs/source/advanced/disaggregated-service.md, examples/auto_deploy/README.md, examples/disaggregated/README.md, examples/models/core/deepseek_v3/README.md, examples/sample_weight_stripping/README.md
Removed or replaced references to features being "experimental" with "prototype" or "beta" status for XQA optimization, PyTorch backend, EAGLE-2, quantization examples, AutoDeploy backend, disaggregated service, dynamic scaling, tensorrtllm_backend for Triton, and sample weight stripping. No functional changes made.
Speculative Decoding Documentation
docs/source/advanced/speculative-decoding.md
Reworded the description of EAGLE speculative decoding to consolidate EAGLE-1 and EAGLE-2 support mentions, removing the explicit note about EAGLE-2's experimental status.
Performance Benchmarking Documentation
docs/source/performance/perf-benchmarking.md
Removed the section describing the experimental mode for building TensorRT-LLM engines with target ISL/OSL values, including example commands and explanations.
Model Weights Loader Clarification
docs/source/architecture/model-weights-loader.md
Clarified that the weights loader is enabled by default for LLaMA and Qwen models only when using the TensorRT flow, specifying the context more precisely.
FP8 Plugin Documentation Update
examples/models/core/llama/README.md
Updated the explanation of FP8 GEMV/GEMM plugin usage: replaced "Experimental" with "Note," provided a more detailed technical explanation of FP8 gemv, and removed the warning about performance degradation for larger batch sizes.

Sequence Diagram(s)

No sequence diagrams are generated, as all changes are limited to documentation and do not affect control flow or feature implementation.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Suggested labels

Documentation

Suggested reviewers

  • litaotju
  • syuoni

Note

⚡️ Unit Test Generation is now available in beta!

Learn more here, or try it out under "Finishing Touches" below.

✨ Finishing Touches
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai or @coderabbitai title anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Contributor

@coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (5)
docs/source/torch.md (1)

4-4: Re-phrase for a smoother reading flow

“launches a new backend” sounds like a one-off event. “introduces” (or “adds”) better reflects the documentation’s timeless nature.

-To enhance the usability of the system and improve developer efficiency, TensorRT-LLM launches a new backend based on PyTorch.
+To enhance usability and developer efficiency, TensorRT-LLM introduces a new backend based on PyTorch.
docs/source/advanced/speculative-decoding.md (1)

171-171: Minor grammar & spacing tidy-up

Remove the redundant “of”, add the missing space, and swap the en-dash for a hyphen to stay consistent.

-Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft tokens acceptance and draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2 are both supported).
+Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model so that logits prediction, draft-token acceptance, and draft-token generation are performed inside the TensorRT engine (EAGLE-1 and EAGLE-2 are both supported).
examples/models/core/llama/README.md (2)

679-679: Capitalise sentence start & tighten wording

-Note: use FP8 GEMV to optimize performance in FP8 small-batch-size cases.
+Note: Use FP8 GEMV to optimise performance in small-batch-size FP8 scenarios.

697-697: Polish long explanatory note for readability

A few micro-fixes improve clarity:

-**Note**: FP8 gemv plugin uses CUDA cores to compute, by contrast to Tensor Core gemm kernel within cuBLAS. Over last year, as cuBLAS have improved their performance by a lot under small M case for Hopper(sm90), FP8 gemv kernel may or may not surpass cuBLAS, depending on specific gemm problem shape. Nonetheless, we still strongly recommend FP8 gemv kernel for Ada (sm89) as cuBLAS still falls behind gemv on it.
+**Note**: The FP8 GEMV plugin runs on CUDA cores, whereas cuBLAS uses Tensor-Core GEMM kernels. Over the last year cuBLAS performance for small-M cases on Hopper (SM90) has improved substantially, so FP8 GEMV may or may not outperform cuBLAS depending on the exact GEMM shape. We still strongly recommend FP8 GEMV on Ada (SM89), where cuBLAS continues to lag behind.
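To make the "small-M" point concrete, here is an illustrative sketch in plain NumPy (not TensorRT-LLM, cuBLAS, or plugin code): when the batch dimension M shrinks to 1, a GEMM is mathematically exactly a GEMV, which is the memory-bound shape the FP8 GEMV plugin targets.

```python
import numpy as np

# Illustrative only: show that a GEMM with M == 1 is exactly a GEMV.
# K is the reduction dimension, N the output width; sizes are arbitrary.
K, N = 512, 256
rng = np.random.default_rng(0)
B = rng.standard_normal((K, N)).astype(np.float32)  # weight matrix (K x N)
a = rng.standard_normal(K).astype(np.float32)       # single activation row

gemv_out = a @ B                  # matrix-vector product (GEMV)
gemm_out = a.reshape(1, K) @ B    # the same computation phrased as a 1xK GEMM

# Both kernels compute the same result; the difference is purely which
# hardware path (CUDA cores vs Tensor Cores) serves the shape best.
assert np.allclose(gemv_out, gemm_out.reshape(-1), atol=1e-4)
```

For M larger than 1 the computation regains its GEMM character, which is why the note above hedges on whether the GEMV kernel still wins on Hopper.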
docs/source/architecture/model-weights-loader.md (1)

252-252: Clarify "by default" clause

The current wording is slightly ambiguous about when the loader is active.

-The weights loader is enabled for LLaMA family models and Qwen models by default with TensorRT flow only.
+By default, the weights loader is enabled for LLaMA-family and Qwen models when using the TensorRT flow.
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 60e4d3a and ea7d44c.

📒 Files selected for processing (8)
  • docs/source/advanced/gpt-attention.md (0 hunks)
  • docs/source/advanced/speculative-decoding.md (1 hunks)
  • docs/source/architecture/model-weights-loader.md (1 hunks)
  • docs/source/performance/perf-benchmarking.md (0 hunks)
  • docs/source/reference/precision.md (1 hunks)
  • docs/source/torch.md (1 hunks)
  • examples/eagle/README.md (0 hunks)
  • examples/models/core/llama/README.md (2 hunks)
💤 Files with no reviewable changes (3)
  • examples/eagle/README.md
  • docs/source/advanced/gpt-attention.md
  • docs/source/performance/perf-benchmarking.md
🧰 Additional context used
🧠 Learnings (2)
docs/source/architecture/model-weights-loader.md (2)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

docs/source/advanced/speculative-decoding.md (1)

Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

🪛 LanguageTool
docs/source/advanced/speculative-decoding.md

[style] ~171-~171: This phrase is redundant. Consider using “inside”.
Context: ...nd draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2...

(OUTSIDE_OF)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@nv-guomingz force-pushed the user/guomingz/clean_experimental branch from ea7d44c to daa23b3 on July 28, 2025 17:14
Contributor

@coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
docs/source/advanced/speculative-decoding.md (1)

171-171: Tighten wording and fix minor grammar issues

The sentence contains a few stylistic hiccups:
• “inside of” → “inside”
• missing space before the parenthesis after “engine”
• plural-singular mismatch in “draft tokens acceptance”
• superfluous comma after “Please”

Proposed tweak:

-Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft tokens acceptance and draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2 are both supported). Please, visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
+Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft token acceptance, and draft token generation are performed inside the TensorRT engine (EAGLE-1 and EAGLE-2 are both supported). Please visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ea7d44c and daa23b3.

📒 Files selected for processing (8)
  • docs/source/advanced/gpt-attention.md (0 hunks)
  • docs/source/advanced/speculative-decoding.md (1 hunks)
  • docs/source/architecture/model-weights-loader.md (1 hunks)
  • docs/source/performance/perf-benchmarking.md (0 hunks)
  • docs/source/reference/precision.md (1 hunks)
  • docs/source/torch.md (1 hunks)
  • examples/eagle/README.md (0 hunks)
  • examples/models/core/llama/README.md (2 hunks)
💤 Files with no reviewable changes (3)
  • docs/source/advanced/gpt-attention.md
  • docs/source/performance/perf-benchmarking.md
  • examples/eagle/README.md
✅ Files skipped from review due to trivial changes (3)
  • docs/source/architecture/model-weights-loader.md
  • docs/source/reference/precision.md
  • docs/source/torch.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/models/core/llama/README.md
🧰 Additional context used
🧠 Learnings (1)
docs/source/advanced/speculative-decoding.md (2)

Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.598Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

🪛 LanguageTool
docs/source/advanced/speculative-decoding.md

[style] ~171-~171: This phrase is redundant. Consider using “inside”.
Context: ...nd draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2...

(OUTSIDE_OF)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@nv-guomingz force-pushed the user/guomingz/clean_experimental branch from daa23b3 to 7e67143 on July 28, 2025 17:22
Contributor

@coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
docs/source/advanced/speculative-decoding.md (2)

171-171: Tighten wording and fix minor formatting issues

The phrase “inside of the TensorRT engine” is redundant, there’s a missing space before the parenthesis, and “draft tokens acceptance” should be singular.

-Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft tokens acceptance and draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2 are both supported). Please, visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
+Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft-token acceptance, and draft-token generation are performed inside the TensorRT engine (EAGLE-1 and EAGLE-2 are both supported). For details on building and running the model, see the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md).

169-170: Correct typo and clarify sentences

“asssembled” has an extra “s”, and both sentences read more naturally with slight re-phrasing.

-In the EAGLE-1 decoding tree needs to be known during the decoding. In the EAGLE-2 this tree is asssembled during the execution by searching for the most probable hypothesis along the beam.
+In EAGLE-1, the decoding tree must be defined before decoding begins. In EAGLE-2, this tree is assembled during execution by searching for the most probable hypothesis along the beam.
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between daa23b3 and 7e67143.

📒 Files selected for processing (14)
  • README.md (1 hunks)
  • docs/source/advanced/disaggregated-service.md (1 hunks)
  • docs/source/advanced/gpt-attention.md (0 hunks)
  • docs/source/advanced/speculative-decoding.md (1 hunks)
  • docs/source/architecture/model-weights-loader.md (1 hunks)
  • docs/source/performance/perf-benchmarking.md (0 hunks)
  • docs/source/reference/precision.md (1 hunks)
  • docs/source/torch.md (1 hunks)
  • examples/auto_deploy/README.md (2 hunks)
  • examples/disaggregated/README.md (1 hunks)
  • examples/eagle/README.md (0 hunks)
  • examples/models/core/deepseek_v3/README.md (2 hunks)
  • examples/models/core/llama/README.md (2 hunks)
  • examples/sample_weight_stripping/README.md (2 hunks)
💤 Files with no reviewable changes (3)
  • examples/eagle/README.md
  • docs/source/advanced/gpt-attention.md
  • docs/source/performance/perf-benchmarking.md
✅ Files skipped from review due to trivial changes (9)
  • docs/source/advanced/disaggregated-service.md
  • examples/disaggregated/README.md
  • examples/auto_deploy/README.md
  • docs/source/reference/precision.md
  • README.md
  • examples/models/core/deepseek_v3/README.md
  • examples/sample_weight_stripping/README.md
  • docs/source/architecture/model-weights-loader.md
  • docs/source/torch.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/models/core/llama/README.md
🧰 Additional context used
🧠 Learnings (1)
docs/source/advanced/speculative-decoding.md (2)

Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.598Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

🪛 LanguageTool
docs/source/advanced/speculative-decoding.md

[style] ~171-~171: This phrase is redundant. Consider using “inside”.
Context: ...nd draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2...

(OUTSIDE_OF)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@nv-guomingz force-pushed the user/guomingz/clean_experimental branch 2 times, most recently from b569864 to 372fd83 on July 29, 2025 03:11
@nv-guomingz
Collaborator (Author)

/bot run --stage-list "A10-Build_Docs"

Contributor

@coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
docs/source/advanced/speculative-decoding.md (1)

171-171: Polish grammar & spacing for clarity

Minor wording and punctuation tweaks improve readability:

-Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft tokens acceptance and draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2 are both supported). Please, visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
+Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logit prediction, draft-token acceptance, and draft-token generation are performed inside the TensorRT engine (EAGLE-1 and EAGLE-2 are both supported). Please visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.

Key fixes:
• “inside the” instead of “inside of the”.
• Add space before the parenthesis.
• Oxford comma after “acceptance”.
• Singular “logit” and hyphenate compound nouns.
• Remove superfluous comma after “Please”.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7e67143 and 372fd83.

📒 Files selected for processing (14)
  • README.md (1 hunks)
  • docs/source/advanced/disaggregated-service.md (1 hunks)
  • docs/source/advanced/gpt-attention.md (0 hunks)
  • docs/source/advanced/speculative-decoding.md (1 hunks)
  • docs/source/architecture/model-weights-loader.md (1 hunks)
  • docs/source/performance/perf-benchmarking.md (0 hunks)
  • docs/source/reference/precision.md (1 hunks)
  • docs/source/torch.md (1 hunks)
  • examples/auto_deploy/README.md (2 hunks)
  • examples/disaggregated/README.md (1 hunks)
  • examples/eagle/README.md (0 hunks)
  • examples/models/core/deepseek_v3/README.md (2 hunks)
  • examples/models/core/llama/README.md (2 hunks)
  • examples/sample_weight_stripping/README.md (2 hunks)
💤 Files with no reviewable changes (3)
  • docs/source/advanced/gpt-attention.md
  • examples/eagle/README.md
  • docs/source/performance/perf-benchmarking.md
✅ Files skipped from review due to trivial changes (8)
  • README.md
  • docs/source/advanced/disaggregated-service.md
  • examples/sample_weight_stripping/README.md
  • docs/source/architecture/model-weights-loader.md
  • examples/disaggregated/README.md
  • docs/source/reference/precision.md
  • examples/auto_deploy/README.md
  • examples/models/core/deepseek_v3/README.md
🚧 Files skipped from review as they are similar to previous changes (2)
  • examples/models/core/llama/README.md
  • docs/source/torch.md
🧰 Additional context used
🧠 Learnings (1)
docs/source/advanced/speculative-decoding.md (2)

Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

🪛 LanguageTool
docs/source/advanced/speculative-decoding.md

[style] ~171-~171: This phrase is redundant. Consider using “inside”.
Context: ...nd draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2...

(OUTSIDE_OF)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@tensorrt-cicd
Collaborator

PR_Github #13281 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #13281 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #9920 (Partly Tested) completed with status: 'FAILURE'

@nv-guomingz force-pushed the user/guomingz/clean_experimental branch from 372fd83 to f0fe05c on August 6, 2025 05:29
@nv-guomingz requested a review from a team as a code owner on August 6, 2025 05:29
Contributor

@coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
docs/source/advanced/speculative-decoding.md (1)

171-171: Drop “of” after “inside” and fix missing space before parenthesis

Small wording/formatting tweaks improve readability.

-... performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2 are both supported).
+... performed inside the TensorRT engine (both EAGLE-1 and EAGLE-2 are supported).
examples/disaggregated/README.md (1)

112-116: Fix typo in YAML key refresh_interval

refersh_interval is misspelled. Anyone copying this sample will hit a configuration error.

-  refersh_interval: 10.0
+  refresh_interval: 10.0
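For readers copying from the sample, a corrected fragment might look like the sketch below. The surrounding key name is hypothetical (the review comment only shows the `refresh_interval` line itself), so check the actual file for the real context.

```yaml
# Hypothetical context; only the corrected refresh_interval key is
# taken from the review comment above.
dynamic_scaling:
  refresh_interval: 10.0   # note the spelling: refresh, not refersh
```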
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 372fd83 and f0fe05c.

📒 Files selected for processing (14)
  • README.md (1 hunks)
  • docs/source/advanced/disaggregated-service.md (1 hunks)
  • docs/source/advanced/gpt-attention.md (0 hunks)
  • docs/source/advanced/speculative-decoding.md (1 hunks)
  • docs/source/architecture/model-weights-loader.md (1 hunks)
  • docs/source/performance/perf-benchmarking.md (0 hunks)
  • docs/source/reference/precision.md (1 hunks)
  • docs/source/torch.md (1 hunks)
  • examples/auto_deploy/README.md (2 hunks)
  • examples/disaggregated/README.md (1 hunks)
  • examples/eagle/README.md (0 hunks)
  • examples/models/core/deepseek_v3/README.md (2 hunks)
  • examples/models/core/llama/README.md (2 hunks)
  • examples/sample_weight_stripping/README.md (2 hunks)
💤 Files with no reviewable changes (3)
  • docs/source/advanced/gpt-attention.md
  • examples/eagle/README.md
  • docs/source/performance/perf-benchmarking.md
✅ Files skipped from review due to trivial changes (7)
  • docs/source/advanced/disaggregated-service.md
  • examples/auto_deploy/README.md
  • examples/models/core/deepseek_v3/README.md
  • README.md
  • docs/source/reference/precision.md
  • docs/source/torch.md
  • docs/source/architecture/model-weights-loader.md
🚧 Files skipped from review as they are similar to previous changes (2)
  • examples/sample_weight_stripping/README.md
  • examples/models/core/llama/README.md
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • docs/source/advanced/speculative-decoding.md
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Applied to files:

  • docs/source/advanced/speculative-decoding.md
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • docs/source/advanced/speculative-decoding.md
🪛 LanguageTool
docs/source/advanced/speculative-decoding.md

[style] ~171-~171: This phrase is redundant. Consider using “inside”.
Context: ...nd draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2...

(OUTSIDE_OF)

🪛 markdownlint-cli2 (0.17.2)
examples/disaggregated/README.md

86-86: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


86-86: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
examples/disaggregated/README.md (1)

86-86: Status label update looks good
The heading change from “Experimental” to “Prototype” accurately reflects the new maturity stage and keeps terminology consistent across the docs.

@nv-guomingz force-pushed the user/guomingz/clean_experimental branch from f0fe05c to 3aa3446 on August 6, 2025 16:18
@nv-guomingz
Copy link
CollaboratorAuthor

/bot skip --comment "docs only change"

@nv-guomingz enabled auto-merge (squash) on August 6, 2025 16:18
Copy link
Contributor

@coderabbitai[bot] left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
docs/source/advanced/speculative-decoding.md (1)

171-171: Tighten grammar & spacing for clarity

Minor wording polish:
• “inside of” → “inside” (redundant “of”).
• Insert Oxford comma after “acceptance”.
• Add space before the opening parenthesis.
• Drop comma after “Please”.

-Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft tokens acceptance and draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2 are both supported). Please, visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
+Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft token acceptance, and draft token generation are performed inside the TensorRT engine (EAGLE-1 and EAGLE-2 are both supported). Please visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
examples/disaggregated/README.md (2)

110-116: Fix typo in key name – refresh_interval

refersh_interval will confuse users who copy-paste the YAML and may break config loaders that validate keys.

-  refersh_interval: 10.0
+  refresh_interval: 10.0

181-183: Correct section title – "Known Issues"

Minor wording nit:

-## Know Issues
+## Known Issues

This keeps terminology consistent across the docs.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f0fe05c and 3aa3446.

📒 Files selected for processing (14)
  • README.md (1 hunks)
  • docs/source/advanced/disaggregated-service.md (1 hunks)
  • docs/source/advanced/gpt-attention.md (0 hunks)
  • docs/source/advanced/speculative-decoding.md (1 hunks)
  • docs/source/architecture/model-weights-loader.md (1 hunks)
  • docs/source/performance/perf-benchmarking.md (0 hunks)
  • docs/source/reference/precision.md (1 hunks)
  • docs/source/torch.md (1 hunks)
  • examples/auto_deploy/README.md (2 hunks)
  • examples/disaggregated/README.md (1 hunks)
  • examples/eagle/README.md (0 hunks)
  • examples/models/core/deepseek_v3/README.md (2 hunks)
  • examples/models/core/llama/README.md (2 hunks)
  • examples/sample_weight_stripping/README.md (2 hunks)
💤 Files with no reviewable changes (3)
  • docs/source/advanced/gpt-attention.md
  • examples/eagle/README.md
  • docs/source/performance/perf-benchmarking.md
✅ Files skipped from review due to trivial changes (7)
  • docs/source/architecture/model-weights-loader.md
  • README.md
  • examples/models/core/deepseek_v3/README.md
  • docs/source/torch.md
  • docs/source/advanced/disaggregated-service.md
  • examples/auto_deploy/README.md
  • docs/source/reference/precision.md
🚧 Files skipped from review as they are similar to previous changes (2)
  • examples/sample_weight_stripping/README.md
  • examples/models/core/llama/README.md
🔇 Additional comments (1)
examples/disaggregated/README.md (1)

86-86: Verify downstream links & anchor IDs after heading rename

Changing the heading from "Dynamic scaling (Experimental)" to "Dynamic scaling (Prototype)" also changes the automatic Markdown anchor (GitHub → #dynamic-scaling-prototype).
Please grep the repo (docs, blogs, notebooks) for #dynamic-scaling-experimental and update any in-page links or TOC generators that still point to the old anchor.
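The repo-wide grep suggested above can be scripted; the sketch below is illustrative (the anchor string and the `*.md` scope mirror this review's suggestion, but the helper name and return shape are assumptions, not part of the PR):

```python
from pathlib import Path

OLD_ANCHOR = "#dynamic-scaling-experimental"  # stale anchor from the renamed heading

def find_stale_links(root, anchor=OLD_ANCHOR):
    """Return (path, line number, line) for every Markdown line still using the stale anchor."""
    hits = []
    for path in Path(root).rglob("*.md"):
        for lineno, line in enumerate(path.read_text(encoding="utf-8").splitlines(), 1):
            if anchor in line:
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Running it from the repo root would list every doc, blog, or notebook-exported Markdown file that needs its link updated to `#dynamic-scaling-prototype`.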

@tensorrt-cicd
Copy link
Collaborator

PR_Github #14318 [ skip ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #14318 [ skip ] completed with state SUCCESS
Skipping testing for commit 3aa3446

@nv-guomingz changed the title from "doc: remove the outdated features which marked as Experimental" to "[None][doc]: remove the outdated features which marked as Experimental" on Aug 7, 2025
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
@nv-guomingz force-pushed the user/guomingz/clean_experimental branch from 3aa3446 to 605a8c6 on August 7, 2025 01:40
@nv-guomingz
Copy link
CollaboratorAuthor

/bot skip --comment "docs only change"

Copy link
Contributor

@coderabbitai[bot] left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
docs/source/advanced/speculative-decoding.md (1)

171-171: Remove redundant "of" and clean up punctuation

Minor wording and punctuation polish for clarity and consistency.

-Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft tokens acceptance and draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2 are both supported). Please, visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
+Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft token acceptance, and draft token generation are performed inside the TensorRT engine (both EAGLE-1 and EAGLE-2 are supported). Please visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3aa3446 and 605a8c6.

📒 Files selected for processing (14)
  • README.md (1 hunks)
  • docs/source/advanced/disaggregated-service.md (1 hunks)
  • docs/source/advanced/gpt-attention.md (0 hunks)
  • docs/source/advanced/speculative-decoding.md (1 hunks)
  • docs/source/architecture/model-weights-loader.md (1 hunks)
  • docs/source/performance/perf-benchmarking.md (0 hunks)
  • docs/source/reference/precision.md (1 hunks)
  • docs/source/torch.md (1 hunks)
  • examples/auto_deploy/README.md (2 hunks)
  • examples/disaggregated/README.md (1 hunks)
  • examples/eagle/README.md (0 hunks)
  • examples/models/core/deepseek_v3/README.md (2 hunks)
  • examples/models/core/llama/README.md (2 hunks)
  • examples/sample_weight_stripping/README.md (2 hunks)
💤 Files with no reviewable changes (3)
  • docs/source/advanced/gpt-attention.md
  • examples/eagle/README.md
  • docs/source/performance/perf-benchmarking.md
✅ Files skipped from review due to trivial changes (7)
  • examples/disaggregated/README.md
  • docs/source/advanced/disaggregated-service.md
  • examples/models/core/deepseek_v3/README.md
  • README.md
  • docs/source/architecture/model-weights-loader.md
  • examples/auto_deploy/README.md
  • docs/source/reference/precision.md
🚧 Files skipped from review as they are similar to previous changes (3)
  • examples/sample_weight_stripping/README.md
  • docs/source/torch.md
  • examples/models/core/llama/README.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@tensorrt-cicd
Copy link
Collaborator

PR_Github #14351 [ skip ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #14351 [ skip ] completed with state SUCCESS
Skipping testing for commit 605a8c6

@nv-guomingz merged commit f7f46a5 into NVIDIA:main on Aug 7, 2025
3 of 4 checks passed
nv-guomingz added a commit to nv-guomingz/TensorRT-LLM that referenced this pull request on Aug 7, 2025
…A#5995)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
@nv-guomingz deleted the user/guomingz/clean_experimental branch on September 30, 2025 07:46

Reviewers

@FrankD412 left review comments

@coderabbitai[bot] left review comments

@laikhtewari approved these changes

@yweng0828 approved these changes

@Shixiaowei02 approved these changes

@Barry-Delaney approved these changes

@lowsfer: awaiting requested review

@QiJune: awaiting requested review

@lucaslie: awaiting requested review (code owner automatically assigned from NVIDIA/trtllm-bench-reviewers)

@kaiyux: awaiting requested review

@schetlur-nv: awaiting requested review

@zhuolingwang: awaiting requested review

@litaotju: awaiting requested review

@yizhang-nv: awaiting requested review

+1 more reviewer

@Njuapp left review comments

Reviewers whose approvals may not affect merge requirements

Assignees

No one assigned

Labels

None yet

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

8 participants

@nv-guomingz @tensorrt-cicd @FrankD412 @laikhtewari @Njuapp @yweng0828 @Shixiaowei02 @Barry-Delaney
