[None][doc]: remove the outdated features which marked as Experimental #5995


Conversation

@nv-guomingz
Collaborator

@nv-guomingz commented on Jul 14, 2025
edited by the coderabbitai bot

Clean up the documentation by removing the experimental label:

  • PyTorch Backend (Experimental --> Beta)
  • Disagg-serving (Experimental --> Prototype)
  • AutoDeploy (Experimental --> Prototype)
  • Use tensorrtllm_backend for triton inference server (Experimental --> Prototype)

Summary by CodeRabbit

  • Documentation
    • Removed references to features and techniques being "experimental" or subject to change across multiple documentation pages and READMEs.
    • Clarified default behavior and support contexts for specific features in the documentation.
    • Updated explanations and recommendations for FP8 GEMV/GEMM plugin usage, providing more detail and clearer guidance.
    • Simplified or removed descriptions of deprecated or experimental build modes and configuration options.
    • Updated feature status descriptions from "experimental" to "prototype" or "beta" in various documentation and example READMEs.

@nv-guomingz force-pushed the user/guomingz/clean_experimental branch from 107dbb3 to a808cc8 on July 14, 2025 08:39
@nv-guomingz requested a review from a team as a code owner on July 14, 2025 08:39
@nv-guomingz force-pushed the user/guomingz/clean_experimental branch from a808cc8 to d69b27e on July 14, 2025 08:48
@nv-guomingz force-pushed the user/guomingz/clean_experimental branch from d69b27e to 909bcb1 on July 14, 2025 08:56
@nv-guomingz force-pushed the user/guomingz/clean_experimental branch from 909bcb1 to e3f1e8c on July 14, 2025 11:37
@nv-guomingz force-pushed the user/guomingz/clean_experimental branch 2 times, most recently from cc18db1 to c6a80d1 on July 14, 2025 14:08
@nv-guomingz force-pushed the user/guomingz/clean_experimental branch from c6a80d1 to ea7d44c on July 28, 2025 17:02
@coderabbitai
Contributor

coderabbitai bot commented on Jul 28, 2025
edited

📝 Walkthrough

Walkthrough

This update modifies documentation files to remove or reword references to "experimental" status for several features, clarify default behaviors, and update technical explanations. No changes to code or public interfaces are present; all modifications are limited to documentation content and README files.

Changes

Cohort / File(s) | Change Summary
Experimental Status Removal (General)
docs/source/advanced/gpt-attention.md, docs/source/torch.md, examples/eagle/README.md, docs/source/reference/precision.md, README.md, docs/source/advanced/disaggregated-service.md, examples/auto_deploy/README.md, examples/disaggregated/README.md, examples/models/core/deepseek_v3/README.md, examples/sample_weight_stripping/README.md
Removed or replaced references to features being "experimental" with "prototype" or "beta" status for XQA optimization, PyTorch backend, EAGLE-2, quantization examples, AutoDeploy backend, disaggregated service, dynamic scaling, tensorrtllm_backend for Triton, and sample weight stripping. No functional changes made.
Speculative Decoding Documentation
docs/source/advanced/speculative-decoding.md
Reworded the description of EAGLE speculative decoding to consolidate EAGLE-1 and EAGLE-2 support mentions, removing the explicit note about EAGLE-2's experimental status.
Performance Benchmarking Documentation
docs/source/performance/perf-benchmarking.md
Removed the section describing the experimental mode for building TensorRT-LLM engines with target ISL/OSL values, including example commands and explanations.
Model Weights Loader Clarification
docs/source/architecture/model-weights-loader.md
Clarified that the weights loader is enabled by default for LLaMA and Qwen models only when using the TensorRT flow, specifying the context more precisely.
FP8 Plugin Documentation Update
examples/models/core/llama/README.md
Updated the explanation of FP8 GEMV/GEMM plugin usage: replaced "Experimental" with "Note," provided a more detailed technical explanation of FP8 gemv, and removed the warning about performance degradation for larger batch sizes.

Sequence Diagram(s)

No sequence diagrams are generated, as all changes are limited to documentation and do not affect control flow or feature implementation.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Suggested labels

Documentation

Suggested reviewers

  • litaotju
  • syuoni

Note

⚡️ Unit Test Generation is now available in beta!

Learn more here, or try it out under "Finishing Touches" below.

✨ Finishing Touches
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai or @coderabbitai title anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Contributor

@coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (5)
docs/source/torch.md (1)

4-4: Re-phrase for a smoother reading flow

“launches a new backend” sounds like a one-off event. “introduces” (or “adds”) better reflects the documentation’s timeless nature.

-To enhance the usability of the system and improve developer efficiency, TensorRT-LLM launches a new backend based on PyTorch.
+To enhance usability and developer efficiency, TensorRT-LLM introduces a new backend based on PyTorch.
docs/source/advanced/speculative-decoding.md (1)

171-171: Minor grammar & spacing tidy-up

Remove the redundant “of”, add the missing space, and swap the en-dash for a hyphen to stay consistent.

-Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft tokens acceptance and draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2 are both supported).
+Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model so that logits prediction, draft-token acceptance, and draft-token generation are performed inside the TensorRT engine (EAGLE-1 and EAGLE-2 are both supported).
examples/models/core/llama/README.md (2)

679-679: Capitalise sentence start & tighten wording

-Note: use FP8 GEMV to optimize performance in FP8 small-batch-size cases.
+Note: Use FP8 GEMV to optimise performance in small-batch-size FP8 scenarios.

697-697: Polish long explanatory note for readability

A few micro-fixes improve clarity:

-**Note**: FP8 gemv plugin uses CUDA cores to compute, by contrast to Tensor Core gemm kernel within cuBLAS. Over last year, as cuBLAS have improved their performance by a lot under small M case for Hopper(sm90), FP8 gemv kernel may or may not surpass cuBLAS, depending on specific gemm problem shape. Nonetheless, we still strongly recommend FP8 gemv kernel for Ada (sm89) as cuBLAS still falls behind gemv on it.
+**Note**: The FP8 GEMV plugin runs on CUDA cores, whereas cuBLAS uses Tensor-Core GEMM kernels. Over the last year cuBLAS performance for small-M cases on Hopper (SM90) has improved substantially, so FP8 GEMV may or may not outperform cuBLAS depending on the exact GEMM shape. We still strongly recommend FP8 GEMV on Ada (SM89), where cuBLAS continues to lag behind.
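To make the "small-M" point concrete, here is an illustrative sketch in plain NumPy (not TensorRT-LLM, cuBLAS, or plugin code): when the batch dimension M shrinks to 1, a GEMM is mathematically exactly a GEMV, which is the memory-bound shape the FP8 GEMV plugin targets.

```python
import numpy as np

# Illustrative only: show that a GEMM with M == 1 is exactly a GEMV.
# K is the reduction dimension, N the output width; sizes are arbitrary.
K, N = 512, 256
rng = np.random.default_rng(0)
B = rng.standard_normal((K, N)).astype(np.float32)  # weight matrix (K x N)
a = rng.standard_normal(K).astype(np.float32)       # single activation row

gemv_out = a @ B                  # matrix-vector product (GEMV)
gemm_out = a.reshape(1, K) @ B    # the same computation phrased as a 1xK GEMM

# Both kernels compute the same result; the difference is purely which
# hardware path (CUDA cores vs Tensor Cores) serves the shape best.
assert np.allclose(gemv_out, gemm_out.reshape(-1), atol=1e-4)
```

For M larger than 1 the computation regains its GEMM character, which is why the note above hedges on whether the GEMV kernel still wins on Hopper.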
docs/source/architecture/model-weights-loader.md (1)

252-252: Clarify "by default" clause

The current wording is slightly ambiguous about when the loader is active.

-The weights loader is enabled for LLaMA family models and Qwen models by default with TensorRT flow only.
+By default, the weights loader is enabled for LLaMA-family and Qwen models when using the TensorRT flow.
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 60e4d3a and ea7d44c.

📒 Files selected for processing (8)
  • docs/source/advanced/gpt-attention.md (0 hunks)
  • docs/source/advanced/speculative-decoding.md (1 hunks)
  • docs/source/architecture/model-weights-loader.md (1 hunks)
  • docs/source/performance/perf-benchmarking.md (0 hunks)
  • docs/source/reference/precision.md (1 hunks)
  • docs/source/torch.md (1 hunks)
  • examples/eagle/README.md (0 hunks)
  • examples/models/core/llama/README.md (2 hunks)
💤 Files with no reviewable changes (3)
  • examples/eagle/README.md
  • docs/source/advanced/gpt-attention.md
  • docs/source/performance/perf-benchmarking.md
🧰 Additional context used
🧠 Learnings (2)
docs/source/architecture/model-weights-loader.md (2)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

docs/source/advanced/speculative-decoding.md (1)

Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

🪛 LanguageTool
docs/source/advanced/speculative-decoding.md

[style] ~171-~171: This phrase is redundant. Consider using “inside”.
Context: ...nd draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2...

(OUTSIDE_OF)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@nv-guomingz force-pushed the user/guomingz/clean_experimental branch from ea7d44c to daa23b3 on July 28, 2025 17:14
Contributor

@coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
docs/source/advanced/speculative-decoding.md (1)

171-171: Tighten wording and fix minor grammar issues

The sentence contains a few stylistic hiccups:
• “inside of” → “inside”
• missing space before the parenthesis after “engine”
• plural-singular mismatch in “draft tokens acceptance”
• superfluous comma after “Please”

Proposed tweak:

-Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft tokens acceptance and draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2 are both supported). Please, visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
+Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft token acceptance, and draft token generation are performed inside the TensorRT engine (EAGLE-1 and EAGLE-2 are both supported). Please visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ea7d44c and daa23b3.

📒 Files selected for processing (8)
  • docs/source/advanced/gpt-attention.md (0 hunks)
  • docs/source/advanced/speculative-decoding.md (1 hunks)
  • docs/source/architecture/model-weights-loader.md (1 hunks)
  • docs/source/performance/perf-benchmarking.md (0 hunks)
  • docs/source/reference/precision.md (1 hunks)
  • docs/source/torch.md (1 hunks)
  • examples/eagle/README.md (0 hunks)
  • examples/models/core/llama/README.md (2 hunks)
💤 Files with no reviewable changes (3)
  • docs/source/advanced/gpt-attention.md
  • docs/source/performance/perf-benchmarking.md
  • examples/eagle/README.md
✅ Files skipped from review due to trivial changes (3)
  • docs/source/architecture/model-weights-loader.md
  • docs/source/reference/precision.md
  • docs/source/torch.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/models/core/llama/README.md
🧰 Additional context used
🧠 Learnings (1)
docs/source/advanced/speculative-decoding.md (2)

Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.598Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

🪛 LanguageTool
docs/source/advanced/speculative-decoding.md

[style] ~171-~171: This phrase is redundant. Consider using “inside”.
Context: ...nd draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2...

(OUTSIDE_OF)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@nv-guomingz force-pushed the user/guomingz/clean_experimental branch from daa23b3 to 7e67143 on July 28, 2025 17:22
Contributor

@coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
docs/source/advanced/speculative-decoding.md (2)

171-171: Tighten wording and fix minor formatting issues

The phrase “inside of the TensorRT engine” is redundant, there’s a missing space before the parenthesis, and “draft tokens acceptance” should be singular.

-Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft tokens acceptance and draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2 are both supported). Please, visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
+Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft-token acceptance, and draft-token generation are performed inside the TensorRT engine (EAGLE-1 and EAGLE-2 are both supported). For details on building and running the model, see the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md).

169-170: Correct typo and clarify sentences

“asssembled” has an extra “s”, and both sentences read more naturally with slight re-phrasing.

-In the EAGLE-1 decoding tree needs to be known during the decoding. In the EAGLE-2 this tree is asssembled during the execution by searching for the most probable hypothesis along the beam.
+In EAGLE-1, the decoding tree must be defined before decoding begins. In EAGLE-2, this tree is assembled during execution by searching for the most probable hypothesis along the beam.
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between daa23b3 and 7e67143.

📒 Files selected for processing (14)
  • README.md (1 hunks)
  • docs/source/advanced/disaggregated-service.md (1 hunks)
  • docs/source/advanced/gpt-attention.md (0 hunks)
  • docs/source/advanced/speculative-decoding.md (1 hunks)
  • docs/source/architecture/model-weights-loader.md (1 hunks)
  • docs/source/performance/perf-benchmarking.md (0 hunks)
  • docs/source/reference/precision.md (1 hunks)
  • docs/source/torch.md (1 hunks)
  • examples/auto_deploy/README.md (2 hunks)
  • examples/disaggregated/README.md (1 hunks)
  • examples/eagle/README.md (0 hunks)
  • examples/models/core/deepseek_v3/README.md (2 hunks)
  • examples/models/core/llama/README.md (2 hunks)
  • examples/sample_weight_stripping/README.md (2 hunks)
💤 Files with no reviewable changes (3)
  • examples/eagle/README.md
  • docs/source/advanced/gpt-attention.md
  • docs/source/performance/perf-benchmarking.md
✅ Files skipped from review due to trivial changes (9)
  • docs/source/advanced/disaggregated-service.md
  • examples/disaggregated/README.md
  • examples/auto_deploy/README.md
  • docs/source/reference/precision.md
  • README.md
  • examples/models/core/deepseek_v3/README.md
  • examples/sample_weight_stripping/README.md
  • docs/source/architecture/model-weights-loader.md
  • docs/source/torch.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/models/core/llama/README.md
🧰 Additional context used
🧠 Learnings (1)
docs/source/advanced/speculative-decoding.md (2)

Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.598Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

🪛 LanguageTool
docs/source/advanced/speculative-decoding.md

[style] ~171-~171: This phrase is redundant. Consider using “inside”.
Context: ...nd draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2...

(OUTSIDE_OF)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@nv-guomingz force-pushed the user/guomingz/clean_experimental branch 2 times, most recently from b569864 to 372fd83 on July 29, 2025 03:11
@nv-guomingz
Collaborator (Author)

/bot run --stage-list "A10-Build_Docs"

Contributor

@coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
docs/source/advanced/speculative-decoding.md (1)

171-171: Polish grammar & spacing for clarity

Minor wording and punctuation tweaks improve readability:

-Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft tokens acceptance and draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2 are both supported). Please, visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
+Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logit prediction, draft-token acceptance, and draft-token generation are performed inside the TensorRT engine (EAGLE-1 and EAGLE-2 are both supported). Please visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.

Key fixes:
• “inside the” instead of “inside of the”.
• Add space before the parenthesis.
• Oxford comma after “acceptance”.
• Singular “logit” and hyphenate compound nouns.
• Remove superfluous comma after “Please”.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7e67143 and 372fd83.

📒 Files selected for processing (14)
  • README.md (1 hunks)
  • docs/source/advanced/disaggregated-service.md (1 hunks)
  • docs/source/advanced/gpt-attention.md (0 hunks)
  • docs/source/advanced/speculative-decoding.md (1 hunks)
  • docs/source/architecture/model-weights-loader.md (1 hunks)
  • docs/source/performance/perf-benchmarking.md (0 hunks)
  • docs/source/reference/precision.md (1 hunks)
  • docs/source/torch.md (1 hunks)
  • examples/auto_deploy/README.md (2 hunks)
  • examples/disaggregated/README.md (1 hunks)
  • examples/eagle/README.md (0 hunks)
  • examples/models/core/deepseek_v3/README.md (2 hunks)
  • examples/models/core/llama/README.md (2 hunks)
  • examples/sample_weight_stripping/README.md (2 hunks)
💤 Files with no reviewable changes (3)
  • docs/source/advanced/gpt-attention.md
  • examples/eagle/README.md
  • docs/source/performance/perf-benchmarking.md
✅ Files skipped from review due to trivial changes (8)
  • README.md
  • docs/source/advanced/disaggregated-service.md
  • examples/sample_weight_stripping/README.md
  • docs/source/architecture/model-weights-loader.md
  • examples/disaggregated/README.md
  • docs/source/reference/precision.md
  • examples/auto_deploy/README.md
  • examples/models/core/deepseek_v3/README.md
🚧 Files skipped from review as they are similar to previous changes (2)
  • examples/models/core/llama/README.md
  • docs/source/torch.md
🧰 Additional context used
🧠 Learnings (1)
docs/source/advanced/speculative-decoding.md (2)

Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

🪛 LanguageTool
docs/source/advanced/speculative-decoding.md

[style] ~171-~171: This phrase is redundant. Consider using “inside”.
Context: ...nd draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2...

(OUTSIDE_OF)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@tensorrt-cicd
Collaborator

PR_Github #13281 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #13281 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #9920 (Partly Tested) completed with status: 'FAILURE'

@nv-guomingz force-pushed the user/guomingz/clean_experimental branch from 372fd83 to f0fe05c on August 6, 2025 05:29
@nv-guomingz requested a review from a team as a code owner on August 6, 2025 05:29
Contributor

@coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
docs/source/advanced/speculative-decoding.md (1)

171-171: Drop “of” after “inside” and fix missing space before parenthesis

Small wording/formatting tweaks improve readability.

-... performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2 are both supported).
+... performed inside the TensorRT engine (both EAGLE-1 and EAGLE-2 are supported).
examples/disaggregated/README.md (1)

112-116: Fix typo in YAML key refresh_interval

refersh_interval is misspelled. Anyone copying this sample will hit a configuration error.

-  refersh_interval: 10.0
+  refresh_interval: 10.0
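For readers copying from the sample, a corrected fragment might look like the sketch below. The surrounding key name is hypothetical (the review comment only shows the `refresh_interval` line itself), so check the actual file for the real context.

```yaml
# Hypothetical context; only the corrected refresh_interval key is
# taken from the review comment above.
dynamic_scaling:
  refresh_interval: 10.0   # note the spelling: refresh, not refersh
```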
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 372fd83 and f0fe05c.

📒 Files selected for processing (14)
  • README.md (1 hunks)
  • docs/source/advanced/disaggregated-service.md (1 hunks)
  • docs/source/advanced/gpt-attention.md (0 hunks)
  • docs/source/advanced/speculative-decoding.md (1 hunks)
  • docs/source/architecture/model-weights-loader.md (1 hunks)
  • docs/source/performance/perf-benchmarking.md (0 hunks)
  • docs/source/reference/precision.md (1 hunks)
  • docs/source/torch.md (1 hunks)
  • examples/auto_deploy/README.md (2 hunks)
  • examples/disaggregated/README.md (1 hunks)
  • examples/eagle/README.md (0 hunks)
  • examples/models/core/deepseek_v3/README.md (2 hunks)
  • examples/models/core/llama/README.md (2 hunks)
  • examples/sample_weight_stripping/README.md (2 hunks)
💤 Files with no reviewable changes (3)
  • docs/source/advanced/gpt-attention.md
  • examples/eagle/README.md
  • docs/source/performance/perf-benchmarking.md
✅ Files skipped from review due to trivial changes (7)
  • docs/source/advanced/disaggregated-service.md
  • examples/auto_deploy/README.md
  • examples/models/core/deepseek_v3/README.md
  • README.md
  • docs/source/reference/precision.md
  • docs/source/torch.md
  • docs/source/architecture/model-weights-loader.md
🚧 Files skipped from review as they are similar to previous changes (2)
  • examples/sample_weight_stripping/README.md
  • examples/models/core/llama/README.md
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • docs/source/advanced/speculative-decoding.md
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Applied to files:

  • docs/source/advanced/speculative-decoding.md
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • docs/source/advanced/speculative-decoding.md
🪛 LanguageTool
docs/source/advanced/speculative-decoding.md

[style] ~171-~171: This phrase is redundant. Consider using “inside”.
Context: ...nd draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2...

(OUTSIDE_OF)

🪛 markdownlint-cli2 (0.17.2)
examples/disaggregated/README.md

86-86: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


86-86: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
examples/disaggregated/README.md (1)

86-86: Status label update looks good
The heading change from “Experimental” to “Prototype” accurately reflects the new maturity stage and keeps terminology consistent across the docs.

@nv-guomingz force-pushed the user/guomingz/clean_experimental branch from f0fe05c to 3aa3446 on August 6, 2025 16:18
@nv-guomingz
Copy link
CollaboratorAuthor

/bot skip --comment "docs only change"

@nv-guomingz enabled auto-merge (squash) on August 6, 2025 16:18
Copy link
Contributor

@coderabbitai[bot] left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
docs/source/advanced/speculative-decoding.md (1)

171-171: Tighten grammar & spacing for clarity

Minor wording polish:
• “inside of” → “inside” (redundant “of”).
• Insert Oxford comma after “acceptance”.
• Add space before the opening parenthesis.
• Drop comma after “Please”.

-Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft tokens acceptance and draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2 are both supported). Please, visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
+Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft token acceptance, and draft token generation are performed inside the TensorRT engine (EAGLE-1 and EAGLE-2 are both supported). Please visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
examples/disaggregated/README.md (2)

110-116: Fix typo in key name – refresh_interval

refersh_interval will confuse users who copy-paste the YAML and may break config loaders that validate keys.

-  refersh_interval: 10.0
+  refresh_interval: 10.0

181-183: Correct section title – "Known Issues"

Minor wording nit:

-## Know Issues
+## Known Issues

This keeps terminology consistent across the docs.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f0fe05c and 3aa3446.

📒 Files selected for processing (14)
  • README.md (1 hunks)
  • docs/source/advanced/disaggregated-service.md (1 hunks)
  • docs/source/advanced/gpt-attention.md (0 hunks)
  • docs/source/advanced/speculative-decoding.md (1 hunks)
  • docs/source/architecture/model-weights-loader.md (1 hunks)
  • docs/source/performance/perf-benchmarking.md (0 hunks)
  • docs/source/reference/precision.md (1 hunks)
  • docs/source/torch.md (1 hunks)
  • examples/auto_deploy/README.md (2 hunks)
  • examples/disaggregated/README.md (1 hunks)
  • examples/eagle/README.md (0 hunks)
  • examples/models/core/deepseek_v3/README.md (2 hunks)
  • examples/models/core/llama/README.md (2 hunks)
  • examples/sample_weight_stripping/README.md (2 hunks)
💤 Files with no reviewable changes (3)
  • docs/source/advanced/gpt-attention.md
  • examples/eagle/README.md
  • docs/source/performance/perf-benchmarking.md
✅ Files skipped from review due to trivial changes (7)
  • docs/source/architecture/model-weights-loader.md
  • README.md
  • examples/models/core/deepseek_v3/README.md
  • docs/source/torch.md
  • docs/source/advanced/disaggregated-service.md
  • examples/auto_deploy/README.md
  • docs/source/reference/precision.md
🚧 Files skipped from review as they are similar to previous changes (2)
  • examples/sample_weight_stripping/README.md
  • examples/models/core/llama/README.md
🔇 Additional comments (1)
examples/disaggregated/README.md (1)

86-86: Verify downstream links & anchor IDs after heading rename

Changing the heading from "Dynamic scaling (Experimental)" to "Dynamic scaling (Prototype)" also changes the automatic Markdown anchor (GitHub → #dynamic-scaling-prototype).
Please grep the repo (docs, blogs, notebooks) for #dynamic-scaling-experimental and update any in-page links or TOC generators that still point to the old anchor.
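The repo-wide grep suggested above can be scripted; the sketch below is illustrative (the anchor string and the `*.md` scope mirror this review's suggestion, but the helper name and return shape are assumptions, not part of the PR):

```python
from pathlib import Path

OLD_ANCHOR = "#dynamic-scaling-experimental"  # stale anchor from the renamed heading

def find_stale_links(root, anchor=OLD_ANCHOR):
    """Return (path, line number, line) for every Markdown line still using the stale anchor."""
    hits = []
    for path in Path(root).rglob("*.md"):
        for lineno, line in enumerate(path.read_text(encoding="utf-8").splitlines(), 1):
            if anchor in line:
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Running it from the repo root would list every doc, blog, or notebook-exported Markdown file that needs its link updated to `#dynamic-scaling-prototype`.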

@tensorrt-cicd
Copy link
Collaborator

PR_Github #14318 [ skip ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #14318 [ skip ] completed with state SUCCESS
Skipping testing for commit 3aa3446

@nv-guomingz changed the title from "doc: remove the outdated features which marked as Experimental" to "[None][doc]: remove the outdated features which marked as Experimental" on Aug 7, 2025
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
@nv-guomingz force-pushed the user/guomingz/clean_experimental branch from 3aa3446 to 605a8c6 on August 7, 2025 01:40
@nv-guomingz
Copy link
CollaboratorAuthor

/bot skip --comment "docs only change"

Copy link
Contributor

@coderabbitai[bot] left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
docs/source/advanced/speculative-decoding.md (1)

171-171: Remove redundant "of" and clean up punctuation

Minor wording and punctuation polish for clarity and consistency.

-Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft tokens acceptance and draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2 are both supported). Please, visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
+Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft token acceptance, and draft token generation are performed inside the TensorRT engine (both EAGLE-1 and EAGLE-2 are supported). Please visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3aa3446 and 605a8c6.

📒 Files selected for processing (14)
  • README.md (1 hunks)
  • docs/source/advanced/disaggregated-service.md (1 hunks)
  • docs/source/advanced/gpt-attention.md (0 hunks)
  • docs/source/advanced/speculative-decoding.md (1 hunks)
  • docs/source/architecture/model-weights-loader.md (1 hunks)
  • docs/source/performance/perf-benchmarking.md (0 hunks)
  • docs/source/reference/precision.md (1 hunks)
  • docs/source/torch.md (1 hunks)
  • examples/auto_deploy/README.md (2 hunks)
  • examples/disaggregated/README.md (1 hunks)
  • examples/eagle/README.md (0 hunks)
  • examples/models/core/deepseek_v3/README.md (2 hunks)
  • examples/models/core/llama/README.md (2 hunks)
  • examples/sample_weight_stripping/README.md (2 hunks)
💤 Files with no reviewable changes (3)
  • docs/source/advanced/gpt-attention.md
  • examples/eagle/README.md
  • docs/source/performance/perf-benchmarking.md
✅ Files skipped from review due to trivial changes (7)
  • examples/disaggregated/README.md
  • docs/source/advanced/disaggregated-service.md
  • examples/models/core/deepseek_v3/README.md
  • README.md
  • docs/source/architecture/model-weights-loader.md
  • examples/auto_deploy/README.md
  • docs/source/reference/precision.md
🚧 Files skipped from review as they are similar to previous changes (3)
  • examples/sample_weight_stripping/README.md
  • docs/source/torch.md
  • examples/models/core/llama/README.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@tensorrt-cicd
Copy link
Collaborator

PR_Github #14351 [ skip ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #14351 [ skip ] completed with state SUCCESS
Skipping testing for commit 605a8c6

@nv-guomingz merged commit f7f46a5 into NVIDIA:main on Aug 7, 2025
3 of 4 checks passed
nv-guomingz added a commit to nv-guomingz/TensorRT-LLM that referenced this pull request on Aug 7, 2025
…A#5995)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
@nv-guomingz deleted the user/guomingz/clean_experimental branch on September 30, 2025 07:46

Reviewers

@FrankD412 left review comments

@coderabbitai[bot] left review comments

@laikhtewari approved these changes

@yweng0828 approved these changes

@Shixiaowei02 approved these changes

@Barry-Delaney approved these changes

@lowsfer: awaiting requested review

@QiJune: awaiting requested review

@lucaslie: awaiting requested review (code owner automatically assigned from NVIDIA/trtllm-bench-reviewers)

@kaiyux: awaiting requested review

@schetlur-nv: awaiting requested review

@zhuolingwang: awaiting requested review

@litaotju: awaiting requested review

@yizhang-nv: awaiting requested review

+1 more reviewer

@Njuapp left review comments

Reviewers whose approvals may not affect merge requirements

Assignees

No one assigned

Labels

None yet

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

8 participants

@nv-guomingz @tensorrt-cicd @FrankD412 @laikhtewari @Njuapp @yweng0828 @Shixiaowei02 @Barry-Delaney
