[TRTLLM-7728][feat] batched sampling by strategy (supersedes enable_mixed_sampler, cf. TRTLLM-7156) #7294


Merged
dcampora merged 1 commit into NVIDIA:main from ixlmar:feat/batched-sampling
Sep 23, 2025

Conversation

@ixlmar (Collaborator) commented Aug 27, 2025 (edited by coderabbitai bot)

Description

  • This PR removes the enable_mixed_sampler option, including the "fast_path" code path controlled by it. At the same time, the changes aim to improve the performance of the general code path so that greedy sampling without drafting, previously covered by "fast_path", sees no significant degradation.

  • Removing enable_mixed_sampler also avoids the non-intuitive behavior evidenced in "Investigate TRTLLM runtime repetitive issue" #5254.

  • Moreover, the revised code streamlines request processing via batching for non-greedy sampling and in the presence of draft tokens, i.e., in situations not previously covered by "fast_path".

  • This PR opportunistically vectorizes TorchSampler._apply_embedding_bias.

  • This PR opportunistically skips the softmax computation when it is not needed (e.g., for greedy sampling).

  • It adds @torch.inference_mode() in TorchSampler.

  • Lastly, the PR includes some minor code improvements (type annotations, etc.).
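The core idea described above, grouping requests that share a sampling strategy so each distinct strategy costs one batched call instead of one call per request, can be sketched roughly as follows. All names here are illustrative, not the PR's actual API:

```python
from collections import defaultdict

import torch

def group_by_strategy(strategies: list) -> dict:
    """Map each distinct strategy tuple to the request indices using it."""
    groups = defaultdict(list)
    for idx, strategy in enumerate(strategies):
        groups[strategy].append(idx)
    return dict(groups)

def sample_batched(logits: torch.Tensor, strategies: list) -> torch.Tensor:
    """Sample one token per row, with one batched call per strategy group."""
    tokens = torch.empty(logits.size(0), dtype=torch.long)
    for strategy, rows in group_by_strategy(strategies).items():
        idx = torch.tensor(rows)
        group_logits = logits[idx]
        if strategy[0] == "greedy":
            tokens[idx] = group_logits.argmax(dim=-1)
        else:  # e.g. ("top_p", p): simplistic multinomial stand-in
            probs = torch.softmax(group_logits, dim=-1)
            tokens[idx] = torch.multinomial(probs, num_samples=1).squeeze(-1)
    return tokens
```

With this layout the number of sampling kernel launches scales with the number of distinct strategies in the batch rather than with the number of requests.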

Test Coverage

  • Updated existing tests to account for the enable_mixed_sampler option removal.
  • Since the PR contains mostly non-functional changes (performance and general code improvements), existing test coverage is assumed to be sufficient.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test [pipeline-id] (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL): Disable fail-fast on build/test/infra failures.

--skip-test (OPTIONAL): Skip all test stages, but still run build, package, and sanity-check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL): Skip test stages that don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL): Force-run the multi-GPU tests in addition to the L0 pre-merge pipeline.

--post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline plus the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL): Enable flushing all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

Summary by CodeRabbit

  • New Features

    • Batched sampling for parallel request decoding and a combined Top-K+Top-P sampling strategy with per-request dispatch.
  • Refactor

    • Removed mixed-sampler user configuration and related flags from public configs and init paths.
    • Streamlined sampler, request-drafting, and logits-handling pipelines; per-request draft-logit policy is now sampler-driven.
    • RMS normalization now returns residual only when explicitly requested.
  • Tests

    • Unit/integration test IDs and parameterizations updated to match the new sampling flow.
  • Chores

    • Minimum Python requirement raised to 3.10.
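The "residual only when explicitly requested" behavior listed above relies on a sentinel default argument, a common Python pattern for distinguishing "argument omitted" from "argument explicitly None". A minimal illustrative sketch with hypothetical names (the PR's actual sentinel is `_ARGUMENT_NOT_SPECIFIED_SENTINEL` in `rms_norm.py`):

```python
_NOT_SPECIFIED = object()  # module-private sentinel, distinct from None

def rms_norm_like(x: list, residual=_NOT_SPECIFIED):
    """Return the normed values; also return a residual only when one was passed."""
    scale = (sum(v * v for v in x) / len(x)) ** 0.5 or 1.0
    normed = [v / scale for v in x]
    if residual is _NOT_SPECIFIED:
        return normed  # caller did not ask for a residual
    if residual is None:
        residual = [0.0] * len(x)
    return normed, [v + r for v, r in zip(x, residual)]
```

Unlike the previous Ellipsis-based scheme, a private `object()` sentinel cannot collide with any value a caller could legitimately pass.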

@coderabbitai (Contributor) commented Aug 27, 2025 (edited)

📝 Walkthrough

Walkthrough

Removed the public enable_mixed_sampler flag across configs and call sites; refactored the Torch sampler into a batched, per-strategy sampling pipeline with new index translators and sample() APIs; ModelDrafter always routes logits through the sampler and defers the draft-prob policy to the sampler; RMSNorm replaces the Ellipsis-based residual with an explicit sentinel.

Changes

Cohort / File(s) — Summary

Remove enable_mixed_sampler option
tensorrt_llm/_torch/auto_deploy/llm_args.py, tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py, tensorrt_llm/_torch/pyexecutor/_util.py, tensorrt_llm/_torch/pyexecutor/config.py, tensorrt_llm/llmapi/llm_args.py, tensorrt_llm/scaffolding/worker.py, tests/unittest/llmapi/test_llm_pytorch.py
Deletes the enable_mixed_sampler field/argument across public configs, executor/util wiring, scaffolding, and test parameterization; call sites are updated to omit the flag; TorchSampler.Args no longer includes it.

Sampler batched refactor & API additions
tensorrt_llm/_torch/pyexecutor/sampler.py
Major refactor adding per-strategy grouping, batched sampling helpers (torch_multi_arange, the StepIndexTranslator family), _BatchedSamplingResult, a new sample() API, embedding-bias application, a batched pack/unpack flow, a new Strategy variant TopKTopP, should_provide_draft_probs, and removal of TorchSampler.Args.enable_mixed_sampler.

Speculative decoding / drafter integration
tensorrt_llm/_torch/speculative/model_drafter.py, tensorrt_llm/_torch/speculative/eagle3.py
ModelDrafter always routes logits through the sampler, builds the draft LlmRequest with exclude_last_generation_logits=True, uses sampler.should_provide_draft_probs(request) for the draft-prob policy, and adds TYPE_CHECKING imports for type-only hints.

RMSNorm residual sentinel change
tensorrt_llm/_torch/modules/rms_norm.py
Replaces the Ellipsis-based optional residual with _ARGUMENT_NOT_SPECIFIED_SENTINEL; updates forward/skip_forward signatures, presence checks, dtype handling, and conditional residual return behavior.

LlmRequest typing tweak
tensorrt_llm/_torch/pyexecutor/llm_request.py
Adds the static type annotation Optional[torch.Tensor] for the private _py_embedding_bias_1d; runtime initialization unchanged.

Tests: single-path TorchSampler test
tests/unittest/llmapi/test_llm_pytorch.py
Removes parameterization over enable_mixed_sampler and enable_logprobs; the test is simplified to a single TorchSampler path.

Test-list backend tag renames (YAML)
tests/integration/test_lists/test-db/l0_dgx_b200.yml, tests/integration/test_lists/test-db/l0_dgx_h100.yml
Renames backend test identifiers/tags (uppercase → lowercase / new backend-tag names) for specific GPTOSS entries; pure test-id string updates.

Packaging requirement bump
setup.py
Raises python_requires from ">=3.7, <4" to ">=3.10, <4".

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client
  participant TRTExec as TRT Executor
  participant Sampler as TorchSampler
  participant Gen as Generator
  participant Req as LlmRequest[]
  Client->>TRTExec: sample_async(requests)
  TRTExec->>Sampler: _process_requests(requests, logits, states)
  Sampler->>Sampler: _group_requests_by_sampling_strategy(requests)
  Sampler->>Gen: get_generator(device)
  par Per-strategy batched sampling
    Sampler->>Sampler: pack indices (torch_multi_arange / indexers)
    alt embedding-bias present
      Sampler->>Sampler: _apply_embedding_bias
    end
    Sampler->>Sampler: sample(strategy, logits_batch, generator, softmax_indices?)
  end
  Sampler-->>TRTExec: _BatchedSamplingResult (tokens, softmax, mappings)
  TRTExec->>TRTExec: unbatch -> update per-request states
  TRTExec-->>Client: next tokens (+ optional probs)

sequenceDiagram
  autonumber
  participant Drafter as ModelDrafter
  participant Sampler as TorchSampler
  participant Req as LlmRequest(draft)
  Drafter->>Sampler: should_provide_draft_probs(request)
  Drafter->>Req: create(exclude_last_generation_logits=True, return_generation_logits=from sampler)
  Drafter->>Sampler: sample_async(sample_state)
  Sampler-->>Drafter: draft tokens (+/- draft probs)
  Drafter->>Drafter: forward draft logits into target request (py_draft_logits)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested reviewers

  • mikeiovine
  • nv-guomingz
  • netanel-haber
  • QiJune
  • syuoni

coderabbitai bot (Contributor) left a comment:

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)

1-8: Missing NVIDIA license header

Add the standard 2025 NVIDIA SPDX header.

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
tensorrt_llm/_torch/modules/rms_norm.py (1)

56-62: FlashInfer path returns a stale residual; semantics diverge from the non-Flash path

When residual is provided, non-Flash returns (normed, input+residual). FlashInfer currently returns (normed, original residual). This breaks downstream skip/residual chaining.

Apply this fix to compute and return the updated residual consistently:

         if IS_FLASHINFER_AVAILABLE:
             from ..custom_ops import (flashinfer_fused_add_rmsnorm,
                                       flashinfer_rmsnorm)
-            if residual is not None:
-                flashinfer_fused_add_rmsnorm(hidden_states, residual,
-                                             self.weight, self.variance_epsilon)
+            if residual is not None:
+                # Keep semantics consistent with the non-Flash path:
+                # residual_out = (input + residual) in input dtype.
+                input_dtype = hidden_states.dtype
+                new_residual = (hidden_states.to(torch.float32) +
+                                residual.to(torch.float32)).to(input_dtype)
+                flashinfer_fused_add_rmsnorm(hidden_states, residual,
+                                             self.weight, self.variance_epsilon)
+                residual = new_residual
             else:
                 hidden_states = flashinfer_rmsnorm(hidden_states, self.weight,
                                                    self.variance_epsilon)
@@
-        if residual is None:
+        if residual is None:
             return hidden_states
         else:
             return hidden_states, residual

Also applies to: 74-78

tensorrt_llm/_torch/pyexecutor/_util.py (2)

1-8: Missing NVIDIA license header

Add the standard 2025 NVIDIA SPDX header.

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.+# SPDX-License-Identifier: Apache-2.0+

576-587: Remove stale enable_mixed_sampler reference in API stability tests

The create_torch_sampler_args signature change is correctly reflected in code, and all internal callers have been updated. However, the API-stability specification still lists the removed enable_mixed_sampler field:

  • tests/unittest/api_stability/references/llm.yaml: lines 110-112 reference enable_mixed_sampler

Please remove or update that entry so the stability tests remain in sync with the new TorchSampler.Args.

🧹 Nitpick comments (6)
tensorrt_llm/_torch/speculative/eagle3.py (1)

17-19: TYPE_CHECKING import: LGTM

The static-only import avoids runtime dependency cycles. Optionally enable postponed annotations to drop the quotes.

Apply this small tweak (optional):

+from __future__ import annotations
tests/unittest/llmapi/test_llm_pytorch.py (1)

229-233: Broaden coverage: parametrize enable_logprobs for both True/False

The current test only covers logprobs=True. Include False to exercise the non-logprobs path with embedding_bias.

Apply:

-@pytest.mark.parametrize(
-    "enable_logprobs",
-    [True],
-)
+@pytest.mark.parametrize("enable_logprobs", [False, True])

Also applies to: 253-259

tensorrt_llm/_torch/pyexecutor/sampler.py (4)

166-192: Consider extracting magic values as constants

The batched sampling functions use hardcoded default values. Consider extracting them for maintainability.

+# Sampling defaults
+DEFAULT_TOP_K = 50
+DEFAULT_TOP_P = 0.9
+DEFAULT_TEMPERATURE = 1.0
+MIN_TEMPERATURE = 1e-5
+
 def top_k_sampling_batch(
     logits,
-    top_k=50,
+    top_k=DEFAULT_TOP_K,
     generator: Optional[torch.Generator] = None
 ) -> tuple[torch.Tensor, torch.Tensor]:

 def top_p_sampling_batch(
     logits: torch.Tensor,
-    top_p: float = 0.9,
-    temperature: float = 1.0,
+    top_p: float = DEFAULT_TOP_P,
+    temperature: float = DEFAULT_TEMPERATURE,
     generator: Optional[torch.Generator] = None
 ) -> tuple[torch.Tensor, torch.Tensor]:

353-361: Add type hints for improved maintainability

The function would benefit from explicit return type hints.

 def _group_requests_by_sampling_strategy(
         requests: Iterable[LlmRequest]) -> dict[Strategy, torch.IntTensor]:
     strategy_dict: dict[Strategy, list[int]] = defaultdict(list)

1052-1190: Well-structured batched sampling pipeline

The _sample_batched_by_strategy method effectively groups requests by strategy and processes them in batches. The handling of draft-logit indices for speculation is properly tracked.

One concern: per the FIXME comment, the d2t validation at lines 1071-1075 might be better placed in ModelDrafter.prepare_draft_tokens, maintaining a cleaner separation of concerns.


1332-1333: Fix line length violations

These lines exceed the 120-character limit.

-            #    range(next_context_req_offsets_cuda[i] - req_num_steps_fictitious_cuda[i], next_context_req_offsets_cuda[i])
-            # And if generation requests are present, those tensors include a trailing entries selecting all tokens/logits
+            #    range(next_context_req_offsets_cuda[i] - req_num_steps_fictitious_cuda[i],
+            #          next_context_req_offsets_cuda[i])
+            # And if generation requests are present, those tensors include trailing
+            # entries selecting all tokens/logits
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 8b21613 and 91e916f.

📒 Files selected for processing (12)
  • tensorrt_llm/_torch/auto_deploy/llm_args.py (0 hunks)
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (0 hunks)
  • tensorrt_llm/_torch/modules/rms_norm.py (3 hunks)
  • tensorrt_llm/_torch/pyexecutor/_util.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/config.py (0 hunks)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/sampler.py (23 hunks)
  • tensorrt_llm/_torch/speculative/eagle3.py (2 hunks)
  • tensorrt_llm/_torch/speculative/model_drafter.py (6 hunks)
  • tensorrt_llm/llmapi/llm_args.py (0 hunks)
  • tensorrt_llm/scaffolding/worker.py (0 hunks)
  • tests/unittest/llmapi/test_llm_pytorch.py (2 hunks)
💤 Files with no reviewable changes (5)
  • tensorrt_llm/scaffolding/worker.py
  • tensorrt_llm/_torch/pyexecutor/config.py
  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/_torch/auto_deploy/llm_args.py
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in __init__
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/speculative/eagle3.py
  • tests/unittest/llmapi/test_llm_pytorch.py
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/_torch/modules/rms_norm.py
  • tensorrt_llm/_torch/pyexecutor/sampler.py
  • tensorrt_llm/_torch/speculative/model_drafter.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/speculative/eagle3.py
  • tests/unittest/llmapi/test_llm_pytorch.py
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/_torch/modules/rms_norm.py
  • tensorrt_llm/_torch/pyexecutor/sampler.py
  • tensorrt_llm/_torch/speculative/model_drafter.py
🧬 Code graph analysis (6)
tensorrt_llm/_torch/speculative/eagle3.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
  • EagleDecodingConfig (418-444)
tests/unittest/llmapi/test_llm_pytorch.py (1)
tests/unittest/llmapi/test_llm.py (1)
  • llm_test_harness (109-136)
tensorrt_llm/_torch/pyexecutor/_util.py (1)
tensorrt_llm/_torch/attention_backend/trtllm.py (2)
  • max_seq_len (558-568)
  • max_seq_len (571-575)
tensorrt_llm/_torch/modules/rms_norm.py (4)
tensorrt_llm/_torch/custom_ops/flashinfer_custom_ops.py (2)
  • flashinfer_fused_add_rmsnorm (32-35)
  • flashinfer_rmsnorm (21-23)
tensorrt_llm/_torch/models/modeling_utils.py (1)
  • skip_forward (148-159)
tensorrt_llm/_torch/modules/layer_norm.py (1)
  • skip_forward (102-120)
tensorrt_llm/_torch/modules/decoder_layer.py (1)
  • skip_forward (23-34)
tensorrt_llm/_torch/pyexecutor/sampler.py (3)
tensorrt_llm/_torch/pyexecutor/llm_request.py (4)
  • LlmRequest (282-422)
  • append (79-98)
  • append (125-142)
  • get_draft_token_length (556-567)
tensorrt_llm/_torch/speculative/mtp.py (1)
  • sample_async (271-318)
tensorrt_llm/_torch/pyexecutor/scheduler.py (2)
  • ScheduledRequests (18-39)
  • all_requests (38-39)
tensorrt_llm/_torch/speculative/model_drafter.py (5)
tensorrt_llm/_torch/pyexecutor/sampler.py (20)
  • Sampler (52-83)
  • SampleState (43-49)
  • should_provide_draft_probs (81-83)
  • should_provide_draft_probs (1413-1415)
  • beam_width (72-75)
  • is_generation_model (78-79)
  • is_generation_model (109-110)
  • is_generation_model (162-163)
  • is_generation_model (718-719)
  • is_generation_model (1449-1450)
  • sample_async (63-65)
  • sample_async (93-96)
  • sample_async (134-141)
  • sample_async (940-956)
  • sample_async (1596-1684)
  • update_requests (68-69)
  • update_requests (99-106)
  • update_requests (144-159)
  • update_requests (901-925)
  • update_requests (1688-1701)
tensorrt_llm/_torch/pyexecutor/scheduler.py (2)
  • ScheduledRequests (18-39)
  • all_requests (38-39)
tensorrt_llm/_torch/pyexecutor/llm_request.py (4)
  • LlmRequest (282-422)
  • append (79-98)
  • append (125-142)
  • generation_logits (214-223)
tensorrt_llm/_torch/pyexecutor/handle_logits.py (1)
  • HandleLogits (11-81)
tensorrt_llm/_torch/speculative/mtp.py (2)
  • sample_async (271-318)
  • update_requests (245-269)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/sampler.py

1332-1332: Line too long (125 > 120)

(E501)


1333-1333: Line too long (122 > 120)

(E501)

🔇 Additional comments (18)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)

366-373: Type annotation addition: LGTM

The optional 1D cache for the embedding bias is clearly typed and initialized.

tensorrt_llm/_torch/modules/rms_norm.py (1)

49-53: No Ellipsis passed to RMSNorm residual
I ran targeted searches for any use of the literal ... passed, either positionally or via residual=..., to RMSNorm (and its implicit __call__ to forward) and found zero occurrences. This confirms that no call sites propagate Ellipsis into the new residual=None API. The change is safe to merge.

tensorrt_llm/_torch/pyexecutor/_util.py (1)

593-596: No change needed: engine.max_seq_len already covers both KV-cache and no-cache cases

The max_seq_len property on TRTLLMEngine is implemented to:

  • return the KVCacheManager's maximum sequence length when KV caching is enabled
  • fall back to the internally stored _max_seq_len_storage for no-cache attention

Thus, passing engine.max_seq_len into create_torch_sampler_args already applies the correct cap in both scenarios; no further modifications are required here.

Likely an incorrect or invalid review comment.

tensorrt_llm/_torch/speculative/model_drafter.py (6)

59-59: LGTM! Improved clarity

The error message change from an f-string to a plain string is appropriate since there are no variables to interpolate.


85-86: Good comment explaining overlap scheduling

The comment explaining exclude_last_generation_logits=True for overlap scheduling provides valuable context.


268-285: Simplified sampling logic with proper error handling

The refactored _sample_async method:

  1. Correctly computes num_context_logits_prefix_sum, accounting for py_return_context_logits
  2. Properly delegates to HandleLogits and the sampler's sample_async
  3. Has appropriate error handling that returns None on failure

The removal of optional sampler handling simplifies the control flow.


300-301: LGTM! Simplified request update logic

The removal of the None-check guard for sampler is consistent with the overall simplification where the sampler is now always required.


317-317: Properly handles None propagation for draft logits

The comment indicates that None values are correctly forwarded when draft logits are not available, which maintains backward compatibility.


76-90: Draft logit policy integration verified

The should_provide_draft_probs API is declared in Sampler (returns True by default) and correctly overridden in TRTLLMSampler to disable draft logits under greedy sampling. A search shows its sole invocation for draft requests in model_drafter.py, ensuring consistent per-request logic without residual flags.

tensorrt_llm/_torch/pyexecutor/sampler.py (9)

81-83: Conservative default for draft probability handling

The base should_provide_draft_probs method returns True as a conservative default, ensuring backward compatibility. Derived classes can override it for optimization.


276-285: Efficient greedy search with optional softmax filtering

The softmax_indices parameter enables selective softmax computation, skipping unnecessary calculations for greedy sampling, a nice performance optimization.
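The selective-softmax idea reviewed here can be illustrated with a small, hypothetical sketch (names and signature are illustrative, not the PR's actual API):

```python
from typing import Optional

import torch

def greedy_with_optional_softmax(
        logits: torch.Tensor,
        softmax_indices: Optional[torch.Tensor] = None):
    """Greedy tokens for every row; softmax only for the rows that need it."""
    tokens = logits.argmax(dim=-1)
    probs = None
    if softmax_indices is not None and softmax_indices.numel() > 0:
        # Compute probabilities only for the subset of rows that require them,
        # e.g. rows feeding speculative-decoding rejection sampling.
        probs = torch.softmax(logits[softmax_indices], dim=-1)
    return tokens, probs
```

When no row needs probabilities, the softmax (and its memory traffic over the full vocabulary) is skipped entirely.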


288-310: Fixed potential issue with padded draft tokens

Good catch! The comment at lines 291-292 correctly notes that draft_probs is not padded while draft_tokens may be padded by ModelDrafter._pad_to_max_draft_tokens. The implementation correctly handles this by slicing draft_tokens[:num_draft_tokens].


422-477: Excellent implementation of multi-arange utility

The torch_multi_arange function provides an efficient batched implementation with clear documentation and an algorithm explanation. The use of torch.repeat_interleave() and torch.cumsum() to avoid loops is clever.
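The repeat_interleave/cumsum trick praised here can be reimplemented as a small, self-contained sketch (an illustration of the idea, not the PR's actual torch_multi_arange):

```python
import torch

def multi_arange(starts: torch.Tensor, ends: torch.Tensor) -> torch.Tensor:
    """Concatenate arange(starts[i], ends[i]) for all i without a Python loop.

    Equivalent to: torch.cat([torch.arange(s, e) for s, e in zip(starts, ends)])
    """
    lengths = ends - starts
    # Tag every output slot with its segment's start value.
    seg_starts = torch.repeat_interleave(starts, lengths)
    # The global position of each slot minus its segment's cumulative offset
    # yields the within-segment offset.
    flat_idx = torch.arange(seg_starts.numel())
    seg_offsets = torch.repeat_interleave(
        torch.cumsum(lengths, dim=0) - lengths, lengths)
    return seg_starts + flat_idx - seg_offsets
```

Everything runs as a handful of vectorized tensor ops, so the cost is independent of the number of segments on the Python side.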


767-767: Good device consistency check

Adding the assertion ensures the generator is on the expected device, preventing subtle CUDA errors.


965-1050: Sophisticated embedding bias application

The _apply_embedding_bias implementation:

  1. Handles per-request bias tensors efficiently
  2. Uses smart batching to minimize memory operations
  3. Includes detailed comments explaining the trade-offs
  4. Correctly uses non-blocking transfers

The complexity is justified given the performance considerations.
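A simplified, hypothetical sketch of the batching idea behind such an embedding-bias step (not the PR's actual implementation, which also handles device transfers):

```python
import torch

def apply_embedding_bias(logits: torch.Tensor, biases: list) -> torch.Tensor:
    """Add per-request vocabulary biases to the matching logits rows.

    biases[i] is either None or a 1-D tensor of vocab size for request i;
    all biased rows are updated with one stacked index_add_ instead of a
    per-request Python loop.
    """
    rows = [i for i, b in enumerate(biases) if b is not None]
    if not rows:
        return logits
    idx = torch.tensor(rows)
    bias_batch = torch.stack([biases[i] for i in rows])  # one contiguous tensor
    out = logits.clone()
    out.index_add_(0, idx, bias_batch)
    return out
```

Stacking the biases into one tensor before the scatter keeps the number of kernel launches (and, in the real code, host-to-device copies) independent of the number of biased requests.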


1192-1286: Complex but necessary unbatching logic

The _unbatch_sampling_results method correctly handles:

  1. Dimension ordering verification
  2. Efficient scatter operations using custom indexers
  3. Proper handling of log probabilities with different tensor layouts
  4. Non-blocking D2H transfers for performance

The FIXMEs at lines 1256 and 1260 suggest potential optimizations for memory layout that could be addressed in future iterations.


1344-1410: Main processing pipeline successfully refactored

The _process_requests method integrates all the batched sampling components effectively:

  1. Computes per-request steps
  2. Selects generated logits
  3. Applies embedding bias
  4. Performs batched sampling by strategy
  5. Unbatches results

This is the core of the performance improvements replacing the old "fast_path" code.


1413-1415: Good optimization for greedy sampling

The override correctly avoids requesting draft probabilities for greedy sampling, which does not require rejection sampling.
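The policy-hook pattern reviewed here can be sketched with minimal, hypothetical classes (the real Sampler base class and request type are more involved):

```python
class BaseSampler:
    """Illustrative base class: conservative default policy."""

    def should_provide_draft_probs(self, request) -> bool:
        return True  # always provide draft probabilities by default

class GreedyAwareSampler(BaseSampler):
    """Illustrative subclass: skip draft probs for greedy requests."""

    def should_provide_draft_probs(self, request) -> bool:
        # Greedy decoding accepts draft tokens by exact match, so rejection
        # sampling (and hence draft probabilities) is unnecessary.
        return not getattr(request, "is_greedy", False)
```

The drafter queries this hook per request, so the decision lives with the sampler that will later consume (or ignore) the probabilities.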

@ixlmar force-pushed the feat/batched-sampling branch 2 times, most recently from 2b209dc to 6e89d3b (compare) August 27, 2025 10:37
@ixlmar requested a review from dcampora August 27, 2025 10:38
@ixlmar force-pushed the feat/batched-sampling branch from 6e89d3b to f750bba (compare) August 27, 2025 10:43
@ixlmar (Collaborator, Author) commented:

/bot run

@ixlmar changed the title from "[TRTLLM-7156, TRTLLM-7728][feat] batched sampling by strategy (supersedes enable_mixed_sampler)" to "[TRTLLM-7156][TRTLLM-7728][feat] batched sampling by strategy (supersedes enable_mixed_sampler)" Aug 27, 2025
@tensorrt-cicd
Copy link
Collaborator

PR_Github #16682 [ run ] triggered by Bot

@ixlmar changed the title from "[TRTLLM-7156][TRTLLM-7728][feat] batched sampling by strategy (supersedes enable_mixed_sampler)" to "[TRTLLM-7728][feat] batched sampling by strategy (supersedes enable_mixed_sampler, TRTLLM-7156)" Aug 27, 2025
@ixlmar changed the title from "[TRTLLM-7728][feat] batched sampling by strategy (supersedes enable_mixed_sampler, TRTLLM-7156)" to "[TRTLLM-7728][feat] batched sampling by strategy (supersedes enable_mixed_sampler, cf. TRTLLM-7156)" Aug 27, 2025
coderabbitai bot (Contributor) left a comment:

Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
tensorrt_llm/_torch/speculative/model_drafter.py (1)

1-1: Add required NVIDIA copyright header

Per repo guidelines, prepend the NVIDIA copyright header (current year).

Apply:

```diff
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
 from __future__ import annotations
```
tensorrt_llm/_torch/pyexecutor/sampler.py (2)

288-310: get_rejected_indices(): fix index dtypes and shape bug for single-step case

  • Index tensors must be int64.
  • Removing squeeze(0) avoids breaking 2D indexing when num_steps == 1.

Apply:

```diff
-    token_idx = torch.arange(num_draft_tokens,
-                             dtype=torch.int32,
-                             device=generator.device)
-    draft_tokens_cuda = torch.tensor(draft_tokens, dtype=torch.int32).to(
+    token_idx = torch.arange(num_draft_tokens,
+                             dtype=torch.long,
+                             device=generator.device)
+    draft_tokens_cuda = torch.tensor(draft_tokens, dtype=torch.long).to(
         device=generator.device, non_blocking=True)
     p = draft_probs[token_idx, draft_tokens_cuda]
-    q = target_probs.squeeze(0)[token_idx, draft_tokens_cuda]
+    q = target_probs[token_idx, draft_tokens_cuda]
```
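The pairwise gather pattern at the heart of this suggestion can be shown in isolation (toy values; int64 is used here as the universally safe index dtype, though some ops also accept int32):

```python
import torch

# One probability per (step, token) pair: probs[token_idx[i], token_ids[i]].
probs = torch.arange(12, dtype=torch.float32).reshape(3, 4)
token_idx = torch.arange(3, dtype=torch.long)          # step index per draft token
token_ids = torch.tensor([1, 0, 3], dtype=torch.long)  # drafted token id per step
gathered = probs[token_idx, token_ids]                 # shape (3,)
```

Keeping `probs` 2D (no `squeeze(0)`) means the same indexing expression works for a single step as well as for multiple steps.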

763-777: Support multi-device by storing per-device RNG

Asserting a single device can break in multi-GPU contexts. Keep one deterministic generator per device.

Apply:

```diff
     def __init__(self, args: Args):
@@
-        self._generator = None
+        self._generators: dict[torch.device, torch.Generator] = {}
@@
-    def get_generator(self, device: torch.device) -> torch.Generator:
+    def get_generator(self, device: torch.device) -> torch.Generator:
@@
-        if self._generator is None:
-            # Fallback to a default seed if not set
-            self._generator = torch.Generator(device=device)
-            self._generator.manual_seed(self._global_seed)
-        assert self._generator.device == device
-        return self._generator
+        gen = self._generators.get(device)
+        if gen is None or gen.device != device:
+            gen = torch.Generator(device=device)
+            gen.manual_seed(self._global_seed)
+            self._generators[device] = gen
+        return gen
```
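A minimal self-contained sketch of the suggested per-device cache (the class name SeededGenerators is made up for illustration):

```python
import torch


class SeededGenerators:
    """Cache one deterministic torch.Generator per device, all seeded
    identically (illustrative sketch of the per-device RNG suggestion)."""

    def __init__(self, seed: int = 42):
        self._seed = seed
        self._generators: dict[torch.device, torch.Generator] = {}

    def get(self, device) -> torch.Generator:
        device = torch.device(device)
        gen = self._generators.get(device)
        if gen is None:
            # Lazily create and seed a generator the first time a device is seen.
            gen = torch.Generator(device=device)
            gen.manual_seed(self._seed)
            self._generators[device] = gen
        return gen
```

Repeated lookups for the same device return the same generator object, so sampling state advances consistently per device instead of failing an assert.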
🧹 Nitpick comments (2)
tensorrt_llm/_torch/speculative/model_drafter.py (1)

315-319: Forwarding generation logits: guard against tensor-device mismatches

Assigning req.py_result.generation_logits directly to target_model_req.py_draft_logits is fine; ensure both tensors reside on the same device (expected CUDA). If PyResult ever stores CPU tensors (e.g., for CPU-only tests), consider moving to the target device or documenting the assumption.

tensorrt_llm/_torch/pyexecutor/sampler.py (1)

860-898: Consistent attribute: use py_seq_slot instead of seq_slot

Elsewhere the code uses py_seq_slot; mixing both can cause subtle bugs if they ever diverge.

Apply:

```diff
-            new_tokens[i, request.seq_slot, self.BEAM] = new_token
+            new_tokens[i, request.py_seq_slot, self.BEAM] = new_token
@@
-            new_tokens[num_accepted, request.seq_slot, self.BEAM] = new_token
+            new_tokens[num_accepted, request.py_seq_slot, self.BEAM] = new_token
```
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 6e89d3b and f750bba.

📒 Files selected for processing (12)
  • tensorrt_llm/_torch/auto_deploy/llm_args.py (0 hunks)
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (0 hunks)
  • tensorrt_llm/_torch/modules/rms_norm.py (3 hunks)
  • tensorrt_llm/_torch/pyexecutor/_util.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/config.py (0 hunks)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/sampler.py (23 hunks)
  • tensorrt_llm/_torch/speculative/eagle3.py (2 hunks)
  • tensorrt_llm/_torch/speculative/model_drafter.py (6 hunks)
  • tensorrt_llm/llmapi/llm_args.py (0 hunks)
  • tensorrt_llm/scaffolding/worker.py (0 hunks)
  • tests/unittest/llmapi/test_llm_pytorch.py (2 hunks)
💤 Files with no reviewable changes (5)
  • tensorrt_llm/_torch/pyexecutor/config.py
  • tensorrt_llm/_torch/auto_deploy/llm_args.py
  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/scaffolding/worker.py
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
🚧 Files skipped from review as they are similar to previous changes (5)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tensorrt_llm/_torch/speculative/eagle3.py
  • tests/unittest/llmapi/test_llm_pytorch.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/_torch/modules/rms_norm.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in __init__
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/speculative/model_drafter.py
  • tensorrt_llm/_torch/pyexecutor/sampler.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/speculative/model_drafter.py
  • tensorrt_llm/_torch/pyexecutor/sampler.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/speculative/model_drafter.py (6)
tensorrt_llm/_torch/pyexecutor/sampler.py (20)
  • Sampler (52-83)
  • SampleState (43-49)
  • should_provide_draft_probs (81-83)
  • should_provide_draft_probs (1433-1435)
  • beam_width (72-75)
  • is_generation_model (78-79)
  • is_generation_model (109-110)
  • is_generation_model (162-163)
  • is_generation_model (727-728)
  • is_generation_model (1469-1470)
  • sample_async (63-65)
  • sample_async (93-96)
  • sample_async (134-141)
  • sample_async (951-967)
  • sample_async (1616-1704)
  • update_requests (68-69)
  • update_requests (99-106)
  • update_requests (144-159)
  • update_requests (911-935)
  • update_requests (1708-1721)
tensorrt_llm/_torch/pyexecutor/scheduler.py (2)
  • ScheduledRequests (18-39)
  • all_requests (38-39)
tensorrt_llm/llmapi/llm_args.py (1)
  • DecodingBaseConfig (349-404)
tensorrt_llm/_torch/pyexecutor/llm_request.py (4)
  • LlmRequest (282-424)
  • append (79-98)
  • append (125-142)
  • generation_logits (214-223)
tensorrt_llm/_torch/pyexecutor/handle_logits.py (1)
  • HandleLogits (11-81)
tensorrt_llm/_torch/speculative/mtp.py (2)
  • sample_async (271-318)
  • update_requests (245-269)
tensorrt_llm/_torch/pyexecutor/sampler.py (3)
tensorrt_llm/_torch/pyexecutor/llm_request.py (2)
  • LlmRequest (282-424)
  • get_draft_token_length (558-569)
tensorrt_llm/_torch/speculative/mtp.py (1)
  • sample_async (271-318)
tensorrt_llm/_torch/pyexecutor/scheduler.py (2)
  • ScheduledRequests (18-39)
  • all_requests (38-39)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/sampler.py

1353-1353: Line too long (125 > 120)

(E501)

🔇 Additional comments (3)
tensorrt_llm/_torch/speculative/model_drafter.py (1)

271-282: Prefix-sum logic: confirm 1-logit accounting when context logits are disabled

You count one logit per context request when return_context_logits is False, to keep the last-token logit for potential generation logits. Please verify this matches the upstream model output layout for all decoders.

Would you like a quick script to assert that lengths of num_context_logits_prefix_sum align with actual logits slices during tests?

tensorrt_llm/_torch/pyexecutor/sampler.py (2)

1433-1436: Good: draft probs suppressed for greedy

Skipping draft probabilities when greedy is selected avoids unnecessary softmax work.


166-173: Use LongTensor for indices and adjust type hints

Indexing tensors in PyTorch requires int64 (LongTensor). Using IntTensor/int32 for indices will raise runtime errors in advanced indexing and index_select.

Apply:

```diff
-def top_k_sampling_batch(
+def top_k_sampling_batch(
+    logits,
+    top_k=50,
+    generator: Optional[torch.Generator] = None
+) -> tuple[torch.Tensor, torch.Tensor]:
```

And below (see related comments) ensure all index tensors are int64; also change any annotations using torch.IntTensor to torch.Tensor and cast to long at use sites.

Likely an incorrect or invalid review comment.

@tensorrt-cicd
Collaborator

PR_Github #16682 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #12523 completed with status: 'FAILURE'

Contributor

@coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tests/unittest/llmapi/test_llm_pytorch.py (1)

230-253: Force deterministic greedy sampling and avoid numeric overflow in embedding bias

The test currently relies on default sampling behavior (temperature=None → C++ default 1.0f) and an extreme bias value (torch.finfo(torch.float32).max), which can lead to non-greedy sampling and potential numeric instabilities (infs or NaNs). To ensure determinism and stability across backends, please update the test as follows:

• In tests/unittest/llmapi/test_llm_pytorch.py at test_embedding_bias_with_torch_sampler_strategies() (around line 243):

  • Replace the use of torch.finfo(torch.float32).max with a large but finite bias (e.g. 1e6).
  • Add temperature=0.0 (to force greedy sampling) and top_k=1 (to restrict choices to the top token).

Suggested diff:

```diff
 def test_embedding_bias_with_torch_sampler_strategies():
@@
-    vocab_size_padded = 32000
-    embedding_bias = torch.zeros(vocab_size_padded)
-    embedding_bias[biased_word_id] = torch.finfo(torch.float32).max
+    vocab_size_padded = 32000  # consider deriving from tokenizer/model if it varies
+    embedding_bias = torch.zeros(vocab_size_padded)
+    embedding_bias[biased_word_id] = 1e6  # large finite bias to avoid overflow
@@
-    sampling_kwargs = {
-        "max_tokens": 6,
-        "embedding_bias": embedding_bias,
-    }
+    sampling_kwargs = {
+        "max_tokens": 6,
+        "temperature": 0.0,   # force deterministic greedy sampling
+        "top_k": 1,           # restrict to the highest-scoring token
+        "embedding_bias": embedding_bias,
+    }
```

These changes will make the test behavior consistent and robust across different PyTorch builds and hardware.

tensorrt_llm/_torch/modules/rms_norm.py (1)

63-71: Inconsistent residual semantics between FLASHINFER and non-FLASH paths.

Non-FLASH returns the updated residual (pre-norm sum), FLASHINFER returns the original residual. This diverges behavior and can corrupt downstream fused-add pipelines.

Apply:

```diff
         if IS_FLASHINFER_AVAILABLE:
             from ..custom_ops import (flashinfer_fused_add_rmsnorm,
                                       flashinfer_rmsnorm)
-            if residual is not None:
-                flashinfer_fused_add_rmsnorm(hidden_states, residual,
-                                             self.weight, self.variance_epsilon)
+            if residual is not None:
+                # Preserve parity with non-FLASH path: return pre-norm sum as residual
+                input_dtype = hidden_states.dtype
+                _updated_residual = (hidden_states.to(torch.float32) +
+                                     residual.to(torch.float32)).to(input_dtype)
+                flashinfer_fused_add_rmsnorm(hidden_states, residual,
+                                             self.weight, self.variance_epsilon)
+                residual = _updated_residual
             else:
                 hidden_states = flashinfer_rmsnorm(hidden_states, self.weight,
                                                    self.variance_epsilon)
```
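For reference, the intended parity can be modeled in plain PyTorch (a sketch of the expected semantics under the assumption that the returned residual is the pre-norm sum, not the fused kernel itself):

```python
import torch


def rms_norm_with_residual(hidden, residual, weight, eps=1e-6):
    """Reference semantics for the fused-add RMSNorm path: the returned
    residual is the pre-norm sum of hidden and the incoming residual."""
    x = hidden.to(torch.float32)
    if residual is not None:
        x = x + residual.to(torch.float32)
        residual = x.to(hidden.dtype)  # pre-norm sum becomes the new residual
    # RMS normalization over the last dimension.
    normed = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)
    return normed.to(hidden.dtype) * weight, residual
```

Any fused implementation should match this function on both return values; the review comment above flags that the FLASHINFER branch currently differs in the residual it hands back.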
🧹 Nitpick comments (2)
tests/unittest/llmapi/test_llm_pytorch.py (1)

234-241: Consider deriving vocab_size_padded instead of hard-coding 32000.

Hard-coding may break with model/tokenizer variants that pad differently.

I can propose a helper to compute padded size from tokenizer vocab and alignment used by the sampler if you confirm the pad multiple.

tensorrt_llm/_torch/modules/rms_norm.py (1)

75-78: Clarify intent: residual holds pre-norm sum (not original residual).

This is correct for fused-add pipelines; consider a brief comment to avoid future regressions.

Apply:

```diff
-            if residual is not None:
+            if residual is not None:
+                # Update residual to the pre-norm sum(hidden_states, residual) for downstream fused-add
                 hidden_states = hidden_states + residual.to(torch.float32)
                 residual = hidden_states.to(input_dtype)
```
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f750bba and 8d6a5a6.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/modules/rms_norm.py (5 hunks)
  • tests/unittest/llmapi/test_llm_pytorch.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in __init__
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tests/unittest/llmapi/test_llm_pytorch.py
  • tensorrt_llm/_torch/modules/rms_norm.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tests/unittest/llmapi/test_llm_pytorch.py
  • tensorrt_llm/_torch/modules/rms_norm.py
🧬 Code graph analysis (2)
tests/unittest/llmapi/test_llm_pytorch.py (1)
tests/unittest/llmapi/test_llm.py (1)
  • llm_test_harness (109-136)
tensorrt_llm/_torch/modules/rms_norm.py (3)
tensorrt_llm/_torch/models/modeling_utils.py (1)
  • skip_forward (148-159)
tensorrt_llm/_torch/modules/layer_norm.py (1)
  • skip_forward (102-120)
tensorrt_llm/_torch/modules/decoder_layer.py (1)
  • skip_forward (23-34)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/modules/rms_norm.py (1)

63-71: In-place semantics confirmed for flashinfer_fused_add_rmsnorm
Verified that flashinfer_fused_add_rmsnorm is registered as a custom op with mutates_args=("input", "residual") and its Python signature returns None, meaning it mutates hidden_states (and residual) in-place. No changes required.

@ixlmar
Collaborator (Author)

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16697 [ run ] triggered by Bot

Contributor

@coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 8d6a5a6 and 12e937b.

📒 Files selected for processing (2)
  • tests/integration/test_lists/test-db/l0_dgx_b200.yml (2 hunks)
  • tests/integration/test_lists/test-db/l0_dgx_h100.yml (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/test_lists/test-db/l0_dgx_h100.yml
  • tests/integration/test_lists/test-db/l0_dgx_b200.yml
📚 Learning: 2025-08-26T09:49:04.956Z
Learnt from: pengbowang-nv
PR: NVIDIA/TensorRT-LLM#7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").

Applied to files:

  • tests/integration/test_lists/test-db/l0_dgx_h100.yml
  • tests/integration/test_lists/test-db/l0_dgx_b200.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tests/integration/test_lists/test-db/l0_dgx_b200.yml (1)

97-101: LGTM: standardized lowercase backend specifiers for GPTOSS post_merge.

Consistent with the broader rename pattern in this PR.

tests/integration/test_lists/test-db/l0_dgx_h100.yml (1)

154-159: LGTM: backend tag casing normalized (CUTLASS/TRITON → cutlass/triton).

Matches the naming scheme adopted elsewhere; should reduce param-id drift across files.

Contributor

@coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/modules/rms_norm.py (1)

64-73: Residual return semantics diverge between FLASHINFER and fallback paths.

Fallback returns the pre-norm sum as residual (Line 78), while the FLASHINFER branch returns the original input residual unchanged, breaking parity and likely confusing downstream consumers expecting the updated residual. Fix by computing the summed residual for return in the FLASHINFER path without altering the inputs to the fused op.

```diff
         if IS_FLASHINFER_AVAILABLE:
             from ..custom_ops import (flashinfer_fused_add_rmsnorm,
                                       flashinfer_rmsnorm)
-            if residual is not None:
-                flashinfer_fused_add_rmsnorm(hidden_states, residual,
-                                             self.weight, self.variance_epsilon)
+            if residual is not None:
+                # Maintain the same return semantics as the non-FLASHINFER path:
+                # return the pre-norm sum as residual.
+                input_dtype = hidden_states.dtype
+                residual_out = (hidden_states.to(torch.float32)
+                                + residual.to(torch.float32)).to(input_dtype)
+                flashinfer_fused_add_rmsnorm(hidden_states,  # normalized in-place/return
+                                             residual,
+                                             self.weight,
+                                             self.variance_epsilon)
+                residual = residual_out
             else:
                 hidden_states = flashinfer_rmsnorm(hidden_states, self.weight,
                                                    self.variance_epsilon)
```
🧹 Nitpick comments (4)
tensorrt_llm/_torch/modules/rms_norm.py (4)

28-30: Make sentinel definition clearer and immutable.

Minor readability/typing tweak: define the alias before the constant and mark the sentinel Final so type-checkers treat it as immutable.

```diff
-from typing import Optional, Tuple, TypeAlias, Union, cast
+from typing import Final, Optional, Tuple, TypeAlias, Union, cast
@@
-    _ARGUMENT_NOT_SPECIFIED_SENTINEL = ...
-    _ArgumentNotSpecifiedSentinelType: TypeAlias = EllipsisType
+    _ArgumentNotSpecifiedSentinelType: TypeAlias = EllipsisType
+    _ARGUMENT_NOT_SPECIFIED_SENTINEL: Final[_ArgumentNotSpecifiedSentinelType] = ...
```

55-63: Streamline the "unspecified residual" logic.

Equivalent but tighter boolean makes intent obvious and avoids the extra flag mutation.

```diff
-        return_residual = True
-        if residual is self._ARGUMENT_NOT_SPECIFIED_SENTINEL:
-            return_residual = False
-            residual = None
+        return_residual = residual is not self._ARGUMENT_NOT_SPECIFIED_SENTINEL
+        if not return_residual:
+            residual = None
```
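The underlying sentinel contract can be demonstrated generically (names here are illustrative, not the RMSNorm API):

```python
_NOT_SPECIFIED = object()  # module-private sentinel, distinct from None


def forward(x, residual=_NOT_SPECIFIED):
    """Return just `x` when `residual` was not passed at all; return an
    (x, residual) tuple when the caller passed anything, including None."""
    if residual is _NOT_SPECIFIED:
        return x
    return x, residual
```

Using identity (`is`) against a private sentinel lets the function distinguish "argument omitted" from "argument explicitly None", which a plain `residual=None` default cannot do.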

76-79:Avoid an extra allocation in the fallback add.

You can do the add in-place on the FP32 buffer to reduce one temporary tensor.

```diff
-            if residual is not None:
-                hidden_states = hidden_states + residual.to(torch.float32)
-                residual = hidden_states.to(input_dtype)
+            if residual is not None:
+                hidden_states.add_(residual.to(torch.float32))
+                residual = hidden_states.to(input_dtype)
```

Note: ensure no autograd requirements/in-place aliasing concerns in this inference path.


110-116: Fix mutable default and Optional types in group_rms_norm signature.

weights uses a mutable default [], and eps/weight_bias are annotated Optional but never accept None. Clean this up to avoid surprises.

```diff
-def group_rms_norm(
-        inputs: list[torch.Tensor],
-        weights: Optional[list[torch.Tensor]] = [],
-        eps: Optional[float] = 1e-5,
-        weight_bias: Optional[float] = 0.0,
+def group_rms_norm(
+        inputs: list[torch.Tensor],
+        weights: Optional[list[torch.Tensor]] = None,
+        eps: float = 1e-5,
+        weight_bias: float = 0.0,
         kernel: GroupRMSNormKernelSelection = GroupRMSNormKernelSelection.heuristic,
         outputs: Optional[list[torch.Tensor]] = None) -> list[torch.Tensor]:
@@
-    out = outputs
+    out = outputs
     if out is None:
         out = [torch.empty_like(input) for input in inputs]
+    if weights is None:
+        weights = []
```
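The mutable-default pitfall this suggestion fixes is easy to reproduce in isolation:

```python
def append_shared(item, acc=[]):
    # BUG: the default list is created once at function definition time and
    # shared across every call that omits `acc`.
    acc.append(item)
    return acc


def append_fresh(item, acc=None):
    # Fix: use None as the default and allocate a new list per call.
    if acc is None:
        acc = []
    acc.append(item)
    return acc
```

The buggy variant accumulates state across calls, which is why `weights=[]` in a signature is a latent hazard even if current callers always pass a value.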
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 12e937b and 72ae320.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/modules/rms_norm.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in __init__
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/modules/rms_norm.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/modules/rms_norm.py
🧠 Learnings (3)
📚 Learning: 2025-08-27T14:41:56.613Z
Learnt from: ixlmar
PR: NVIDIA/TensorRT-LLM#7294
File: tensorrt_llm/_torch/modules/rms_norm.py:96-99
Timestamp: 2025-08-27T14:41:56.613Z
Learning: In tensorrt_llm/_torch/modules/rms_norm.py, the RMSNorm class uses a custom sentinel (_ARGUMENT_NOT_SPECIFIED_SENTINEL) instead of Ellipsis (...) for detecting unspecified optional arguments. Other modules in the codebase may use Ellipsis as a sentinel but do not forward it to RMSNorm methods, so there's no need for backward compatibility with Ellipsis in RMSNorm.

Applied to files:

  • tensorrt_llm/_torch/modules/rms_norm.py
📚 Learning: 2025-08-27T14:23:55.545Z
Learnt from: ixlmar
PR: NVIDIA/TensorRT-LLM#7294
File: tensorrt_llm/_torch/modules/rms_norm.py:17-17
Timestamp: 2025-08-27T14:23:55.545Z
Learning: The TensorRT-LLM project requires Python 3.10+ as evidenced by the use of TypeAlias from typing module, match/case statements, and union type | syntax throughout the codebase, despite some documentation still mentioning Python 3.8+.

Applied to files:

  • tensorrt_llm/_torch/modules/rms_norm.py
📚 Learning: 2025-08-25T22:42:47.612Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-25T22:42:47.612Z
Learning: Applies to **/*.py : Code must target Python 3.8+

Applied to files:

  • tensorrt_llm/_torch/modules/rms_norm.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/modules/rms_norm.py (3)
tensorrt_llm/_torch/models/modeling_utils.py (1)
  • skip_forward (148-159)
tensorrt_llm/_torch/modules/layer_norm.py (1)
  • skip_forward (102-120)
tensorrt_llm/_torch/modules/decoder_layer.py (1)
  • skip_forward (23-34)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tensorrt_llm/_torch/modules/rms_norm.py (3)

17-18: Python 3.10+ typing usage is fine; ensure packaging/CI reflect this floor.

Using TypeAlias and EllipsisType is correct given the project's effective Python ≥3.10 policy. Please confirm python_requires and CI images align (≥3.10) to avoid drift.


85-88: Return contract looks good.

Tuple is returned only when the caller explicitly provided a residual (including None), matching the sentinel contract.


93-101: skip_forward parity with sentinel handling is correct.

Behavior mirrors forward: unspecified → single tensor; specified (even None) → tuple. Good.

@ixlmar
Collaborator (Author)

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16697 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12532 completed with status: 'FAILURE'

@tensorrt-cicd
Collaborator

PR_Github #16712 [ run ] triggered by Bot

Contributor

@coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/pyexecutor/sampler.py (2)

769-783: Make generators per-device; current assert breaks multi-GPU

A single cached generator with a hard device-assert will fail when sampling across devices.

```diff
-        self._global_seed = 42
-        self._generator = None
+        self._global_seed = 42
+        self._generators: dict[torch.device, torch.Generator] = {}
@@
-    def get_generator(self, device: torch.device) -> torch.Generator:
+    def get_generator(self, device: torch.device) -> torch.Generator:
@@
-        if self._generator is None:
-            # Fallback to a default seed if not set
-            self._generator = torch.Generator(device=device)
-            self._generator.manual_seed(self._global_seed)
-        assert self._generator.device == device
-        return self._generator
+        gen = self._generators.get(device)
+        if gen is None:
+            gen = torch.Generator(device=device)
+            gen.manual_seed(self._global_seed)
+            self._generators[device] = gen
+        return gen
```

885-897: Use py_seq_slot consistently (fixes wrong writes with binding seq_slot)

TorchSampler elsewhere uses py_seq_slot; mixing in seq_slot can misroute writes.

```diff
-            new_tokens[i, request.seq_slot, self.BEAM] = new_token
+            new_tokens[i, request.py_seq_slot, self.BEAM] = new_token
@@
-            new_tokens[num_accepted, request.seq_slot, self.BEAM] = new_token
+            new_tokens[num_accepted, request.py_seq_slot, self.BEAM] = new_token
```

Also applies to: 899-904

♻️ Duplicate comments (3)
tensorrt_llm/_torch/pyexecutor/sampler.py (3)

374-397: Int32 softmax_indices and filtering behavior LGTM

Type hint and greedy softmax-skipping logic align with established TRT-LLM conventions.


1-2: Add required NVIDIA copyright header

Per repo guidelines, prepend the NVIDIA header.

Apply:

```diff
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+
 import dataclasses
 import enum
```

1018-1064: _apply_embedding_bias(): use id()-based dedup and Long indices for index_select

Tensor keys are unsafe/expensive; index_select requires Long indices.

```diff
-        # Indices of unique bias tensors
-        bias_to_index: dict[torch.Tensor,
-                            int] = defaultdict(provision_bias_index)
+        # Indices of unique bias tensors (group by identity, not contents)
+        bias_to_index: dict[int, int] = defaultdict(provision_bias_index)
+        biases: list[torch.Tensor] = []
@@
-            if req_bias is not None:
+            if req_bias is not None:
                 logits_bias_mask[i:(i + steps)] = True
-                req_bias_index = bias_to_index[req_bias]
+                key = id(req_bias)
+                req_bias_index = bias_to_index[key]
+                if req_bias_index == len(biases):
+                    biases.append(req_bias)
                 bias_gather_indices.extend(repeat(req_bias_index, steps))
@@
-        bias_gather_indices_cuda = torch.tensor(bias_gather_indices,
-                                                pin_memory=True,
-                                                dtype=torch.int32).to(
-                                                    logits.device,
-                                                    non_blocking=True)
+        bias_gather_indices_cuda = torch.tensor(
+            bias_gather_indices, pin_memory=True, dtype=torch.long
+        ).to(logits.device, non_blocking=True)
@@
-        biases_tensor = torch.empty((len(bias_to_index), *req_bias.shape),
-                                    pin_memory=True)
-        biases_tensor = torch.stack(
-            tuple(bias_to_index.keys()),
-            out=biases_tensor,
-        )
+        biases_tensor = torch.stack(
+            biases, out=torch.empty((len(biases), *req_bias.shape), pin_memory=True)
+        )
@@
-        biases_tensor_cuda = torch.index_select(biases_tensor_cuda, 0,
-                                                bias_gather_indices_cuda)
+        biases_tensor_cuda = torch.index_select(
+            biases_tensor_cuda, 0, bias_gather_indices_cuda
+        )
```
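The identity-based deduplication the review proposes can be illustrated without torch. This hypothetical helper assigns each (possibly repeated) object an index into a list of unique objects, grouping by `id()` rather than by contents:

```python
def dedup_by_identity(items):
    """Assign each object an index into a list of unique objects, grouping
    by identity (id()) rather than by contents; a pure-Python sketch of the
    dedup suggested for _apply_embedding_bias().

    Returns (unique_items, gather_indices) such that
    [unique_items[i] for i in gather_indices] reproduces `items`.
    """
    index_of: dict[int, int] = {}
    unique: list = []
    gather: list[int] = []
    for obj in items:
        key = id(obj)
        idx = index_of.get(key)
        if idx is None:
            # First time we see this object: provision a new index.
            idx = len(unique)
            index_of[key] = idx
            unique.append(obj)
        gather.append(idx)
    return unique, gather
```

Note that two objects with equal contents but distinct identities get distinct indices; that is the point of keying on `id()`, since it never scans the payload.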
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)

39-56: Confirm minimum supported Python version (uses 3.10+ features)

This file uses kw_only dataclasses, PEP 634 pattern matching (match/case), and PEP 604 unions (X | Y), which require Python 3.10+. Repo guideline says “Python 3.8+”; please confirm runtime targets 3.10+ or adjust.

Would you like a codemod to replace match/case with if/elif and kw_only with runtime-safe alternatives if 3.8/3.9 must be supported?

Also applies to: 119-133, 374-397

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 72ae320 and fae40c5.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/pyexecutor/sampler.py (24 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in __init__
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
🧠 Learnings (9)
📚 Learning: 2025-08-25T22:42:47.612Z
Learnt from: CR (PR: NVIDIA/TensorRT-LLM#0, File: CODING_GUIDELINES.md:0-0, Timestamp: 2025-08-25T22:42:47.612Z). Learning: Applies to **/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py} : Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam (PR: NVIDIA/TensorRT-LLM#6487, File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12, Timestamp: 2025-08-06T13:58:07.506Z). Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-27T15:03:57.127Z
Learnt from: ixlmar (PR: NVIDIA/TensorRT-LLM#7294, File: tensorrt_llm/_torch/pyexecutor/sampler.py:368-392, Timestamp: 2025-08-27T15:03:57.127Z). Learning: In TensorRT-LLM's sampler.py, int32 usage for softmax_indices and related tensor indexing is intentional and should not be changed to int64. The torch.IntTensor type hint is correct for the sample() function's softmax_indices parameter.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-27T14:52:26.324Z
Learnt from: ixlmar (PR: NVIDIA/TensorRT-LLM#7294, File: tensorrt_llm/_torch/pyexecutor/sampler.py:985-1064, Timestamp: 2025-08-27T14:52:26.324Z). Learning: In TensorRT-LLM's _apply_embedding_bias() method, tensor deduplication should use id() instead of the tensor itself as a dict key to avoid scanning tensor contents and for better performance.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-27T14:47:49.339Z
Learnt from: ixlmar (PR: NVIDIA/TensorRT-LLM#7294, File: tensorrt_llm/_torch/pyexecutor/sampler.py:1229-1238, Timestamp: 2025-08-27T14:47:49.339Z). Learning: torch.Tensor.dim_order is a public API in PyTorch that returns the dimension order of a tensor, documented at https://docs.pytorch.org/docs/stable/generated/torch.Tensor.dim_order.html. It supports an ambiguity_check parameter that can be a boolean or list of memory formats.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-11T20:09:24.389Z
Learnt from: achartier (PR: NVIDIA/TensorRT-LLM#6763, File: tests/integration/defs/triton_server/conftest.py:16-22, Timestamp: 2025-08-11T20:09:24.389Z). Learning: In the TensorRT-LLM test infrastructure, the team prefers simple, direct solutions (like hard-coding directory traversal counts) over more complex but robust approaches when dealing with stable directory structures. They accept the maintenance cost of updating tests if the layout changes.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-21T00:16:56.457Z
Learnt from: farshadghodsian (PR: NVIDIA/TensorRT-LLM#7101, File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36, Timestamp: 2025-08-21T00:16:56.457Z). Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu (PR: NVIDIA/TensorRT-LLM#6303, File: tests/integration/test_lists/qa/examples_test_list.txt:494-494, Timestamp: 2025-07-28T17:06:08.621Z). Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia (PR: NVIDIA/TensorRT-LLM#6506, File: examples/models/core/mixtral/requirements.txt:3-3, Timestamp: 2025-08-01T15:14:45.673Z). Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (3)
tensorrt_llm/_torch/pyexecutor/llm_request.py (2)
  • LlmRequest (282-424)
  • get_draft_token_length (558-569)
tensorrt_llm/_torch/speculative/mtp.py (1)
  • sample_async (271-318)
tensorrt_llm/_torch/pyexecutor/scheduler.py (2)
  • ScheduledRequests (18-39)
  • all_requests (38-39)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/sampler.py

1359-1359: Line too long (125 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)

1235-1244: dim_order() usage OK

Public API; assertions ensure canonical layout before view/scatter.

Contributor

@coderabbitai coderabbitai[bot] left a comment
Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)

885-897: Use py_seq_slot consistently; seq_slot may be different/undefined here.

Other TorchSampler paths use py_seq_slot. Mixing the two risks indexing the wrong slot or an AttributeError.

Apply:

```diff
-            new_tokens[i, request.seq_slot, self.BEAM] = new_token
+            new_tokens[i, request.py_seq_slot, self.BEAM] = new_token
@@
-            new_tokens[num_accepted, request.seq_slot, self.BEAM] = new_token
+            new_tokens[num_accepted, request.py_seq_slot, self.BEAM] = new_token
```
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)

1-2: Add required NVIDIA copyright header (current year).

Repository guideline requires the header on all source files.

Apply:

```diff
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+
 import dataclasses
 import enum
```
🧹 Nitpick comments (5)
tensorrt_llm/_torch/pyexecutor/sampler.py (5)

184-191: Guard top-k against k > vocab_size.

Prevents runtime errors when configs overspecify k.

Apply:

```diff
-    if top_k > 0:
-        values, indices = torch.topk(logits, top_k, dim=-1)
+    if top_k > 0:
+        top_k = min(top_k, vocab_size)
+        values, indices = torch.topk(logits, top_k, dim=-1)
```

249-256: Same guard for top-k in top-k+top-p path.

Apply:

```diff
-    if top_k > 0:
-        values, _ = torch.topk(logits, top_k, dim=-1)
+    if top_k > 0:
+        top_k = min(top_k, vocab_size)
+        values, _ = torch.topk(logits, top_k, dim=-1)
```
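The clamp both suggestions ask for can be shown in a pure-Python sketch; `heapq.nlargest` is a hypothetical stand-in for `torch.topk`, and ties at the threshold are all kept:

```python
import heapq


def top_k_filter(logits, top_k, fill=float("-inf")):
    """Keep only the top_k largest logits, masking the rest with `fill`.

    Pure-Python sketch: clamping k to the vocabulary size mirrors the
    suggested guard, so an overspecified k cannot raise.
    """
    vocab_size = len(logits)
    k = min(top_k, vocab_size)  # the guard under discussion
    # Smallest value that still belongs to the top-k set.
    threshold = min(heapq.nlargest(k, logits))
    return [x if x >= threshold else fill for x in logits]
```

With the clamp in place, a config that requests k = 50 against a 3-entry vocabulary simply degrades to keeping everything instead of erroring out.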

778-783: Make generator device-robust (multi-GPU).

Reinitialize when device changes instead of asserting.

Apply:

```diff
-        if self._generator is None:
-            # Fallback to a default seed if not set
-            self._generator = torch.Generator(device=device)
-            self._generator.manual_seed(self._global_seed)
-        assert self._generator.device == device
+        if self._generator is None or self._generator.device != device:
+            self._generator = torch.Generator(device=device)
+            self._generator.manual_seed(self._global_seed)
         return self._generator
```

1026-1066: Prefer id()-based dedup for clarity; avoid relying on tensor keys.

Identity hashing of tensors is fine, but explicit id() clarifies intent and decouples from tensor-object key semantics.

Apply:

```diff
-        bias_to_index: dict[torch.Tensor,
-                            int] = defaultdict(provision_bias_index)
+        bias_to_index: dict[int, int] = defaultdict(provision_bias_index)
+        biases: list[torch.Tensor] = []
@@
-            if req_bias is not None:
+            if req_bias is not None:
                 logits_bias_mask[i:(i + steps)] = True
-                req_bias_index = bias_to_index[req_bias]
+                key = id(req_bias)
+                req_bias_index = bias_to_index[key]
+                if req_bias_index == len(biases):
+                    biases.append(req_bias)
                 bias_gather_indices.extend(repeat(req_bias_index, steps))
@@
-        biases_tensor = torch.empty((len(bias_to_index), *req_bias.shape),
-                                    pin_memory=True)
-        biases_tensor = torch.stack(
-            tuple(bias_to_index.keys()),
-            out=biases_tensor,
-        )
+        biases_tensor = torch.empty((len(biases), *req_bias.shape), pin_memory=True)
+        biases_tensor = torch.stack(tuple(biases), out=biases_tensor)
```

52-56: Annotate optional fields as Optional[...] for type correctness.

Current defaults are None but annotations are non-optional.

Apply:

```diff
 class SampleState:
     scheduled_requests: ScheduledRequests
-
-    device: SampleStateTensors = None
-    host: SampleStateTensors = None
-
-    sampler_event: torch.cuda.Event = None
+    device: Optional[SampleStateTensors] = None
+    host: Optional[SampleStateTensors] = None
+    sampler_event: Optional[torch.cuda.Event] = None
```
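The annotation fix being suggested can be sketched with placeholder field types (list/dict/object stand in for the real ScheduledRequests, SampleStateTensors, and torch.cuda.Event, which are not reproduced here):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SampleState:
    """Sketch of the suggested fix: fields that default to None are
    annotated Optional[...] instead of the bare (non-optional) type.
    Field types are placeholders, not the real TRT-LLM classes."""

    scheduled_requests: list
    device: Optional[dict] = None
    host: Optional[dict] = None
    sampler_event: Optional[object] = None
```

Static type checkers then accept the `None` defaults instead of flagging an `Optional`/non-`Optional` mismatch.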
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between fae40c5 and 81660f3.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/pyexecutor/sampler.py (24 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in __init__
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
🧠 Learnings (9)
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam (PR: NVIDIA/TensorRT-LLM#6487, File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12, Timestamp: 2025-08-06T13:58:07.506Z). Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-25T22:42:47.612Z
Learnt from: CR (PR: NVIDIA/TensorRT-LLM#0, File: CODING_GUIDELINES.md:0-0, Timestamp: 2025-08-25T22:42:47.612Z). Learning: Applies to **/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py} : Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-27T15:03:57.127Z
Learnt from: ixlmar (PR: NVIDIA/TensorRT-LLM#7294, File: tensorrt_llm/_torch/pyexecutor/sampler.py:368-392, Timestamp: 2025-08-27T15:03:57.127Z). Learning: In TensorRT-LLM's sampler.py, int32 usage for softmax_indices and related tensor indexing is intentional and should not be changed to int64. The torch.IntTensor type hint is correct for the sample() function's softmax_indices parameter.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-27T14:52:26.324Z
Learnt from: ixlmar (PR: NVIDIA/TensorRT-LLM#7294, File: tensorrt_llm/_torch/pyexecutor/sampler.py:985-1064, Timestamp: 2025-08-27T14:52:26.324Z). Learning: In TensorRT-LLM's _apply_embedding_bias() method, tensor deduplication should use id() instead of the tensor itself as a dict key to avoid scanning tensor contents and for better performance.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-27T14:47:49.339Z
Learnt from: ixlmar (PR: NVIDIA/TensorRT-LLM#7294, File: tensorrt_llm/_torch/pyexecutor/sampler.py:1229-1238, Timestamp: 2025-08-27T14:47:49.339Z). Learning: torch.Tensor.dim_order is a public API in PyTorch that returns the dimension order of a tensor, documented at https://docs.pytorch.org/docs/stable/generated/torch.Tensor.dim_order.html. It supports an ambiguity_check parameter that can be a boolean or list of memory formats.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-11T20:09:24.389Z
Learnt from: achartier (PR: NVIDIA/TensorRT-LLM#6763, File: tests/integration/defs/triton_server/conftest.py:16-22, Timestamp: 2025-08-11T20:09:24.389Z). Learning: In the TensorRT-LLM test infrastructure, the team prefers simple, direct solutions (like hard-coding directory traversal counts) over more complex but robust approaches when dealing with stable directory structures. They accept the maintenance cost of updating tests if the layout changes.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-21T00:16:56.457Z
Learnt from: farshadghodsian (PR: NVIDIA/TensorRT-LLM#7101, File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36, Timestamp: 2025-08-21T00:16:56.457Z). Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu (PR: NVIDIA/TensorRT-LLM#6303, File: tests/integration/test_lists/qa/examples_test_list.txt:494-494, Timestamp: 2025-07-28T17:06:08.621Z). Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-27T15:37:38.228Z
Learnt from: ixlmar (PR: NVIDIA/TensorRT-LLM#7294, File: tensorrt_llm/_torch/pyexecutor/sampler.py:991-1070, Timestamp: 2025-08-27T15:37:38.228Z). Learning: PyTorch tensors are hashable as dict keys because hash(torch.Tensor) is equivalent to id(torch.Tensor) and uses identity-based hashing, not content-based hashing. This means tensor hashing doesn't scan tensor contents and tensors can be safely used as dictionary keys.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (3)
tensorrt_llm/_torch/pyexecutor/llm_request.py (2)
  • LlmRequest (282-424)
  • get_draft_token_length (558-569)
tensorrt_llm/_torch/speculative/mtp.py (1)
  • sample_async (271-318)
tensorrt_llm/_torch/pyexecutor/scheduler.py (2)
  • ScheduledRequests (18-39)
  • all_requests (38-39)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/sampler.py

1362-1362: Line too long (125 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/_torch/pyexecutor/sampler.py (2)

359-371: Int32 indexing usage is consistent with established sampler design.

Acknowledging prior discussion: keeping int32 for index tensors here is intentional and works in this codepath.

Also applies to: 282-292, 1188-1194, 1301-1305, 430-488, 504-523, 1238-1247, 1260-1267


1229-1247: Use of torch.Tensor.dim_order is valid.

Public API; check is correct for canonical row-major ordering.

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16712 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12543 completed with status: 'FAILURE'

@ixlmar
Copy link
CollaboratorAuthor

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16722 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16722 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12552 completed with status: 'FAILURE'

Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
auto-merge was automatically disabled September 22, 2025 09:39

Head branch was pushed to by a user without write access

@ixlmar
Copy link
CollaboratorAuthor

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19563 [ run ] triggered by Bot

Collaborator

@DomBrown left a comment

LGTM

Collaborator

@Superjomn left a comment

LGTM on the llmapi changes.

Collaborator

@syuoni left a comment

LGTM

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19563 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14707 completed with status: 'FAILURE'

@chzblych
Copy link
Collaborator

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19638 [ run ] triggered by Bot

@ixlmar
Copy link
CollaboratorAuthor

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19666 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19638 [ run ] completed with state ABORTED
LLM/main/L0_MergeRequest_PR #14773 (Blue Ocean) completed with status: ABORTED

@dcampora enabled auto-merge (squash) September 23, 2025 09:03
@tensorrt-cicd
Copy link
Collaborator

PR_Github #19666 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14799 completed with status: 'SUCCESS'

@dcampora merged commit 9970345 into NVIDIA:main on Sep 23, 2025
5 checks passed
@ixlmar deleted the feat/batched-sampling branch September 24, 2025 08:16

Reviewers

@Funatiq left review comments

@coderabbitai[bot] left review comments

@Superjomn approved these changes

@dcampora approved these changes

@DomBrown approved these changes

@litaotju approved these changes

@lucaslie approved these changes

@byshiue approved these changes

@syuoni approved these changes

@QiJune approved these changes

@schetlur-nv approved these changes

Awaiting requested review from @netanel-haber (code owner automatically assigned from NVIDIA/trt-llm-noncommitted-api-review-committee)

Awaiting requested review from @kris1025 (code owner automatically assigned from NVIDIA/trt-llm-torch-spec-decoding)

Awaiting requested review from @achartier (code owner automatically assigned from NVIDIA/trt-llm-torch-runtime-devs)

Awaiting requested review from @Wanli-Jiang (automatically assigned from NVIDIA/trt-llm-torch-models-devs)

Assignees

No one assigned

Labels

None yet

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

13 participants

@ixlmar @tensorrt-cicd @chzblych @Superjomn @dcampora @DomBrown @litaotju @lucaslie @byshiue @Funatiq @syuoni @QiJune @schetlur-nv
