[TRTLLM-7728][feat] batched sampling by strategy (supersedes enable_mixed_sampler, cf. TRTLLM-7156) #7294


Merged
dcampora merged 1 commit into NVIDIA:main from ixlmar:feat/batched-sampling
Sep 23, 2025

Conversation

@ixlmar (Collaborator) commented Aug 27, 2025 (edited by coderabbitai bot)

Description

  • This PR removes the enable_mixed_sampler option, including the "fast_path" code path controlled by it. At the same time, the changes aim to improve the performance of the general code path so that greedy sampling without drafting, previously covered by "fast_path", sees no significant degradation.

  • Removing enable_mixed_sampler also avoids the non-intuitive behavior evidenced in "Investigate TRTLLM runtime repetitive issue" #5254.

  • Moreover, the revised code streamlines request processing via batching for non-greedy sampling and in the presence of draft tokens, i.e., in situations not previously covered by "fast_path".

  • This PR opportunistically vectorizes TorchSampler._apply_embedding_bias.

  • This PR opportunistically skips the softmax computation when it is not needed (e.g., for greedy sampling).

  • It adds @torch.inference_mode() in TorchSampler.

  • Lastly, the PR includes some minor code improvements (type annotations, etc.).
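The core idea described above, grouping requests that share a sampling strategy so each distinct strategy costs one batched call instead of one call per request, can be sketched roughly as follows. All names here are illustrative, not the PR's actual API:

```python
from collections import defaultdict

import torch

def group_by_strategy(strategies: list) -> dict:
    """Map each distinct strategy tuple to the request indices using it."""
    groups = defaultdict(list)
    for idx, strategy in enumerate(strategies):
        groups[strategy].append(idx)
    return dict(groups)

def sample_batched(logits: torch.Tensor, strategies: list) -> torch.Tensor:
    """Sample one token per row, with one batched call per strategy group."""
    tokens = torch.empty(logits.size(0), dtype=torch.long)
    for strategy, rows in group_by_strategy(strategies).items():
        idx = torch.tensor(rows)
        group_logits = logits[idx]
        if strategy[0] == "greedy":
            tokens[idx] = group_logits.argmax(dim=-1)
        else:  # e.g. ("top_p", p): simplistic multinomial stand-in
            probs = torch.softmax(group_logits, dim=-1)
            tokens[idx] = torch.multinomial(probs, num_samples=1).squeeze(-1)
    return tokens
```

With this layout the number of sampling kernel launches scales with the number of distinct strategies in the batch rather than with the number of requests.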

Test Coverage

  • Updated existing tests to account for the enable_mixed_sampler option removal.
  • Since the PR contains mostly non-functional changes (performance and general code improvements), existing test coverage is assumed to be sufficient.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test [pipeline-id] (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL): Disable fail-fast on build/test/infra failures.

--skip-test (OPTIONAL): Skip all test stages, but still run build, package, and sanity-check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL): Skip test stages that don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL): Force-run the multi-GPU tests in addition to the L0 pre-merge pipeline.

--post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline plus the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL): Enable flushing all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

Summary by CodeRabbit

  • New Features

    • Batched sampling for parallel request decoding and a combined Top-K+Top-P sampling strategy with per-request dispatch.
  • Refactor

    • Removed mixed-sampler user configuration and related flags from public configs and init paths.
    • Streamlined sampler, request-drafting, and logits-handling pipelines; per-request draft-logit policy is now sampler-driven.
    • RMS normalization now returns residual only when explicitly requested.
  • Tests

    • Unit/integration test IDs and parameterizations updated to match the new sampling flow.
  • Chores

    • Minimum Python requirement raised to 3.10.
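The "residual only when explicitly requested" behavior listed above relies on a sentinel default argument, a common Python pattern for distinguishing "argument omitted" from "argument explicitly None". A minimal illustrative sketch with hypothetical names (the PR's actual sentinel is `_ARGUMENT_NOT_SPECIFIED_SENTINEL` in `rms_norm.py`):

```python
_NOT_SPECIFIED = object()  # module-private sentinel, distinct from None

def rms_norm_like(x: list, residual=_NOT_SPECIFIED):
    """Return the normed values; also return a residual only when one was passed."""
    scale = (sum(v * v for v in x) / len(x)) ** 0.5 or 1.0
    normed = [v / scale for v in x]
    if residual is _NOT_SPECIFIED:
        return normed  # caller did not ask for a residual
    if residual is None:
        residual = [0.0] * len(x)
    return normed, [v + r for v, r in zip(x, residual)]
```

Unlike the previous Ellipsis-based scheme, a private `object()` sentinel cannot collide with any value a caller could legitimately pass.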

@coderabbitai (Contributor) commented Aug 27, 2025 (edited)

📝 Walkthrough

Walkthrough

Removed the public enable_mixed_sampler flag across configs and call sites; refactored the Torch sampler into a batched, per-strategy sampling pipeline with new index translators and sample() APIs; ModelDrafter always routes logits through the sampler and defers the draft-prob policy to the sampler; RMSNorm replaces the Ellipsis-based residual with an explicit sentinel.

Changes

Cohort / File(s) — Summary

Remove enable_mixed_sampler option
tensorrt_llm/_torch/auto_deploy/llm_args.py, tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py, tensorrt_llm/_torch/pyexecutor/_util.py, tensorrt_llm/_torch/pyexecutor/config.py, tensorrt_llm/llmapi/llm_args.py, tensorrt_llm/scaffolding/worker.py, tests/unittest/llmapi/test_llm_pytorch.py
Deletes the enable_mixed_sampler field/argument across public configs, executor/util wiring, scaffolding, and test parameterization; call sites are updated to omit the flag; TorchSampler.Args no longer includes it.

Sampler batched refactor & API additions
tensorrt_llm/_torch/pyexecutor/sampler.py
Major refactor adding per-strategy grouping, batched sampling helpers (torch_multi_arange, the StepIndexTranslator family), _BatchedSamplingResult, a new sample() API, embedding-bias application, a batched pack/unpack flow, a new Strategy variant TopKTopP, should_provide_draft_probs, and removal of TorchSampler.Args.enable_mixed_sampler.

Speculative decoding / drafter integration
tensorrt_llm/_torch/speculative/model_drafter.py, tensorrt_llm/_torch/speculative/eagle3.py
ModelDrafter always routes logits through the sampler, builds the draft LlmRequest with exclude_last_generation_logits=True, uses sampler.should_provide_draft_probs(request) for the draft-prob policy, and adds TYPE_CHECKING imports for type-only hints.

RMSNorm residual sentinel change
tensorrt_llm/_torch/modules/rms_norm.py
Replaces the Ellipsis-based optional residual with _ARGUMENT_NOT_SPECIFIED_SENTINEL; updates forward/skip_forward signatures, presence checks, dtype handling, and conditional residual return behavior.

LlmRequest typing tweak
tensorrt_llm/_torch/pyexecutor/llm_request.py
Adds the static type annotation Optional[torch.Tensor] for the private _py_embedding_bias_1d; runtime initialization unchanged.

Tests: single-path TorchSampler test
tests/unittest/llmapi/test_llm_pytorch.py
Removes parameterization over enable_mixed_sampler and enable_logprobs; the test is simplified to a single TorchSampler path.

Test-list backend tag renames (YAML)
tests/integration/test_lists/test-db/l0_dgx_b200.yml, tests/integration/test_lists/test-db/l0_dgx_h100.yml
Renames backend test identifiers/tags (uppercase → lowercase / new backend-tag names) for specific GPTOSS entries; pure test-id string updates.

Packaging requirement bump
setup.py
Raises python_requires from ">=3.7, <4" to ">=3.10, <4".

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client
  participant TRTExec as TRT Executor
  participant Sampler as TorchSampler
  participant Gen as Generator
  participant Req as LlmRequest[]
  Client->>TRTExec: sample_async(requests)
  TRTExec->>Sampler: _process_requests(requests, logits, states)
  Sampler->>Sampler: _group_requests_by_sampling_strategy(requests)
  Sampler->>Gen: get_generator(device)
  par Per-strategy batched sampling
    Sampler->>Sampler: pack indices (torch_multi_arange / indexers)
    alt embedding-bias present
      Sampler->>Sampler: _apply_embedding_bias
    end
    Sampler->>Sampler: sample(strategy, logits_batch, generator, softmax_indices?)
  end
  Sampler-->>TRTExec: _BatchedSamplingResult (tokens, softmax, mappings)
  TRTExec->>TRTExec: unbatch -> update per-request states
  TRTExec-->>Client: next tokens (+ optional probs)

sequenceDiagram
  autonumber
  participant Drafter as ModelDrafter
  participant Sampler as TorchSampler
  participant Req as LlmRequest(draft)
  Drafter->>Sampler: should_provide_draft_probs(request)
  Drafter->>Req: create(exclude_last_generation_logits=True, return_generation_logits=from sampler)
  Drafter->>Sampler: sample_async(sample_state)
  Sampler-->>Drafter: draft tokens (+/- draft probs)
  Drafter->>Drafter: forward draft logits into target request (py_draft_logits)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested reviewers

  • mikeiovine
  • nv-guomingz
  • netanel-haber
  • QiJune
  • syuoni

coderabbitai bot (Contributor) left a comment:

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)

1-8: Missing NVIDIA license header

Add the standard 2025 NVIDIA SPDX header.

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
tensorrt_llm/_torch/modules/rms_norm.py (1)

56-62: FlashInfer path returns a stale residual; semantics diverge from the non-Flash path

When residual is provided, non-Flash returns (normed, input+residual). FlashInfer currently returns (normed, original residual). This breaks downstream skip/residual chaining.

Apply this fix to compute and return the updated residual consistently:

         if IS_FLASHINFER_AVAILABLE:
             from ..custom_ops import (flashinfer_fused_add_rmsnorm,
                                       flashinfer_rmsnorm)
-            if residual is not None:
-                flashinfer_fused_add_rmsnorm(hidden_states, residual,
-                                             self.weight, self.variance_epsilon)
+            if residual is not None:
+                # Keep semantics consistent with the non-Flash path:
+                # residual_out = (input + residual) in input dtype.
+                input_dtype = hidden_states.dtype
+                new_residual = (hidden_states.to(torch.float32) +
+                                residual.to(torch.float32)).to(input_dtype)
+                flashinfer_fused_add_rmsnorm(hidden_states, residual,
+                                             self.weight, self.variance_epsilon)
+                residual = new_residual
             else:
                 hidden_states = flashinfer_rmsnorm(hidden_states, self.weight,
                                                    self.variance_epsilon)
@@
-        if residual is None:
+        if residual is None:
             return hidden_states
         else:
             return hidden_states, residual

Also applies to: 74-78

tensorrt_llm/_torch/pyexecutor/_util.py (2)

1-8: Missing NVIDIA license header

Add the standard 2025 NVIDIA SPDX header.

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.+# SPDX-License-Identifier: Apache-2.0+

576-587: Remove stale enable_mixed_sampler reference in API stability tests

The create_torch_sampler_args signature change is correctly reflected in code, and all internal callers have been updated. However, the API-stability specification still lists the removed enable_mixed_sampler field:

  • tests/unittest/api_stability/references/llm.yaml: lines 110-112 reference enable_mixed_sampler

Please remove or update that entry so the stability tests remain in sync with the new TorchSampler.Args.

🧹 Nitpick comments (6)
tensorrt_llm/_torch/speculative/eagle3.py (1)

17-19: TYPE_CHECKING import: LGTM

The static-only import avoids runtime dependency cycles. Optionally enable postponed annotations to drop the quotes.

Apply this small tweak (optional):

+from __future__ import annotations
tests/unittest/llmapi/test_llm_pytorch.py (1)

229-233: Broaden coverage: parametrize enable_logprobs for both True/False

The current test only covers logprobs=True. Include False to exercise the non-logprobs path with embedding_bias.

Apply:

-@pytest.mark.parametrize(
-    "enable_logprobs",
-    [True],
-)
+@pytest.mark.parametrize("enable_logprobs", [False, True])

Also applies to: 253-259

tensorrt_llm/_torch/pyexecutor/sampler.py (4)

166-192: Consider extracting magic values as constants

The batched sampling functions use hardcoded default values. Consider extracting them for maintainability.

+# Sampling defaults
+DEFAULT_TOP_K = 50
+DEFAULT_TOP_P = 0.9
+DEFAULT_TEMPERATURE = 1.0
+MIN_TEMPERATURE = 1e-5
+
 def top_k_sampling_batch(
     logits,
-    top_k=50,
+    top_k=DEFAULT_TOP_K,
     generator: Optional[torch.Generator] = None
 ) -> tuple[torch.Tensor, torch.Tensor]:

 def top_p_sampling_batch(
     logits: torch.Tensor,
-    top_p: float = 0.9,
-    temperature: float = 1.0,
+    top_p: float = DEFAULT_TOP_P,
+    temperature: float = DEFAULT_TEMPERATURE,
     generator: Optional[torch.Generator] = None
 ) -> tuple[torch.Tensor, torch.Tensor]:

353-361: Add type hints for improved maintainability

The function would benefit from explicit return type hints.

 def _group_requests_by_sampling_strategy(
         requests: Iterable[LlmRequest]) -> dict[Strategy, torch.IntTensor]:
     strategy_dict: dict[Strategy, list[int]] = defaultdict(list)

1052-1190: Well-structured batched sampling pipeline

The _sample_batched_by_strategy method effectively groups requests by strategy and processes them in batches. The handling of draft-logit indices for speculation is properly tracked.

One concern: per the FIXME comment, the d2t validation at lines 1071-1075 might be better placed in ModelDrafter.prepare_draft_tokens, maintaining a cleaner separation of concerns.


1332-1333: Fix line length violations

These lines exceed the 120-character limit.

-            #    range(next_context_req_offsets_cuda[i] - req_num_steps_fictitious_cuda[i], next_context_req_offsets_cuda[i])
-            # And if generation requests are present, those tensors include a trailing entries selecting all tokens/logits
+            #    range(next_context_req_offsets_cuda[i] - req_num_steps_fictitious_cuda[i],
+            #          next_context_req_offsets_cuda[i])
+            # And if generation requests are present, those tensors include trailing
+            # entries selecting all tokens/logits
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 8b21613 and 91e916f.

📒 Files selected for processing (12)
  • tensorrt_llm/_torch/auto_deploy/llm_args.py (0 hunks)
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (0 hunks)
  • tensorrt_llm/_torch/modules/rms_norm.py (3 hunks)
  • tensorrt_llm/_torch/pyexecutor/_util.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/config.py (0 hunks)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/sampler.py (23 hunks)
  • tensorrt_llm/_torch/speculative/eagle3.py (2 hunks)
  • tensorrt_llm/_torch/speculative/model_drafter.py (6 hunks)
  • tensorrt_llm/llmapi/llm_args.py (0 hunks)
  • tensorrt_llm/scaffolding/worker.py (0 hunks)
  • tests/unittest/llmapi/test_llm_pytorch.py (2 hunks)
💤 Files with no reviewable changes (5)
  • tensorrt_llm/scaffolding/worker.py
  • tensorrt_llm/_torch/pyexecutor/config.py
  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/_torch/auto_deploy/llm_args.py
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in __init__
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/speculative/eagle3.py
  • tests/unittest/llmapi/test_llm_pytorch.py
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/_torch/modules/rms_norm.py
  • tensorrt_llm/_torch/pyexecutor/sampler.py
  • tensorrt_llm/_torch/speculative/model_drafter.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/speculative/eagle3.py
  • tests/unittest/llmapi/test_llm_pytorch.py
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/_torch/modules/rms_norm.py
  • tensorrt_llm/_torch/pyexecutor/sampler.py
  • tensorrt_llm/_torch/speculative/model_drafter.py
🧬 Code graph analysis (6)
tensorrt_llm/_torch/speculative/eagle3.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
  • EagleDecodingConfig (418-444)
tests/unittest/llmapi/test_llm_pytorch.py (1)
tests/unittest/llmapi/test_llm.py (1)
  • llm_test_harness (109-136)
tensorrt_llm/_torch/pyexecutor/_util.py (1)
tensorrt_llm/_torch/attention_backend/trtllm.py (2)
  • max_seq_len (558-568)
  • max_seq_len (571-575)
tensorrt_llm/_torch/modules/rms_norm.py (4)
tensorrt_llm/_torch/custom_ops/flashinfer_custom_ops.py (2)
  • flashinfer_fused_add_rmsnorm (32-35)
  • flashinfer_rmsnorm (21-23)
tensorrt_llm/_torch/models/modeling_utils.py (1)
  • skip_forward (148-159)
tensorrt_llm/_torch/modules/layer_norm.py (1)
  • skip_forward (102-120)
tensorrt_llm/_torch/modules/decoder_layer.py (1)
  • skip_forward (23-34)
tensorrt_llm/_torch/pyexecutor/sampler.py (3)
tensorrt_llm/_torch/pyexecutor/llm_request.py (4)
  • LlmRequest (282-422)
  • append (79-98)
  • append (125-142)
  • get_draft_token_length (556-567)
tensorrt_llm/_torch/speculative/mtp.py (1)
  • sample_async (271-318)
tensorrt_llm/_torch/pyexecutor/scheduler.py (2)
  • ScheduledRequests (18-39)
  • all_requests (38-39)
tensorrt_llm/_torch/speculative/model_drafter.py (5)
tensorrt_llm/_torch/pyexecutor/sampler.py (20)
  • Sampler (52-83)
  • SampleState (43-49)
  • should_provide_draft_probs (81-83)
  • should_provide_draft_probs (1413-1415)
  • beam_width (72-75)
  • is_generation_model (78-79)
  • is_generation_model (109-110)
  • is_generation_model (162-163)
  • is_generation_model (718-719)
  • is_generation_model (1449-1450)
  • sample_async (63-65)
  • sample_async (93-96)
  • sample_async (134-141)
  • sample_async (940-956)
  • sample_async (1596-1684)
  • update_requests (68-69)
  • update_requests (99-106)
  • update_requests (144-159)
  • update_requests (901-925)
  • update_requests (1688-1701)
tensorrt_llm/_torch/pyexecutor/scheduler.py (2)
  • ScheduledRequests (18-39)
  • all_requests (38-39)
tensorrt_llm/_torch/pyexecutor/llm_request.py (4)
  • LlmRequest (282-422)
  • append (79-98)
  • append (125-142)
  • generation_logits (214-223)
tensorrt_llm/_torch/pyexecutor/handle_logits.py (1)
  • HandleLogits (11-81)
tensorrt_llm/_torch/speculative/mtp.py (2)
  • sample_async (271-318)
  • update_requests (245-269)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/sampler.py

1332-1332: Line too long (125 > 120)

(E501)


1333-1333: Line too long (122 > 120)

(E501)

🔇 Additional comments (18)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)

366-373: Type annotation addition: LGTM

The optional 1D cache for the embedding bias is clearly typed and initialized.

tensorrt_llm/_torch/modules/rms_norm.py (1)

49-53: No Ellipsis passed to RMSNorm residual
I ran targeted searches for any use of the literal ... passed, either positionally or via residual=..., to RMSNorm (and its implicit __call__ to forward) and found zero occurrences. This confirms that no call sites propagate Ellipsis into the new residual=None API. The change is safe to merge.

tensorrt_llm/_torch/pyexecutor/_util.py (1)

593-596: No change needed: engine.max_seq_len already covers both KV-cache and no-cache cases

The max_seq_len property on TRTLLMEngine is implemented to:

  • return the KVCacheManager's maximum sequence length when KV caching is enabled
  • fall back to the internally stored _max_seq_len_storage for no-cache attention

Thus, passing engine.max_seq_len into create_torch_sampler_args already applies the correct cap in both scenarios; no further modifications are required here.

Likely an incorrect or invalid review comment.

tensorrt_llm/_torch/speculative/model_drafter.py (6)

59-59: LGTM! Improved clarity

The error message change from an f-string to a plain string is appropriate since there are no variables to interpolate.


85-86: Good comment explaining overlap scheduling

The comment explaining exclude_last_generation_logits=True for overlap scheduling provides valuable context.


268-285: Simplified sampling logic with proper error handling

The refactored _sample_async method:

  1. Correctly computes num_context_logits_prefix_sum, accounting for py_return_context_logits
  2. Properly delegates to HandleLogits and the sampler's sample_async
  3. Has appropriate error handling that returns None on failure

The removal of optional sampler handling simplifies the control flow.


300-301: LGTM! Simplified request update logic

The removal of the None-check guard for sampler is consistent with the overall simplification where the sampler is now always required.


317-317: Properly handles None propagation for draft logits

The comment indicates that None values are correctly forwarded when draft logits are not available, which maintains backward compatibility.


76-90: Draft logit policy integration verified

The should_provide_draft_probs API is declared in Sampler (returns True by default) and correctly overridden in TRTLLMSampler to disable draft logits under greedy sampling. A search shows its sole invocation for draft requests in model_drafter.py, ensuring consistent per-request logic without residual flags.

tensorrt_llm/_torch/pyexecutor/sampler.py (9)

81-83: Conservative default for draft probability handling

The base should_provide_draft_probs method returns True as a conservative default, ensuring backward compatibility. Derived classes can override it for optimization.


276-285: Efficient greedy search with optional softmax filtering

The softmax_indices parameter enables selective softmax computation, skipping unnecessary calculations for greedy sampling, a nice performance optimization.
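The selective-softmax idea reviewed here can be illustrated with a small, hypothetical sketch (names and signature are illustrative, not the PR's actual API):

```python
from typing import Optional

import torch

def greedy_with_optional_softmax(
        logits: torch.Tensor,
        softmax_indices: Optional[torch.Tensor] = None):
    """Greedy tokens for every row; softmax only for the rows that need it."""
    tokens = logits.argmax(dim=-1)
    probs = None
    if softmax_indices is not None and softmax_indices.numel() > 0:
        # Compute probabilities only for the subset of rows that require them,
        # e.g. rows feeding speculative-decoding rejection sampling.
        probs = torch.softmax(logits[softmax_indices], dim=-1)
    return tokens, probs
```

When no row needs probabilities, the softmax (and its memory traffic over the full vocabulary) is skipped entirely.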


288-310: Fixed potential issue with padded draft tokens

Good catch! The comment at lines 291-292 correctly notes that draft_probs is not padded while draft_tokens may be padded by ModelDrafter._pad_to_max_draft_tokens. The implementation correctly handles this by slicing draft_tokens[:num_draft_tokens].


422-477: Excellent implementation of multi-arange utility

The torch_multi_arange function provides an efficient batched implementation with clear documentation and an algorithm explanation. The use of torch.repeat_interleave() and torch.cumsum() to avoid loops is clever.
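The repeat_interleave/cumsum trick praised here can be reimplemented as a small, self-contained sketch (an illustration of the idea, not the PR's actual torch_multi_arange):

```python
import torch

def multi_arange(starts: torch.Tensor, ends: torch.Tensor) -> torch.Tensor:
    """Concatenate arange(starts[i], ends[i]) for all i without a Python loop.

    Equivalent to: torch.cat([torch.arange(s, e) for s, e in zip(starts, ends)])
    """
    lengths = ends - starts
    # Tag every output slot with its segment's start value.
    seg_starts = torch.repeat_interleave(starts, lengths)
    # The global position of each slot minus its segment's cumulative offset
    # yields the within-segment offset.
    flat_idx = torch.arange(seg_starts.numel())
    seg_offsets = torch.repeat_interleave(
        torch.cumsum(lengths, dim=0) - lengths, lengths)
    return seg_starts + flat_idx - seg_offsets
```

Everything runs as a handful of vectorized tensor ops, so the cost is independent of the number of segments on the Python side.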


767-767: Good device consistency check

Adding the assertion ensures the generator is on the expected device, preventing subtle CUDA errors.


965-1050: Sophisticated embedding bias application

The _apply_embedding_bias implementation:

  1. Handles per-request bias tensors efficiently
  2. Uses smart batching to minimize memory operations
  3. Includes detailed comments explaining the trade-offs
  4. Correctly uses non-blocking transfers

The complexity is justified given the performance considerations.
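A simplified, hypothetical sketch of the batching idea behind such an embedding-bias step (not the PR's actual implementation, which also handles device transfers):

```python
import torch

def apply_embedding_bias(logits: torch.Tensor, biases: list) -> torch.Tensor:
    """Add per-request vocabulary biases to the matching logits rows.

    biases[i] is either None or a 1-D tensor of vocab size for request i;
    all biased rows are updated with one stacked index_add_ instead of a
    per-request Python loop.
    """
    rows = [i for i, b in enumerate(biases) if b is not None]
    if not rows:
        return logits
    idx = torch.tensor(rows)
    bias_batch = torch.stack([biases[i] for i in rows])  # one contiguous tensor
    out = logits.clone()
    out.index_add_(0, idx, bias_batch)
    return out
```

Stacking the biases into one tensor before the scatter keeps the number of kernel launches (and, in the real code, host-to-device copies) independent of the number of biased requests.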


1192-1286: Complex but necessary unbatching logic

The _unbatch_sampling_results method correctly handles:

  1. Dimension ordering verification
  2. Efficient scatter operations using custom indexers
  3. Proper handling of log probabilities with different tensor layouts
  4. Non-blocking D2H transfers for performance

The FIXMEs at lines 1256 and 1260 suggest potential optimizations for memory layout that could be addressed in future iterations.


1344-1410: Main processing pipeline successfully refactored

The _process_requests method integrates all the batched sampling components effectively:

  1. Computes per-request steps
  2. Selects generated logits
  3. Applies embedding bias
  4. Performs batched sampling by strategy
  5. Unbatches results

This is the core of the performance improvements replacing the old "fast_path" code.


1413-1415: Good optimization for greedy sampling

The override correctly avoids requesting draft probabilities for greedy sampling, which does not require rejection sampling.
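The policy-hook pattern reviewed here can be sketched with minimal, hypothetical classes (the real Sampler base class and request type are more involved):

```python
class BaseSampler:
    """Illustrative base class: conservative default policy."""

    def should_provide_draft_probs(self, request) -> bool:
        return True  # always provide draft probabilities by default

class GreedyAwareSampler(BaseSampler):
    """Illustrative subclass: skip draft probs for greedy requests."""

    def should_provide_draft_probs(self, request) -> bool:
        # Greedy decoding accepts draft tokens by exact match, so rejection
        # sampling (and hence draft probabilities) is unnecessary.
        return not getattr(request, "is_greedy", False)
```

The drafter queries this hook per request, so the decision lives with the sampler that will later consume (or ignore) the probabilities.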

@ixlmar force-pushed the feat/batched-sampling branch 2 times, most recently from 2b209dc to 6e89d3b (compare) August 27, 2025 10:37
@ixlmar requested a review from dcampora August 27, 2025 10:38
@ixlmar force-pushed the feat/batched-sampling branch from 6e89d3b to f750bba (compare) August 27, 2025 10:43
@ixlmar (Collaborator, Author) commented:

/bot run

@ixlmar changed the title from "[TRTLLM-7156, TRTLLM-7728][feat] batched sampling by strategy (supersedes enable_mixed_sampler)" to "[TRTLLM-7156][TRTLLM-7728][feat] batched sampling by strategy (supersedes enable_mixed_sampler)" Aug 27, 2025
@tensorrt-cicd
Copy link
Collaborator

PR_Github #16682 [ run ] triggered by Bot

@ixlmar changed the title from "[TRTLLM-7156][TRTLLM-7728][feat] batched sampling by strategy (supersedes enable_mixed_sampler)" to "[TRTLLM-7728][feat] batched sampling by strategy (supersedes enable_mixed_sampler, TRTLLM-7156)" Aug 27, 2025
@ixlmar changed the title from "[TRTLLM-7728][feat] batched sampling by strategy (supersedes enable_mixed_sampler, TRTLLM-7156)" to "[TRTLLM-7728][feat] batched sampling by strategy (supersedes enable_mixed_sampler, cf. TRTLLM-7156)" Aug 27, 2025
coderabbitai bot (Contributor) left a comment:

Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
tensorrt_llm/_torch/speculative/model_drafter.py (1)

1-1: Add required NVIDIA copyright header

Per repo guidelines, prepend the NVIDIA copyright header (current year).

Apply:

```diff
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
 from __future__ import annotations
```
tensorrt_llm/_torch/pyexecutor/sampler.py (2)

288-310: get_rejected_indices(): fix index dtypes and shape bug for single-step case

  • Index tensors must be int64.
  • Removing squeeze(0) avoids breaking 2D indexing when num_steps == 1.

Apply:

```diff
-    token_idx = torch.arange(num_draft_tokens,
-                             dtype=torch.int32,
-                             device=generator.device)
-    draft_tokens_cuda = torch.tensor(draft_tokens, dtype=torch.int32).to(
+    token_idx = torch.arange(num_draft_tokens,
+                             dtype=torch.long,
+                             device=generator.device)
+    draft_tokens_cuda = torch.tensor(draft_tokens, dtype=torch.long).to(
         device=generator.device, non_blocking=True)
     p = draft_probs[token_idx, draft_tokens_cuda]
-    q = target_probs.squeeze(0)[token_idx, draft_tokens_cuda]
+    q = target_probs[token_idx, draft_tokens_cuda]
```
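The pairwise gather pattern at the heart of this suggestion can be shown in isolation (toy values; int64 is used here as the universally safe index dtype, though some ops also accept int32):

```python
import torch

# One probability per (step, token) pair: probs[token_idx[i], token_ids[i]].
probs = torch.arange(12, dtype=torch.float32).reshape(3, 4)
token_idx = torch.arange(3, dtype=torch.long)          # step index per draft token
token_ids = torch.tensor([1, 0, 3], dtype=torch.long)  # drafted token id per step
gathered = probs[token_idx, token_ids]                 # shape (3,)
```

Keeping `probs` 2D (no `squeeze(0)`) means the same indexing expression works for a single step as well as for multiple steps.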

763-777: Support multi-device by storing per-device RNG

Asserting a single device can break in multi-GPU contexts. Keep one deterministic generator per device.

Apply:

```diff
     def __init__(self, args: Args):
@@
-        self._generator = None
+        self._generators: dict[torch.device, torch.Generator] = {}
@@
-    def get_generator(self, device: torch.device) -> torch.Generator:
+    def get_generator(self, device: torch.device) -> torch.Generator:
@@
-        if self._generator is None:
-            # Fallback to a default seed if not set
-            self._generator = torch.Generator(device=device)
-            self._generator.manual_seed(self._global_seed)
-        assert self._generator.device == device
-        return self._generator
+        gen = self._generators.get(device)
+        if gen is None or gen.device != device:
+            gen = torch.Generator(device=device)
+            gen.manual_seed(self._global_seed)
+            self._generators[device] = gen
+        return gen
```
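A minimal self-contained sketch of the suggested per-device cache (the class name SeededGenerators is made up for illustration):

```python
import torch


class SeededGenerators:
    """Cache one deterministic torch.Generator per device, all seeded
    identically (illustrative sketch of the per-device RNG suggestion)."""

    def __init__(self, seed: int = 42):
        self._seed = seed
        self._generators: dict[torch.device, torch.Generator] = {}

    def get(self, device) -> torch.Generator:
        device = torch.device(device)
        gen = self._generators.get(device)
        if gen is None:
            # Lazily create and seed a generator the first time a device is seen.
            gen = torch.Generator(device=device)
            gen.manual_seed(self._seed)
            self._generators[device] = gen
        return gen
```

Repeated lookups for the same device return the same generator object, so sampling state advances consistently per device instead of failing an assert.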
🧹 Nitpick comments (2)
tensorrt_llm/_torch/speculative/model_drafter.py (1)

315-319: Forwarding generation logits: guard against tensor-device mismatches

Assigning req.py_result.generation_logits directly to target_model_req.py_draft_logits is fine; ensure both tensors reside on the same device (expected CUDA). If PyResult ever stores CPU tensors (e.g., for CPU-only tests), consider moving to the target device or documenting the assumption.

tensorrt_llm/_torch/pyexecutor/sampler.py (1)

860-898: Consistent attribute: use py_seq_slot instead of seq_slot

Elsewhere the code uses py_seq_slot; mixing both can cause subtle bugs if they ever diverge.

Apply:

```diff
-            new_tokens[i, request.seq_slot, self.BEAM] = new_token
+            new_tokens[i, request.py_seq_slot, self.BEAM] = new_token
@@
-            new_tokens[num_accepted, request.seq_slot, self.BEAM] = new_token
+            new_tokens[num_accepted, request.py_seq_slot, self.BEAM] = new_token
```
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 6e89d3b and f750bba.

📒 Files selected for processing (12)
  • tensorrt_llm/_torch/auto_deploy/llm_args.py (0 hunks)
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (0 hunks)
  • tensorrt_llm/_torch/modules/rms_norm.py (3 hunks)
  • tensorrt_llm/_torch/pyexecutor/_util.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/config.py (0 hunks)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/sampler.py (23 hunks)
  • tensorrt_llm/_torch/speculative/eagle3.py (2 hunks)
  • tensorrt_llm/_torch/speculative/model_drafter.py (6 hunks)
  • tensorrt_llm/llmapi/llm_args.py (0 hunks)
  • tensorrt_llm/scaffolding/worker.py (0 hunks)
  • tests/unittest/llmapi/test_llm_pytorch.py (2 hunks)
💤 Files with no reviewable changes (5)
  • tensorrt_llm/_torch/pyexecutor/config.py
  • tensorrt_llm/_torch/auto_deploy/llm_args.py
  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/scaffolding/worker.py
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
🚧 Files skipped from review as they are similar to previous changes (5)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tensorrt_llm/_torch/speculative/eagle3.py
  • tests/unittest/llmapi/test_llm_pytorch.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/_torch/modules/rms_norm.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in __init__
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/speculative/model_drafter.py
  • tensorrt_llm/_torch/pyexecutor/sampler.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/speculative/model_drafter.py
  • tensorrt_llm/_torch/pyexecutor/sampler.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/speculative/model_drafter.py (6)
tensorrt_llm/_torch/pyexecutor/sampler.py (20)
  • Sampler (52-83)
  • SampleState (43-49)
  • should_provide_draft_probs (81-83)
  • should_provide_draft_probs (1433-1435)
  • beam_width (72-75)
  • is_generation_model (78-79)
  • is_generation_model (109-110)
  • is_generation_model (162-163)
  • is_generation_model (727-728)
  • is_generation_model (1469-1470)
  • sample_async (63-65)
  • sample_async (93-96)
  • sample_async (134-141)
  • sample_async (951-967)
  • sample_async (1616-1704)
  • update_requests (68-69)
  • update_requests (99-106)
  • update_requests (144-159)
  • update_requests (911-935)
  • update_requests (1708-1721)
tensorrt_llm/_torch/pyexecutor/scheduler.py (2)
  • ScheduledRequests (18-39)
  • all_requests (38-39)
tensorrt_llm/llmapi/llm_args.py (1)
  • DecodingBaseConfig (349-404)
tensorrt_llm/_torch/pyexecutor/llm_request.py (4)
  • LlmRequest (282-424)
  • append (79-98)
  • append (125-142)
  • generation_logits (214-223)
tensorrt_llm/_torch/pyexecutor/handle_logits.py (1)
  • HandleLogits (11-81)
tensorrt_llm/_torch/speculative/mtp.py (2)
  • sample_async (271-318)
  • update_requests (245-269)
tensorrt_llm/_torch/pyexecutor/sampler.py (3)
tensorrt_llm/_torch/pyexecutor/llm_request.py (2)
  • LlmRequest (282-424)
  • get_draft_token_length (558-569)
tensorrt_llm/_torch/speculative/mtp.py (1)
  • sample_async (271-318)
tensorrt_llm/_torch/pyexecutor/scheduler.py (2)
  • ScheduledRequests (18-39)
  • all_requests (38-39)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/sampler.py

1353-1353: Line too long (125 > 120)

(E501)

🔇 Additional comments (3)
tensorrt_llm/_torch/speculative/model_drafter.py (1)

271-282: Prefix-sum logic: confirm 1-logit accounting when context logits are disabled

You count one logit per context request when return_context_logits is False, to keep the last-token logit for potential generation logits. Please verify this matches the upstream model output layout for all decoders.

Would you like a quick script to assert that lengths of num_context_logits_prefix_sum align with actual logits slices during tests?

tensorrt_llm/_torch/pyexecutor/sampler.py (2)

1433-1436: Good: draft probs suppressed for greedy

Skipping draft probabilities when greedy is selected avoids unnecessary softmax work.


166-173: Use LongTensor for indices and adjust type hints

Indexing tensors in PyTorch requires int64 (LongTensor). Using IntTensor/int32 for indices will raise runtime errors in advanced indexing and index_select.

Apply:

```diff
-def top_k_sampling_batch(
+def top_k_sampling_batch(
+    logits,
+    top_k=50,
+    generator: Optional[torch.Generator] = None
+) -> tuple[torch.Tensor, torch.Tensor]:
```

And below (see related comments) ensure all index tensors are int64; also change any annotations using torch.IntTensor to torch.Tensor and cast to long at use sites.

Likely an incorrect or invalid review comment.

@tensorrt-cicd
Collaborator

PR_Github #16682 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #12523 completed with status: 'FAILURE'

Contributor

@coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tests/unittest/llmapi/test_llm_pytorch.py (1)

230-253: Force deterministic greedy sampling and avoid numeric overflow in embedding bias

The test currently relies on default sampling behavior (temperature=None → C++ default 1.0f) and an extreme bias value (torch.finfo(torch.float32).max), which can lead to non-greedy sampling and potential numeric instabilities (infs or NaNs). To ensure determinism and stability across backends, please update the test as follows:

• In tests/unittest/llmapi/test_llm_pytorch.py at test_embedding_bias_with_torch_sampler_strategies() (around line 243):

  • Replace the use of torch.finfo(torch.float32).max with a large but finite bias (e.g. 1e6).
  • Add temperature=0.0 (to force greedy sampling) and top_k=1 (to restrict choices to the top token).

Suggested diff:

```diff
 def test_embedding_bias_with_torch_sampler_strategies():
@@
-    vocab_size_padded = 32000
-    embedding_bias = torch.zeros(vocab_size_padded)
-    embedding_bias[biased_word_id] = torch.finfo(torch.float32).max
+    vocab_size_padded = 32000  # consider deriving from tokenizer/model if it varies
+    embedding_bias = torch.zeros(vocab_size_padded)
+    embedding_bias[biased_word_id] = 1e6  # large finite bias to avoid overflow
@@
-    sampling_kwargs = {
-        "max_tokens": 6,
-        "embedding_bias": embedding_bias,
-    }
+    sampling_kwargs = {
+        "max_tokens": 6,
+        "temperature": 0.0,   # force deterministic greedy sampling
+        "top_k": 1,           # restrict to the highest-scoring token
+        "embedding_bias": embedding_bias,
+    }
```

These changes will make the test behavior consistent and robust across different PyTorch builds and hardware.

tensorrt_llm/_torch/modules/rms_norm.py (1)

63-71: Inconsistent residual semantics between FLASHINFER and non-FLASH paths.

Non-FLASH returns the updated residual (pre-norm sum), FLASHINFER returns the original residual. This diverges behavior and can corrupt downstream fused-add pipelines.

Apply:

```diff
         if IS_FLASHINFER_AVAILABLE:
             from ..custom_ops import (flashinfer_fused_add_rmsnorm,
                                       flashinfer_rmsnorm)
-            if residual is not None:
-                flashinfer_fused_add_rmsnorm(hidden_states, residual,
-                                             self.weight, self.variance_epsilon)
+            if residual is not None:
+                # Preserve parity with non-FLASH path: return pre-norm sum as residual
+                input_dtype = hidden_states.dtype
+                _updated_residual = (hidden_states.to(torch.float32) +
+                                     residual.to(torch.float32)).to(input_dtype)
+                flashinfer_fused_add_rmsnorm(hidden_states, residual,
+                                             self.weight, self.variance_epsilon)
+                residual = _updated_residual
             else:
                 hidden_states = flashinfer_rmsnorm(hidden_states, self.weight,
                                                    self.variance_epsilon)
```
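For reference, the intended parity can be modeled in plain PyTorch (a sketch of the expected semantics under the assumption that the returned residual is the pre-norm sum, not the fused kernel itself):

```python
import torch


def rms_norm_with_residual(hidden, residual, weight, eps=1e-6):
    """Reference semantics for the fused-add RMSNorm path: the returned
    residual is the pre-norm sum of hidden and the incoming residual."""
    x = hidden.to(torch.float32)
    if residual is not None:
        x = x + residual.to(torch.float32)
        residual = x.to(hidden.dtype)  # pre-norm sum becomes the new residual
    # RMS normalization over the last dimension.
    normed = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)
    return normed.to(hidden.dtype) * weight, residual
```

Any fused implementation should match this function on both return values; the review comment above flags that the FLASHINFER branch currently differs in the residual it hands back.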
🧹 Nitpick comments (2)
tests/unittest/llmapi/test_llm_pytorch.py (1)

234-241: Consider deriving vocab_size_padded instead of hard-coding 32000.

Hard-coding may break with model/tokenizer variants that pad differently.

I can propose a helper to compute padded size from tokenizer vocab and alignment used by the sampler if you confirm the pad multiple.

tensorrt_llm/_torch/modules/rms_norm.py (1)

75-78: Clarify intent: residual holds pre-norm sum (not original residual).

This is correct for fused-add pipelines; consider a brief comment to avoid future regressions.

Apply:

```diff
-            if residual is not None:
+            if residual is not None:
+                # Update residual to the pre-norm sum(hidden_states, residual) for downstream fused-add
                 hidden_states = hidden_states + residual.to(torch.float32)
                 residual = hidden_states.to(input_dtype)
```
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f750bba and 8d6a5a6.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/modules/rms_norm.py (5 hunks)
  • tests/unittest/llmapi/test_llm_pytorch.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in __init__
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tests/unittest/llmapi/test_llm_pytorch.py
  • tensorrt_llm/_torch/modules/rms_norm.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tests/unittest/llmapi/test_llm_pytorch.py
  • tensorrt_llm/_torch/modules/rms_norm.py
🧬 Code graph analysis (2)
tests/unittest/llmapi/test_llm_pytorch.py (1)
tests/unittest/llmapi/test_llm.py (1)
  • llm_test_harness (109-136)
tensorrt_llm/_torch/modules/rms_norm.py (3)
tensorrt_llm/_torch/models/modeling_utils.py (1)
  • skip_forward (148-159)
tensorrt_llm/_torch/modules/layer_norm.py (1)
  • skip_forward (102-120)
tensorrt_llm/_torch/modules/decoder_layer.py (1)
  • skip_forward (23-34)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/modules/rms_norm.py (1)

63-71: In-place semantics confirmed for flashinfer_fused_add_rmsnorm
Verified that flashinfer_fused_add_rmsnorm is registered as a custom op with mutates_args=("input", "residual") and its Python signature returns None, meaning it mutates hidden_states (and residual) in-place. No changes required.

@ixlmar
Collaborator (Author)

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16697 [ run ] triggered by Bot

Contributor

@coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 8d6a5a6 and 12e937b.

📒 Files selected for processing (2)
  • tests/integration/test_lists/test-db/l0_dgx_b200.yml (2 hunks)
  • tests/integration/test_lists/test-db/l0_dgx_h100.yml (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/test_lists/test-db/l0_dgx_h100.yml
  • tests/integration/test_lists/test-db/l0_dgx_b200.yml
📚 Learning: 2025-08-26T09:49:04.956Z
Learnt from: pengbowang-nv
PR: NVIDIA/TensorRT-LLM#7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").

Applied to files:

  • tests/integration/test_lists/test-db/l0_dgx_h100.yml
  • tests/integration/test_lists/test-db/l0_dgx_b200.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tests/integration/test_lists/test-db/l0_dgx_b200.yml (1)

97-101: LGTM: standardized lowercase backend specifiers for GPTOSS post_merge.

Consistent with the broader rename pattern in this PR.

tests/integration/test_lists/test-db/l0_dgx_h100.yml (1)

154-159: LGTM: backend tag casing normalized (CUTLASS/TRITON → cutlass/triton).

Matches the naming scheme adopted elsewhere; should reduce param-id drift across files.

Contributor

@coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/modules/rms_norm.py (1)

64-73: Residual return semantics diverge between FLASHINFER and fallback paths.

Fallback returns the pre-norm sum as residual (Line 78), while the FLASHINFER branch returns the original input residual unchanged, breaking parity and likely confusing downstream consumers expecting the updated residual. Fix by computing the summed residual for return in the FLASHINFER path without altering the inputs to the fused op.

```diff
         if IS_FLASHINFER_AVAILABLE:
             from ..custom_ops import (flashinfer_fused_add_rmsnorm,
                                       flashinfer_rmsnorm)
-            if residual is not None:
-                flashinfer_fused_add_rmsnorm(hidden_states, residual,
-                                             self.weight, self.variance_epsilon)
+            if residual is not None:
+                # Maintain the same return semantics as the non-FLASHINFER path:
+                # return the pre-norm sum as residual.
+                input_dtype = hidden_states.dtype
+                residual_out = (hidden_states.to(torch.float32)
+                                + residual.to(torch.float32)).to(input_dtype)
+                flashinfer_fused_add_rmsnorm(hidden_states,  # normalized in-place/return
+                                             residual,
+                                             self.weight,
+                                             self.variance_epsilon)
+                residual = residual_out
             else:
                 hidden_states = flashinfer_rmsnorm(hidden_states, self.weight,
                                                    self.variance_epsilon)
```
🧹 Nitpick comments (4)
tensorrt_llm/_torch/modules/rms_norm.py (4)

28-30: Make sentinel definition clearer and immutable.

Minor readability/typing tweak: define the alias before the constant and mark the sentinel Final so type-checkers treat it as immutable.

```diff
-from typing import Optional, Tuple, TypeAlias, Union, cast
+from typing import Final, Optional, Tuple, TypeAlias, Union, cast
@@
-    _ARGUMENT_NOT_SPECIFIED_SENTINEL = ...
-    _ArgumentNotSpecifiedSentinelType: TypeAlias = EllipsisType
+    _ArgumentNotSpecifiedSentinelType: TypeAlias = EllipsisType
+    _ARGUMENT_NOT_SPECIFIED_SENTINEL: Final[_ArgumentNotSpecifiedSentinelType] = ...
```

55-63: Streamline the "unspecified residual" logic.

Equivalent but tighter boolean makes intent obvious and avoids the extra flag mutation.

```diff
-        return_residual = True
-        if residual is self._ARGUMENT_NOT_SPECIFIED_SENTINEL:
-            return_residual = False
-            residual = None
+        return_residual = residual is not self._ARGUMENT_NOT_SPECIFIED_SENTINEL
+        if not return_residual:
+            residual = None
```
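The underlying sentinel contract can be demonstrated generically (names here are illustrative, not the RMSNorm API):

```python
_NOT_SPECIFIED = object()  # module-private sentinel, distinct from None


def forward(x, residual=_NOT_SPECIFIED):
    """Return just `x` when `residual` was not passed at all; return an
    (x, residual) tuple when the caller passed anything, including None."""
    if residual is _NOT_SPECIFIED:
        return x
    return x, residual
```

Using identity (`is`) against a private sentinel lets the function distinguish "argument omitted" from "argument explicitly None", which a plain `residual=None` default cannot do.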

76-79:Avoid an extra allocation in the fallback add.

You can do the add in-place on the FP32 buffer to reduce one temporary tensor.

```diff
-            if residual is not None:
-                hidden_states = hidden_states + residual.to(torch.float32)
-                residual = hidden_states.to(input_dtype)
+            if residual is not None:
+                hidden_states.add_(residual.to(torch.float32))
+                residual = hidden_states.to(input_dtype)
```

Note: ensure no autograd requirements/in-place aliasing concerns in this inference path.


110-116: Fix mutable default and Optional types in group_rms_norm signature.

weights uses a mutable default [], and eps/weight_bias are annotated Optional but never accept None. Clean this up to avoid surprises.

```diff
-def group_rms_norm(
-        inputs: list[torch.Tensor],
-        weights: Optional[list[torch.Tensor]] = [],
-        eps: Optional[float] = 1e-5,
-        weight_bias: Optional[float] = 0.0,
+def group_rms_norm(
+        inputs: list[torch.Tensor],
+        weights: Optional[list[torch.Tensor]] = None,
+        eps: float = 1e-5,
+        weight_bias: float = 0.0,
         kernel: GroupRMSNormKernelSelection = GroupRMSNormKernelSelection.heuristic,
         outputs: Optional[list[torch.Tensor]] = None) -> list[torch.Tensor]:
@@
-    out = outputs
+    out = outputs
     if out is None:
         out = [torch.empty_like(input) for input in inputs]
+    if weights is None:
+        weights = []
```
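The mutable-default pitfall this suggestion fixes is easy to reproduce in isolation:

```python
def append_shared(item, acc=[]):
    # BUG: the default list is created once at function definition time and
    # shared across every call that omits `acc`.
    acc.append(item)
    return acc


def append_fresh(item, acc=None):
    # Fix: use None as the default and allocate a new list per call.
    if acc is None:
        acc = []
    acc.append(item)
    return acc
```

The buggy variant accumulates state across calls, which is why `weights=[]` in a signature is a latent hazard even if current callers always pass a value.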
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 12e937b and 72ae320.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/modules/rms_norm.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in __init__
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/modules/rms_norm.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/modules/rms_norm.py
🧠 Learnings (3)
📚 Learning: 2025-08-27T14:41:56.613Z
Learnt from: ixlmar
PR: NVIDIA/TensorRT-LLM#7294
File: tensorrt_llm/_torch/modules/rms_norm.py:96-99
Timestamp: 2025-08-27T14:41:56.613Z
Learning: In tensorrt_llm/_torch/modules/rms_norm.py, the RMSNorm class uses a custom sentinel (_ARGUMENT_NOT_SPECIFIED_SENTINEL) instead of Ellipsis (...) for detecting unspecified optional arguments. Other modules in the codebase may use Ellipsis as a sentinel but do not forward it to RMSNorm methods, so there's no need for backward compatibility with Ellipsis in RMSNorm.

Applied to files:

  • tensorrt_llm/_torch/modules/rms_norm.py
📚 Learning: 2025-08-27T14:23:55.545Z
Learnt from: ixlmar
PR: NVIDIA/TensorRT-LLM#7294
File: tensorrt_llm/_torch/modules/rms_norm.py:17-17
Timestamp: 2025-08-27T14:23:55.545Z
Learning: The TensorRT-LLM project requires Python 3.10+ as evidenced by the use of TypeAlias from typing module, match/case statements, and union type | syntax throughout the codebase, despite some documentation still mentioning Python 3.8+.

Applied to files:

  • tensorrt_llm/_torch/modules/rms_norm.py
📚 Learning: 2025-08-25T22:42:47.612Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-25T22:42:47.612Z
Learning: Applies to **/*.py : Code must target Python 3.8+

Applied to files:

  • tensorrt_llm/_torch/modules/rms_norm.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/modules/rms_norm.py (3)
tensorrt_llm/_torch/models/modeling_utils.py (1)
  • skip_forward (148-159)
tensorrt_llm/_torch/modules/layer_norm.py (1)
  • skip_forward (102-120)
tensorrt_llm/_torch/modules/decoder_layer.py (1)
  • skip_forward (23-34)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tensorrt_llm/_torch/modules/rms_norm.py (3)

17-18: Python 3.10+ typing usage is fine; ensure packaging/CI reflect this floor.

Using TypeAlias and EllipsisType is correct given the project's effective Python ≥3.10 policy. Please confirm python_requires and CI images align (≥3.10) to avoid drift.


85-88: Return contract looks good.

Tuple is returned only when the caller explicitly provided a residual (including None), matching the sentinel contract.


93-101: skip_forward parity with sentinel handling is correct.

Behavior mirrors forward: unspecified → single tensor; specified (even None) → tuple. Good.

@ixlmar
Collaborator (Author)

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16697 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12532 completed with status: 'FAILURE'

@tensorrt-cicd
Collaborator

PR_Github #16712 [ run ] triggered by Bot

Contributor

@coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/pyexecutor/sampler.py (2)

769-783: Make generators per-device; current assert breaks multi-GPU

A single cached generator with a hard device-assert will fail when sampling across devices.

```diff
-        self._global_seed = 42
-        self._generator = None
+        self._global_seed = 42
+        self._generators: dict[torch.device, torch.Generator] = {}
@@
-    def get_generator(self, device: torch.device) -> torch.Generator:
+    def get_generator(self, device: torch.device) -> torch.Generator:
@@
-        if self._generator is None:
-            # Fallback to a default seed if not set
-            self._generator = torch.Generator(device=device)
-            self._generator.manual_seed(self._global_seed)
-        assert self._generator.device == device
-        return self._generator
+        gen = self._generators.get(device)
+        if gen is None:
+            gen = torch.Generator(device=device)
+            gen.manual_seed(self._global_seed)
+            self._generators[device] = gen
+        return gen
```

885-897: Use py_seq_slot consistently (fixes wrong writes with binding seq_slot)

TorchSampler elsewhere uses py_seq_slot; mixing in seq_slot can misroute writes.

```diff
-            new_tokens[i, request.seq_slot, self.BEAM] = new_token
+            new_tokens[i, request.py_seq_slot, self.BEAM] = new_token
@@
-            new_tokens[num_accepted, request.seq_slot, self.BEAM] = new_token
+            new_tokens[num_accepted, request.py_seq_slot, self.BEAM] = new_token
```

Also applies to: 899-904

♻️ Duplicate comments (3)
tensorrt_llm/_torch/pyexecutor/sampler.py (3)

374-397: Int32 softmax_indices and filtering behavior LGTM

Type hint and greedy softmax-skipping logic align with established TRT-LLM conventions.


1-2: Add required NVIDIA copyright header

Per repo guidelines, prepend the NVIDIA header.

Apply:

```diff
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+
 import dataclasses
 import enum
```

1018-1064: _apply_embedding_bias(): use id()-based dedup and Long indices for index_select

Tensor keys are unsafe/expensive; index_select requires Long indices.

```diff
-        # Indices of unique bias tensors
-        bias_to_index: dict[torch.Tensor,
-                            int] = defaultdict(provision_bias_index)
+        # Indices of unique bias tensors (group by identity, not contents)
+        bias_to_index: dict[int, int] = defaultdict(provision_bias_index)
+        biases: list[torch.Tensor] = []
@@
-            if req_bias is not None:
+            if req_bias is not None:
                 logits_bias_mask[i:(i + steps)] = True
-                req_bias_index = bias_to_index[req_bias]
+                key = id(req_bias)
+                req_bias_index = bias_to_index[key]
+                if req_bias_index == len(biases):
+                    biases.append(req_bias)
                 bias_gather_indices.extend(repeat(req_bias_index, steps))
@@
-        bias_gather_indices_cuda = torch.tensor(bias_gather_indices,
-                                                pin_memory=True,
-                                                dtype=torch.int32).to(
-                                                    logits.device,
-                                                    non_blocking=True)
+        bias_gather_indices_cuda = torch.tensor(
+            bias_gather_indices, pin_memory=True, dtype=torch.long
+        ).to(logits.device, non_blocking=True)
@@
-        biases_tensor = torch.empty((len(bias_to_index), *req_bias.shape),
-                                    pin_memory=True)
-        biases_tensor = torch.stack(
-            tuple(bias_to_index.keys()),
-            out=biases_tensor,
-        )
+        biases_tensor = torch.stack(
+            biases, out=torch.empty((len(biases), *req_bias.shape), pin_memory=True)
+        )
@@
-        biases_tensor_cuda = torch.index_select(biases_tensor_cuda, 0,
-                                                bias_gather_indices_cuda)
+        biases_tensor_cuda = torch.index_select(
+            biases_tensor_cuda, 0, bias_gather_indices_cuda
+        )
```
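The identity-based deduplication the review proposes can be illustrated without torch. This hypothetical helper assigns each (possibly repeated) object an index into a list of unique objects, grouping by `id()` rather than by contents:

```python
def dedup_by_identity(items):
    """Assign each object an index into a list of unique objects, grouping
    by identity (id()) rather than by contents; a pure-Python sketch of the
    dedup suggested for _apply_embedding_bias().

    Returns (unique_items, gather_indices) such that
    [unique_items[i] for i in gather_indices] reproduces `items`.
    """
    index_of: dict[int, int] = {}
    unique: list = []
    gather: list[int] = []
    for obj in items:
        key = id(obj)
        idx = index_of.get(key)
        if idx is None:
            # First time we see this object: provision a new index.
            idx = len(unique)
            index_of[key] = idx
            unique.append(obj)
        gather.append(idx)
    return unique, gather
```

Note that two objects with equal contents but distinct identities get distinct indices; that is the point of keying on `id()`, since it never scans the payload.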
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)

39-56: Confirm minimum supported Python version (uses 3.10+ features)

This file uses kw_only dataclasses, PEP 634 pattern matching (match/case), and PEP 604 unions (X | Y), which require Python 3.10+. Repo guideline says “Python 3.8+”; please confirm runtime targets 3.10+ or adjust.

Would you like a codemod to replace match/case with if/elif and kw_only with runtime-safe alternatives if 3.8/3.9 must be supported?

Also applies to: 119-133, 374-397

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 72ae320 and fae40c5.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/pyexecutor/sampler.py (24 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in __init__
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
🧠 Learnings (9)
📚 Learning: 2025-08-25T22:42:47.612Z
Learnt from: CR (PR: NVIDIA/TensorRT-LLM#0, File: CODING_GUIDELINES.md:0-0, Timestamp: 2025-08-25T22:42:47.612Z). Learning: Applies to **/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py} : Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam (PR: NVIDIA/TensorRT-LLM#6487, File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12, Timestamp: 2025-08-06T13:58:07.506Z). Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-27T15:03:57.127Z
Learnt from: ixlmar (PR: NVIDIA/TensorRT-LLM#7294, File: tensorrt_llm/_torch/pyexecutor/sampler.py:368-392, Timestamp: 2025-08-27T15:03:57.127Z). Learning: In TensorRT-LLM's sampler.py, int32 usage for softmax_indices and related tensor indexing is intentional and should not be changed to int64. The torch.IntTensor type hint is correct for the sample() function's softmax_indices parameter.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-27T14:52:26.324Z
Learnt from: ixlmar (PR: NVIDIA/TensorRT-LLM#7294, File: tensorrt_llm/_torch/pyexecutor/sampler.py:985-1064, Timestamp: 2025-08-27T14:52:26.324Z). Learning: In TensorRT-LLM's _apply_embedding_bias() method, tensor deduplication should use id() instead of the tensor itself as a dict key to avoid scanning tensor contents and for better performance.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-27T14:47:49.339Z
Learnt from: ixlmar (PR: NVIDIA/TensorRT-LLM#7294, File: tensorrt_llm/_torch/pyexecutor/sampler.py:1229-1238, Timestamp: 2025-08-27T14:47:49.339Z). Learning: torch.Tensor.dim_order is a public API in PyTorch that returns the dimension order of a tensor, documented at https://docs.pytorch.org/docs/stable/generated/torch.Tensor.dim_order.html. It supports an ambiguity_check parameter that can be a boolean or list of memory formats.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-11T20:09:24.389Z
Learnt from: achartier (PR: NVIDIA/TensorRT-LLM#6763, File: tests/integration/defs/triton_server/conftest.py:16-22, Timestamp: 2025-08-11T20:09:24.389Z). Learning: In the TensorRT-LLM test infrastructure, the team prefers simple, direct solutions (like hard-coding directory traversal counts) over more complex but robust approaches when dealing with stable directory structures. They accept the maintenance cost of updating tests if the layout changes.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-21T00:16:56.457Z
Learnt from: farshadghodsian (PR: NVIDIA/TensorRT-LLM#7101, File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36, Timestamp: 2025-08-21T00:16:56.457Z). Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu (PR: NVIDIA/TensorRT-LLM#6303, File: tests/integration/test_lists/qa/examples_test_list.txt:494-494, Timestamp: 2025-07-28T17:06:08.621Z). Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia (PR: NVIDIA/TensorRT-LLM#6506, File: examples/models/core/mixtral/requirements.txt:3-3, Timestamp: 2025-08-01T15:14:45.673Z). Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (3)
tensorrt_llm/_torch/pyexecutor/llm_request.py (2)
  • LlmRequest (282-424)
  • get_draft_token_length (558-569)
tensorrt_llm/_torch/speculative/mtp.py (1)
  • sample_async (271-318)
tensorrt_llm/_torch/pyexecutor/scheduler.py (2)
  • ScheduledRequests (18-39)
  • all_requests (38-39)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/sampler.py

1359-1359: Line too long (125 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)

1235-1244: dim_order() usage OK

Public API; assertions ensure canonical layout before view/scatter.

Contributor

@coderabbitai coderabbitai[bot] left a comment
Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)

885-897: Use py_seq_slot consistently; seq_slot may be different/undefined here.

Other TorchSampler paths use py_seq_slot. Mixing the two risks indexing the wrong slot or an AttributeError.

Apply:

```diff
-            new_tokens[i, request.seq_slot, self.BEAM] = new_token
+            new_tokens[i, request.py_seq_slot, self.BEAM] = new_token
@@
-            new_tokens[num_accepted, request.seq_slot, self.BEAM] = new_token
+            new_tokens[num_accepted, request.py_seq_slot, self.BEAM] = new_token
```
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)

1-2: Add required NVIDIA copyright header (current year).

Repository guideline requires the header on all source files.

Apply:

```diff
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+
 import dataclasses
 import enum
```
🧹 Nitpick comments (5)
tensorrt_llm/_torch/pyexecutor/sampler.py (5)

184-191: Guard top-k against k > vocab_size.

Prevents runtime errors when configs overspecify k.

Apply:

```diff
-    if top_k > 0:
-        values, indices = torch.topk(logits, top_k, dim=-1)
+    if top_k > 0:
+        top_k = min(top_k, vocab_size)
+        values, indices = torch.topk(logits, top_k, dim=-1)
```

249-256: Same guard for top-k in top-k+top-p path.

Apply:

```diff
-    if top_k > 0:
-        values, _ = torch.topk(logits, top_k, dim=-1)
+    if top_k > 0:
+        top_k = min(top_k, vocab_size)
+        values, _ = torch.topk(logits, top_k, dim=-1)
```
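The clamp both suggestions ask for can be shown in a pure-Python sketch; `heapq.nlargest` is a hypothetical stand-in for `torch.topk`, and ties at the threshold are all kept:

```python
import heapq


def top_k_filter(logits, top_k, fill=float("-inf")):
    """Keep only the top_k largest logits, masking the rest with `fill`.

    Pure-Python sketch: clamping k to the vocabulary size mirrors the
    suggested guard, so an overspecified k cannot raise.
    """
    vocab_size = len(logits)
    k = min(top_k, vocab_size)  # the guard under discussion
    # Smallest value that still belongs to the top-k set.
    threshold = min(heapq.nlargest(k, logits))
    return [x if x >= threshold else fill for x in logits]
```

With the clamp in place, a config that requests k = 50 against a 3-entry vocabulary simply degrades to keeping everything instead of erroring out.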

778-783: Make generator device-robust (multi-GPU).

Reinitialize when device changes instead of asserting.

Apply:

```diff
-        if self._generator is None:
-            # Fallback to a default seed if not set
-            self._generator = torch.Generator(device=device)
-            self._generator.manual_seed(self._global_seed)
-        assert self._generator.device == device
+        if self._generator is None or self._generator.device != device:
+            self._generator = torch.Generator(device=device)
+            self._generator.manual_seed(self._global_seed)
         return self._generator
```

1026-1066: Prefer id()-based dedup for clarity; avoid relying on tensor keys.

Identity hashing of tensors is fine, but explicit id() clarifies intent and decouples from tensor-object key semantics.

Apply:

```diff
-        bias_to_index: dict[torch.Tensor,
-                            int] = defaultdict(provision_bias_index)
+        bias_to_index: dict[int, int] = defaultdict(provision_bias_index)
+        biases: list[torch.Tensor] = []
@@
-            if req_bias is not None:
+            if req_bias is not None:
                 logits_bias_mask[i:(i + steps)] = True
-                req_bias_index = bias_to_index[req_bias]
+                key = id(req_bias)
+                req_bias_index = bias_to_index[key]
+                if req_bias_index == len(biases):
+                    biases.append(req_bias)
                 bias_gather_indices.extend(repeat(req_bias_index, steps))
@@
-        biases_tensor = torch.empty((len(bias_to_index), *req_bias.shape),
-                                    pin_memory=True)
-        biases_tensor = torch.stack(
-            tuple(bias_to_index.keys()),
-            out=biases_tensor,
-        )
+        biases_tensor = torch.empty((len(biases), *req_bias.shape), pin_memory=True)
+        biases_tensor = torch.stack(tuple(biases), out=biases_tensor)
```

52-56: Annotate optional fields as Optional[...] for type correctness.

Current defaults are None but annotations are non-optional.

Apply:

```diff
 class SampleState:
     scheduled_requests: ScheduledRequests
-
-    device: SampleStateTensors = None
-    host: SampleStateTensors = None
-
-    sampler_event: torch.cuda.Event = None
+    device: Optional[SampleStateTensors] = None
+    host: Optional[SampleStateTensors] = None
+    sampler_event: Optional[torch.cuda.Event] = None
```
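The annotation fix being suggested can be sketched with placeholder field types (list/dict/object stand in for the real ScheduledRequests, SampleStateTensors, and torch.cuda.Event, which are not reproduced here):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SampleState:
    """Sketch of the suggested fix: fields that default to None are
    annotated Optional[...] instead of the bare (non-optional) type.
    Field types are placeholders, not the real TRT-LLM classes."""

    scheduled_requests: list
    device: Optional[dict] = None
    host: Optional[dict] = None
    sampler_event: Optional[object] = None
```

Static type checkers then accept the `None` defaults instead of flagging an `Optional`/non-`Optional` mismatch.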
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between fae40c5 and 81660f3.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/pyexecutor/sampler.py (24 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in __init__
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
🧠 Learnings (9)
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam (PR: NVIDIA/TensorRT-LLM#6487, File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12, Timestamp: 2025-08-06T13:58:07.506Z). Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-25T22:42:47.612Z
Learnt from: CR (PR: NVIDIA/TensorRT-LLM#0, File: CODING_GUIDELINES.md:0-0, Timestamp: 2025-08-25T22:42:47.612Z). Learning: Applies to **/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py} : Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-27T15:03:57.127Z
Learnt from: ixlmar (PR: NVIDIA/TensorRT-LLM#7294, File: tensorrt_llm/_torch/pyexecutor/sampler.py:368-392, Timestamp: 2025-08-27T15:03:57.127Z). Learning: In TensorRT-LLM's sampler.py, int32 usage for softmax_indices and related tensor indexing is intentional and should not be changed to int64. The torch.IntTensor type hint is correct for the sample() function's softmax_indices parameter.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-27T14:52:26.324Z
Learnt from: ixlmar (PR: NVIDIA/TensorRT-LLM#7294, File: tensorrt_llm/_torch/pyexecutor/sampler.py:985-1064, Timestamp: 2025-08-27T14:52:26.324Z). Learning: In TensorRT-LLM's _apply_embedding_bias() method, tensor deduplication should use id() instead of the tensor itself as a dict key to avoid scanning tensor contents and for better performance.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-27T14:47:49.339Z
Learnt from: ixlmar (PR: NVIDIA/TensorRT-LLM#7294, File: tensorrt_llm/_torch/pyexecutor/sampler.py:1229-1238, Timestamp: 2025-08-27T14:47:49.339Z). Learning: torch.Tensor.dim_order is a public API in PyTorch that returns the dimension order of a tensor, documented at https://docs.pytorch.org/docs/stable/generated/torch.Tensor.dim_order.html. It supports an ambiguity_check parameter that can be a boolean or list of memory formats.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-11T20:09:24.389Z
Learnt from: achartier (PR: NVIDIA/TensorRT-LLM#6763, File: tests/integration/defs/triton_server/conftest.py:16-22, Timestamp: 2025-08-11T20:09:24.389Z). Learning: In the TensorRT-LLM test infrastructure, the team prefers simple, direct solutions (like hard-coding directory traversal counts) over more complex but robust approaches when dealing with stable directory structures. They accept the maintenance cost of updating tests if the layout changes.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-21T00:16:56.457Z
Learnt from: farshadghodsian (PR: NVIDIA/TensorRT-LLM#7101, File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36, Timestamp: 2025-08-21T00:16:56.457Z). Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu (PR: NVIDIA/TensorRT-LLM#6303, File: tests/integration/test_lists/qa/examples_test_list.txt:494-494, Timestamp: 2025-07-28T17:06:08.621Z). Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
📚 Learning: 2025-08-27T15:37:38.228Z
Learnt from: ixlmar (PR: NVIDIA/TensorRT-LLM#7294, File: tensorrt_llm/_torch/pyexecutor/sampler.py:991-1070, Timestamp: 2025-08-27T15:37:38.228Z). Learning: PyTorch tensors are hashable as dict keys because hash(torch.Tensor) is equivalent to id(torch.Tensor) and uses identity-based hashing, not content-based hashing. This means tensor hashing doesn't scan tensor contents and tensors can be safely used as dictionary keys.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/sampler.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (3)
tensorrt_llm/_torch/pyexecutor/llm_request.py (2)
  • LlmRequest (282-424)
  • get_draft_token_length (558-569)
tensorrt_llm/_torch/speculative/mtp.py (1)
  • sample_async (271-318)
tensorrt_llm/_torch/pyexecutor/scheduler.py (2)
  • ScheduledRequests (18-39)
  • all_requests (38-39)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/sampler.py

1362-1362: Line too long (125 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/_torch/pyexecutor/sampler.py (2)

359-371: Int32 indexing usage is consistent with established sampler design.

Acknowledging prior discussion: keeping int32 for index tensors here is intentional and works in this codepath.

Also applies to: 282-292, 1188-1194, 1301-1305, 430-488, 504-523, 1238-1247, 1260-1267


1229-1247: Use of torch.Tensor.dim_order is valid.

Public API; check is correct for canonical row-major ordering.

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16712 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12543 completed with status: 'FAILURE'

@ixlmar
Copy link
CollaboratorAuthor

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16722 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16722 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12552 completed with status: 'FAILURE'

Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
auto-merge was automatically disabled September 22, 2025 09:39

Head branch was pushed to by a user without write access

@ixlmar
Copy link
CollaboratorAuthor

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19563 [ run ] triggered by Bot

Collaborator

@DomBrown left a comment

LGTM

Collaborator

@Superjomn left a comment

LGTM on the llmapi changes.

Collaborator

@syuoni left a comment

LGTM

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19563 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14707 completed with status: 'FAILURE'

@chzblych
Copy link
Collaborator

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19638 [ run ] triggered by Bot

@ixlmar
Copy link
CollaboratorAuthor

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19666 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19638 [ run ] completed with state ABORTED
LLM/main/L0_MergeRequest_PR #14773 (Blue Ocean) completed with status: ABORTED

@dcampora enabled auto-merge (squash) September 23, 2025 09:03
@tensorrt-cicd
Copy link
Collaborator

PR_Github #19666 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14799 completed with status: 'SUCCESS'

@dcampora merged commit 9970345 into NVIDIA:main on Sep 23, 2025
5 checks passed
@ixlmar deleted the feat/batched-sampling branch September 24, 2025 08:16

Reviewers

@Funatiq left review comments

@coderabbitai[bot] left review comments

@Superjomn approved these changes

@dcampora approved these changes

@DomBrown approved these changes

@litaotju approved these changes

@lucaslie approved these changes

@byshiue approved these changes

@syuoni approved these changes

@QiJune approved these changes

@schetlur-nv approved these changes

Awaiting requested review from @netanel-haber (code owner automatically assigned from NVIDIA/trt-llm-noncommitted-api-review-committee)

Awaiting requested review from @kris1025 (code owner automatically assigned from NVIDIA/trt-llm-torch-spec-decoding)

Awaiting requested review from @achartier (code owner automatically assigned from NVIDIA/trt-llm-torch-runtime-devs)

Awaiting requested review from @Wanli-Jiang (automatically assigned from NVIDIA/trt-llm-torch-models-devs)

Assignees

No one assigned

Labels

None yet

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

13 participants

@ixlmar @tensorrt-cicd @chzblych @Superjomn @dcampora @DomBrown @litaotju @lucaslie @byshiue @Funatiq @syuoni @QiJune @schetlur-nv
