Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

[None][chore] Mass integration of release/1.0 - 3rd#7519

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged
nv-guomingz merged 45 commits intoNVIDIA:mainfromdominicshanshan:mi-release-1.0-3
Sep 8, 2025

Conversation

@dominicshanshan
Copy link
Collaborator

@dominicshanshandominicshanshan commentedSep 4, 2025
edited
Loading

Summary by CodeRabbit

  • New Features

    • Introduced BuildConfig; LLM now accepts build_config and max_seq_len.
    • Added CUDA graph buffer reuse in attention backend to enable preallocated buffers.
  • Performance

    • Reduced GPU allocations and startup overhead via buffer caching.
    • Auto-injects max_batch_size into PyTorch cuda_graph_config when missing.
  • Bug Fixes

    • Improved accuracy/stability by disabling a problematic GEMM path for specific models.
    • Block unsupported GPUs for a fused MoE path.
    • Added stricter validation in disaggregated serving.
    • Non-blocking queue sends now retry; cleaner shutdown to avoid hangs.
  • Documentation

    • Metrics endpoint guidance updated; examples and quickstart refreshed.

Description

As discussed with some PR author,#6946@PerkzZheng will not be cherry-pick,#7191@Shixiaowei02 will not be cherry-pick,#7193 will not be cherry-pick@crazydemo,#7149@Wanli-Jiang will not be cherry-pick, those PRs already have merged in main with different PRs in main branch.

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR FollowsTRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (seetest instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run/bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast(OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: DoesNOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx"(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: DoesNOT update GitHub check status.

--gpu-type "A30, H100_PCIe"(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: DoesNOT update GitHub check status.

--test-backend "pytorch, cpp"(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: DoesNOT update GitHub pipeline status.

--only-multi-gpu-test(OPTIONAL) : Only run the multi-GPU tests. Note: DoesNOT update GitHub check status.

--disable-multi-gpu-test(OPTIONAL) : Disable the multi-GPU tests. Note: DoesNOT update GitHub check status.

--add-multi-gpu-test(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug(OPTIONAL) :Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in thestage-list parameter to access the appropriate container environment. Note: DoesNOT update GitHub check status.

For guidance on mapping tests to stage names, seedocs/source/reference/ci-overview.md
and thescripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request.--comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@coderabbitai
Copy link
Contributor

coderabbitaibot commentedSep 4, 2025
edited
Loading

📝 Walkthrough
✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat withCodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag@coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag@coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on oursupport page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type@coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add@coderabbitai ignore or@coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add@coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add@coderabbitai or@coderabbitai title anywhere in the PR title to generate the title automatically.

Status, Documentation and Community

  • Visit ourStatus Page to check the current availability of CodeRabbit.
  • Visit ourDocumentation for detailed information on how to use CodeRabbit.
  • Join ourDiscord Community to get help, request features, and share feedback.
  • Follow us onX/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitaicoderabbitaibot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Actionable comments posted: 14

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (11)
tests/unittest/_torch/multi_gpu_modeling/test_deepseek.py (1)

1-3:Add required NVIDIA copyright header (2025)

Guidelines require the NVIDIA copyright header on all .py files.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.+# SPDX-License-Identifier: Apache-2.0 import asyncio from difflib import SequenceMatcher from pathlib import Path
cpp/tensorrt_llm/batch_manager/logitsPostProcessor.cpp (1)

73-81:UB/crash: dereferencing possibly empty optionallogitsPostProcessorBatched

If any request setsmApplyLogitsPostProcessorBatched and the caller didn’t pass a batched processor (uses defaultnullopt),(*logitsPostProcessorBatched)(...) is undefined behavior.

Apply:

     // Invoke batched processor     if (!reqIdsVec.empty())     {         logitsPostProcessorIsApplied = true;-        if (replicateLogitsPostProcessor || worldConfig.isFirstTensorParallelRank())-        {-            (*logitsPostProcessorBatched)(reqIdsVec, logitsVec, beamTokensVec, stream, clientIdsVec);-        }+        if (!logitsPostProcessorBatched)+        {+            TLLM_THROW("Batched logits post-processor requested but not provided.");+        }+        if (replicateLogitsPostProcessor || worldConfig.isFirstTensorParallelRank())+        {+            (*logitsPostProcessorBatched)(reqIdsVec, logitsVec, beamTokensVec, stream, clientIdsVec);+        }     }
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (1)

246-257:Data race: reading mReadyResponses without holding its mutex

!mReadyResponses.empty() is checked withoutmResponderMutex. Access tomReadyResponses must be protected consistently.

-                if (!isSending() && !mReadyResponses.empty())+                bool hasReady = false;+                {+                    std::unique_lock lk(mResponderMutex);+                    hasReady = !mReadyResponses.empty();+                }+                if (!isSending() && hasReady)                 {                     auto const& requestInfo = mSender->recvRequestInfo();
tests/integration/defs/disaggregated/test_disaggregated.py (2)

297-301:Avoid masking exceptions if process startup fails.

server_proc/workers_proc may be undefined if context setup throws, causing UnboundLocalError in finally.

-    finally:-        server_proc.terminate()-        workers_proc.terminate()-        server_proc.wait()-        workers_proc.wait()+    finally:+        sp = locals().get("server_proc")+        wp = locals().get("workers_proc")+        if sp is not None:+            sp.terminate()+            sp.wait()+        if wp is not None:+            wp.terminate()+            wp.wait()

1289-1293:Same cleanup robustness for benchmark runner.

Mirror the safe termination pattern here as well.

-    finally:-        server_proc.terminate()-        workers_proc.terminate()-        server_proc.wait()-        workers_proc.wait()+    finally:+        sp = locals().get("server_proc")+        wp = locals().get("workers_proc")+        if sp is not None:+            sp.terminate()+            sp.wait()+        if wp is not None:+            wp.terminate()+            wp.wait()
tensorrt_llm/llmapi/mpi_session.py (1)

418-449:Fix race on results aggregation and avoid silent drops when send retries exhaust.

  • self.results.append(...) and the length check run in multiple callbacks without synchronization; two near-simultaneous completions can both observe the “all done” condition and send twice, or clear while the other thread is still pickling, yielding empty/partial payloads.
  • ZeroMqQueue.put_noblock now swallows EAGAIN after retries and only logs an error; the surrounding try/except won’t detect failure. Clearingself.results unconditionally can lose results and cause the client to hang indefinitely.

Apply this diff to make the callback idempotent and thread-safe, and to avoid clearing until after a successful handoff. It also aligns retries with other call sites and adds a small backoff:

@@-    def mpi_future_callback(self, future):-        print_colored_debug(f"rank{global_mpi_rank()} got future: {future}\n",-                            "red")-        if future.exception() is not None:-            print_colored_debug(-                f"mpi_future got exception: {future.exception()}, quitting\n",-                "red")-            self.queue.put(future.exception())-            return--        result = future.result()-        self.results.append(result)-        print_colored_debug(-            f"RemoteMpiCommSessionServer working status: {len(self.results)}/{self.num_results}\n",-            "grey")-        if len(self.results) == self.num_results:-            print_colored_debug(-                f"RemoteMpiCommSessionServer received all results, sending to client\n",-                "green")-            try:-                self.queue.put_noblock(self.results, retry=2)-            except zmq.ZMQError as e:-                # The client could be shutdown first.-                if e.errno == zmq.EAGAIN:-                    pass-                else:-                    raise e--            print_colored_debug(-                f"RemoteMpiCommSessionServer sent results to client\n", "green")-            self.results.clear()+    def mpi_future_callback(self, future):+        print_colored_debug(f"rank{global_mpi_rank()} got future: {future}\n", "red")++        to_send = None+        # Aggregate results and decide send-once under a lock+        with self._results_lock:+            if future.exception() is not None:+                print_colored_debug(+                    f"mpi_future got exception: {future.exception()}, quitting\n", "red")+                if not self._results_sent:+                    to_send = future.exception()+                    self._results_sent = True+            else:+                result = future.result()+                self.results.append(result)+                print_colored_debug(+                    f"RemoteMpiCommSessionServer working status: {len(self.results)}/{self.num_results}\n",+                    "grey")+                if len(self.results) == self.num_results and not self._results_sent:+                    to_send = list(self.results)  # snapshot to avoid mutation while pickling+                    self._results_sent = True++        if to_send is None:+            return++        print_colored_debug(+            f"RemoteMpiCommSessionServer sending {'exception' if not isinstance(to_send, list) else 'results'} to client\n",+            "green")+        try:+            # Slightly longer retry for robustness; keep non-blocking semantics in callback context+            self.queue.put_noblock(to_send, retry=4, wait_time=0.01)+        except zmq.ZMQError as e:+            if e.errno != zmq.EAGAIN:+                raise+        else:+            if isinstance(to_send, list):+                # Clear only after handoff attempt; we can't detect failure with current API, so rely on idempotent send-once gate.+                with self._results_lock:+                    self.results.clear()+        print_colored_debug(f"RemoteMpiCommSessionServer send complete\n", "green")

Add the following support changes outside this hunk:

# In RemoteMpiCommSessionServer.__init__ (near line ~356)self._results_lock=threading.Lock()self._results_sent=False
# In RemoteMpiCommSessionServer.serve() just before scheduling futures (near lines ~409-416)withself._results_lock:self.results= []self._results_sent=False

Optionally, consider makingZeroMqQueue.put_noblock return a bool to signal success so the server can decide whether to retain and retry later instead of unconditionally proceeding. I can draft that change if desired.

tests/integration/defs/perf/pytorch_model_config.py (1)

30-31:Fix Python 3.8 typing: list[str] is not supported without postponed evaluation.

The repository targets Python 3.8+. Using list[str] will error at import time on 3.8. Switch to typing.List.

-def get_model_yaml_config(model_label: str,-                          lora_dirs: list[str] = None) -> dict:+from typing import Optional, List  # at file top++def get_model_yaml_config(model_label: str,+                          lora_dirs: Optional[List[str]] = None) -> dict:

Add the import near the other imports in this module if not present:

fromtypingimportOptional,List
tensorrt_llm/_torch/attention_backend/interface.py (1)

304-306:Also propagate buffers (and max_draft_tokens) to cross sub-metadata.

Without this, cross metadata won’t share the preallocated buffers and may reallocate.

-            cuda_graph_metadata.cross = cuda_graph_metadata.cross.create_cuda_graph_metadata(-                max_batch_size, True)+            cuda_graph_metadata.cross = cuda_graph_metadata.cross.create_cuda_graph_metadata(+                max_batch_size,+                sub_cross_metadata=True,+                max_draft_tokens=max_draft_tokens,+                buffers=buffers,+            )
tensorrt_llm/_torch/modules/gated_mlp.py (1)

1-3:Add NVIDIA copyright header (2025).

This file is missing the required NVIDIA copyright header for 2025.

Apply at file top:

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
tests/integration/defs/disaggregated/test_disaggregated_single_gpu.py (2)

1-5:Add NVIDIA copyright header (2025).

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

138-139:Disconnect intercommunicator on worker shutdown to prevent resource leaks.

     llm.shutdown()+    try:+        intercomm.Disconnect()+    except Exception as e:+        print(f"Worker {rank} intercomm disconnect error: {e}", flush=True)
🧹 Nitpick comments (50)
tests/unittest/_torch/multi_gpu_modeling/test_deepseek.py (1)

20-20:Prefer xfail over skip to surface unexpected passes; also avoid invalid URL in reason

Using xfail keeps the test visible and reports XPASS if the bug is fixed unintentionally. Drop the "https://" prefix to avoid broken-link checkers.

-@pytest.mark.skip(reason="https://nvbugs/5470782")+@pytest.mark.xfail(reason="nvbugs/5470782", strict=False)
jenkins/L0_Test.groovy (2)

2275-2282:Refactor is fine; prefer not to reach back intovalues here

Usingvalues[2] is behaviorally equivalent tocpu_arch, but it couples this inner block to the outer list indexing. For readability and to avoid accidental drift, use the already-derivedcpu_arch.

Also, please double-check that the CUDA repo path for Ubuntu 24.04 should remainsbsa for ARM. If NVIDIA switched toarm64 for 24.04, this would break.

Apply:

-            def ubuntu_version = key.contains("UB2404") ? "ubuntu2404" : "ubuntu2204"-            def platform = values[2] == X86_64_TRIPLE ? "x86_64" : "sbsa"+            def ubuntu_version = key.contains("UB2404") ? "ubuntu2404" : "ubuntu2204"+            def platform = cpu_arch == X86_64_TRIPLE ? "x86_64" : "sbsa"

351-381:UnifydockerGPUOption naming (straydockerGpuOption is unused)

There’s a casing mismatch:dockerGpuOption is declared butdockerGPUOption is used/assigned. Keep one name to avoid confusion and accidental scope issues.

-                def dockerGpuOption = ""+                def dockerGPUOption = ""@@-                    dockerGPUOption = sh(script: """+                    dockerGPUOption = sh(script: """
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (1)

6-7:Fail fast on unsupported SM and include details in the error

Good guard. Two improvements:

  • Move the SM check before super().init to avoid unnecessary init on unsupported GPUs.
  • Include the detected SM in the error to aid debugging. Also confirm intent vs. docstring (“SM100 only”) since the guard only blocks SM>=120.

Apply within this block to improve the message:

-        sm_version = get_sm_version()-        if sm_version >= 120:-            raise NotImplementedError(-                "TRTLLMGenFusedMoE does not support SM120 and above.")+        sm_version = get_sm_version()+        if sm_version >= 120:+            raise NotImplementedError(+                f"TRTLLMGenFusedMoE does not support SM{sm_version} (>= 120).")

And move the check to run before super().init (outside the selected range, example placement shown):

# Place right at the start of __init__, before calling super().__init__sm_version=get_sm_version()ifsm_version>=120:raiseNotImplementedError(f"TRTLLMGenFusedMoE does not support SM{sm_version} (>= 120).")

Also applies to: 83-87

tensorrt_llm/serve/openai_disagg_server.py (1)

325-327:Strengthen ctx_request_id validation

The None-check is good; also validate type and positivity to prevent downstream issues in metrics and routing.

-        if choices[0].disaggregated_params.ctx_request_id is None:-            raise ValueError("Invalid disaggregated params in context phase response.")+        ctx_id = choices[0].disaggregated_params.ctx_request_id+        if ctx_id is None or not isinstance(ctx_id, int) or ctx_id <= 0:+            raise ValueError("Invalid ctx_request_id in context phase response.")
tensorrt_llm/_torch/model_config.py (1)

497-499:Guard against non-divisible MLP size before TP split

Dividing by tp_size without checking divisibility can silently truncate and misconfigure bindings.

-                    mlp_hidden_size = self._infer_nemotron_ffn_mult(-                    ) // self.mapping.tp_size+                    mlp_hidden_size_global = self._infer_nemotron_ffn_mult()+                    if mlp_hidden_size_global % self.mapping.tp_size != 0:+                        raise ValueError(+                            f"Inferred mlp_hidden_size {mlp_hidden_size_global} is not divisible by tp_size {self.mapping.tp_size}"+                        )+                    mlp_hidden_size = mlp_hidden_size_global // self.mapping.tp_size
tests/integration/test_lists/waives.txt (1)

291-297:Dedupe duplicate SKIP entries

  • tests/integration/test_lists/waives.txt lines 244 & 283: llmapi/test_llm_examples.py::test_llmapi_speculative_decoding_mtp
  • tests/integration/test_lists/waives.txt lines 264 & 291: test_e2e.py::test_ptp_quickstart_multimodal[llava-v1.6-mistral-7b-llava-v1.6-mistral-7b-hf-image-False]
    Remove one occurrence of each to keep the waiver list concise.
tensorrt_llm/bench/dataclasses/configuration.py (1)

87-104:Add NVIDIA copyright header.

Per repo guidelines, prepend the 2025 NVIDIA header at file top.

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.+#+""" existing imports...
tensorrt_llm/executor/ipc.py (2)

128-158:Polish put_noblock: docstring D205 and avoid assert for validation.

  • Fix D205: add a blank line after the summary line.
  • Replaceassert withValueError (asserts can be stripped with -O).
  • Optional: use a loop instead of recursion; include queue name in the error for easier traceability.
     def put_noblock(self,                     obj: Any,                     *,                     retry: int = 1,                     wait_time: float = 0.001):-        '''-        Put an object into the queue without blocking, and retry if the send fails.-        NOTE: It won't raise any error if the send fails.+        '''+        Put an object into the queue without blocking, and retry if the send fails.++        NOTE: It won't raise any error if the send fails.          Parameters:             obj (Any): The object to send.             retry (int): The number of times to retry sending the object.             wait_time (float): The time to wait before retrying.         '''-        assert retry >= 0 and retry <= 10, "Retry must be between 0 and 10, adjust the wait_time if needed"+        if not (0 <= retry <= 10):+            raise ValueError("retry must be between 0 and 10; increase wait_time if needed")          self.setup_lazily()         with nvtx_range_debug("send", color="blue", category="IPC"):             data = pickle.dumps(obj)  # nosec B301             if self.use_hmac_encryption:                 data = self._sign_data(data)-            try:-                self.socket.send(data, flags=zmq.NOBLOCK)-            except zmq.Again:-                if retry > 0:-                    time.sleep(wait_time)-                    self.put_noblock(obj, retry=retry - 1, wait_time=wait_time)-                else:-                    logger.error(f"Failed to send object: {obj}")+            while True:+                try:+                    self.socket.send(data, flags=zmq.NOBLOCK)+                    break+                except zmq.Again:+                    if retry > 0:+                        retry -= 1+                        time.sleep(wait_time)+                        continue+                    logger.error(f"[{self.name or 'ipc'}] Failed to send object: {type(obj).__name__}")+                    break

1-10:Add NVIDIA copyright header.

This file is missing the standard header; please add.

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.+# """ existing imports...
examples/llm-api/_tensorrt_engine/quickstart_example.py (2)

7-9:Prefer constructor kwargs (if supported) and keep namespaced usage

Cleaner and immutable config if BuildConfig supports kwargs; also aligns with namespaced import.

Apply:

-    build_config = BuildConfig()-    build_config.max_batch_size = 256-    build_config.max_num_tokens = 1024+    build_config = trtllm.BuildConfig(max_batch_size=256, max_num_tokens=1024)

If kwargs aren’t supported, keep attribute assignment but usetrtllm.BuildConfig().


13-14:Use namespaced symbols consistently

KeepSamplingParams and any other symbols under the module namespace per guidelines.

Apply:

-    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",-              build_config=build_config)+    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",+              build_config=build_config)@@-    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)+    sampling_params = trtllm.SamplingParams(temperature=0.8, top_p=0.95)

Also applies to: 24-24

cpp/tensorrt_llm/batch_manager/logitsPostProcessor.cpp (1)

42-47:Pre-size vectors to reduce reallocations in hot path

Reserve capacity to avoid repeated allocations when many requests use batched processing.

Apply:

     // Arguments for batched processor     std::vector<LlmRequest::RequestIdType> reqIdsVec;     std::vector<LlmRequest::TensorPtr> logitsVec;     std::vector<std::reference_wrapper<LlmRequest::BeamTokens const>> beamTokensVec;     std::vector<std::optional<LlmRequest::RequestIdType>> clientIdsVec;+    reqIdsVec.reserve(inputBuffers.decoderRequests.size());+    logitsVec.reserve(inputBuffers.decoderRequests.size());+    beamTokensVec.reserve(inputBuffers.decoderRequests.size());+    clientIdsVec.reserve(inputBuffers.decoderRequests.size());

Also applies to: 66-71

cpp/include/tensorrt_llm/batch_manager/logitsPostProcessor.h (4)

50-50:Avoid const& optional with std::nullopt default; also note ABI change

Binding a const& to a temporary optional via the defaultstd::nullopt is legal but non-obvious and changes the public ABI. Passing by value is simpler, avoids lifetime surprises, and copies at most one small optional. If the intent is to avoid copying the containedstd::function, prefer pass-by-value and move inside the definition.

-        std::optional<LogitsPostProcessorBatched> const& logitsPostProcessorBatched = std::nullopt) const;+        std::optional<LogitsPostProcessorBatched> logitsPostProcessorBatched = std::nullopt) const;

Please confirm no out-of-tree consumers rely on the previous by-value signature.


1-16:Update copyright year to include 2025

Repo guideline says prepend current year. Header shows 2022-2024.

- * SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.+ * SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

18-18:Headers should use include guards, not only#pragma once

Guideline: use guards of the formTRTLLM_<FILENAME>_H.

Example (apply at file top/bottom):

-#pragma once+#ifndef TRTLLM_LOGITSPOSTPROCESSOR_H+#define TRTLLM_LOGITSPOSTPROCESSOR_H ...+#endif // TRTLLM_LOGITSPOSTPROCESSOR_H

44-44:Constant naming style

Guideline: constants use k prefix with UPPER_SNAKE. Consider renamingname tokLOGITS_POST_PROCESSOR_NAME.

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (2)

172-176:Ensure background task termination and surface exceptions

Destructor setsmTerminate but doesn’t explicitly join/observemResponseFuture. It’s safer to wait and surface exceptions fromresponse().

     ~Impl()     {         terminate();+        if (mResponseFuture.valid()) {+            mResponseFuture.get(); // join and surface any exception+        }     }

213-221:Detached per-send threads may balloon; prefer a pool or reuse async worker

Current path spawns a detached thread per response (when parallel send is enabled). This can cause unbounded thread creation under load.

  • Reuse a fixed-size thread pool (or reuse the existing async worker).
  • Alternatively, gate parallel sends with a semaphore.
docs/source/commands/trtllm-serve/trtllm-serve.rst (2)

204-210:Grammar/clarity fixes for PyTorch metrics note

Subject–verb agreement and clarity.

-   The metrics endpoint for the default PyTorch backend are in beta and are not as comprehensive as those for the TensorRT backend.-   Some fields, such as CPU memory usage, are not yet available for the PyTorch backend.-   Enabling ``enable_iter_perf_stats`` in the PyTorch backend can slightly impact performance, depending on the serving configuration.+   The metrics endpoint for the default PyTorch backend is in beta and is not as comprehensive as the TensorRT backend.+   Some fields (for example, CPU memory usage) are not yet available for the PyTorch backend.+   Enabling ``enable_iter_perf_stats`` in the PyTorch backend can slightly impact performance, depending on the serving configuration.

219-227:Tighten wording around polling guidance

Minor style improvements and remove passive voice.

-After sending at least one inference request to the server, you can fetch runtime iteration statistics by polling the ``/metrics`` endpoint.-Since the statistics are stored in an internal queue and removed once retrieved, it's recommended to poll the endpoint shortly after each request and store the results if needed.+After you send at least one inference request, poll the ``/metrics`` endpoint to fetch runtime-iteration statistics.+Because statistics are kept in an internal queue and removed once retrieved, poll shortly after each request and persist results if needed.
tensorrt_llm/llmapi/utils.py (4)

521-538:Prevent duplicate status tags in generated docs

If ApiParamTagger already prefixes the field description with a tag, this adds a second identical tag on the arg line. Guard to avoid duplication.

-        arg_line = f"{indent}    {field_name} ({type_str}): "-        if status := field_info.get("status", None):-            arg_line += f":tag:`{status}` "+        arg_line = f"{indent}    {field_name} ({type_str}): "+        if status := field_info.get("status", None):+            # Avoid duplicating tag if description already contains it+            if not (field_description or "").lstrip().startswith(f":tag:`{status}`"):+                arg_line += f":tag:`{status}` "

484-485:Future-proof Pydantic schema access

Prefer model_json_schema() in Pydantic v2 with a v1 fallback.

-    schema = model.schema()+    schema = (model.model_json_schema()+              if hasattr(model, "model_json_schema")+              else model.schema())

562-565:Fix D200: one-line docstring formatting

Remove leading space and use concise wording.

-        """ The main entry point to tag the api doc. """+        """The main entry point to tag the API doc."""

574-589:Handle None descriptions and idempotency in description tagging

If description is None, current code yields "None" in docs; also re-tagging can double-prefix.

-        for field_name in field_names:-            field = cls.model_fields[field_name]-            cls.model_fields[-                field_name].description = f":tag:`{tag}` {field.description}"+        for field_name in field_names:+            field = cls.model_fields[field_name]+            desc = (field.description or "").strip()+            prefix = f":tag:`{tag}` "+            if not desc.startswith(prefix):+                desc = prefix + desc+            cls.model_fields[field_name].description = desc         cls.model_rebuild(force=True)
tests/unittest/llmapi/test_utils.py (1)

31-34:Strengthen assertion to reduce false positives

Also assert that at least one specific LlmArgs field line carries the tag.

 def test_generate_api_docs_as_docstring():-    doc = generate_api_docs_as_docstring(LlmArgs)-    assert ":tag:`beta`" in doc, "the label is not generated"+    doc = generate_api_docs_as_docstring(LlmArgs)+    assert ":tag:`beta`" in doc, "the label is not generated"+    # Optional: check a representative field is tagged+    assert re.search(r"\s+world_size.*:.*:tag:`beta`", doc), "world_size not tagged"
tests/integration/defs/test_e2e.py (1)

1746-1753:Advance regex cursor using match.end() to avoid re-scan

Slicing from the end of group(1) re-scans the inner text; using match.end() is cleaner and marginally faster.

-            _, end = match.span(1)-            results.append(match.group(2))-            item = item[end:]+            results.append(match.group(2))+            item = item[match.end():]
tensorrt_llm/_torch/autotuner.py (2)

372-375:Log dedup key scope (optional).

If you want one warning per (op, shape) instead of per op, include shapes in the key. Current change is correct; this is optional.

-                logger.warning_once(-                    f"[Autotuner] Using the fallback tactic, due to cache miss on input shapes={input_shapes}",-                    key=custom_op)+                logger.warning_once(+                    f"[Autotuner] Using the fallback tactic, due to cache miss on input shapes={input_shapes}",+                    key=("cache_miss", custom_op, input_shapes))

1-1:License header consistency (nit).

Other Python files carry the NVIDIA SPDX header; this one doesn’t. Consider adding for consistency/compliance.

tensorrt_llm/executor/proxy.py (1)

352-355:Optional: tune wait_time to reduce spin.

Non-blocking retries default to 1ms sleeps. If shutdown storms persist, consider a slightly larger wait_time (e.g., 5–10ms).

-            self.request_queue.put_noblock(None, retry=4)+            self.request_queue.put_noblock(None, retry=4, wait_time=0.005)
tensorrt_llm/logger.py (1)

111-116:Avoid using assert for runtime argument validation.

asserts can be stripped with -O; prefer an explicit exception for API safety.

-    def log_once(self, severity, *msg, key):-        assert key is not None, "key is required for log_once"+    def log_once(self, severity, *msg, key):+        if key is None:+            raise ValueError("key is required for log_once")         if key not in self._appeared_keys:             self._appeared_keys.add(key)             self.log(severity, *msg)
tensorrt_llm/_torch/models/modeling_qwen3.py (2)

161-178:Avoid hard-coding deep_gemm disable; make it configurable or scoped.

Hard-codingdisable_deep_gemm = True globally may regress performance where the bug doesn’t manifest (e.g., non-FP8, non-SM100). Prefer a config/env toggle so it can be flipped without code changes.

Minimal change using an env var (defaults to disabled to preserve current behavior):

@@-        # Qwen3 has accuracy issues with deep_gemm (see: https://nvbugspro.nvidia.com/bug/5461712)-        disable_deep_gemm = True+        # Qwen3 has accuracy issues with deep_gemm (see: https://nvbugspro.nvidia.com/bug/5461712)+        # Allow override via env; default to disabling for safety.+        disable_deep_gemm = os.getenv("TLLM_DISABLE_DEEP_GEMM_QWEN3", "1") == "1"@@             config=model_config,             disable_deep_gemm=disable_deep_gemm,         )

Add import once at the top of the file:

importos

If you prefer, we can instead plumb adisable_deep_gemm flag throughModelConfig and only disable when FP8 is active andsm == 100.


251-261:Mirror the configurability for MLP.

Keep Attention/MLP behavior consistent by reusing the same toggle when constructingGatedMLP.

-        # Qwen3 has accuracy issues with deep_gemm (see: https://nvbugspro.nvidia.com/bug/5461712)-        disable_deep_gemm = True+        # Qwen3 has accuracy issues with deep_gemm+        disable_deep_gemm = os.getenv("TLLM_DISABLE_DEEP_GEMM_QWEN3", "1") == "1"@@             config=model_config,             disable_deep_gemm=disable_deep_gemm,         )
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

856-881:Balance NVTX ranges: add range_pop for the pushes in Stage 2.

Both branches push an NVTX range but never pop, which skews traces.

                 if previous_batch is not None:                     sample_state = previous_batch.sample_state                     if not self.dist.is_last_pp_rank:                         torch.cuda.nvtx.range_push(                             "_handle_new_tokens_inter_pp")@@                     else:                         torch.cuda.nvtx.range_push("_handle_new_tokens_last_pp")                         sample_state.sampler_event.synchronize()                      # Send tokens to next pp rank (w.r.t model forward direction)                     # Second last rank does not need to since last rank has original decoded tokens                     if not self.dist.is_second_last_pp_rank:                         self.wait_on_pp_send_handles(prev_microbatch_id)                         self.send_handles[                             prev_microbatch_id] = self.dist.isend_object(                                 sample_state.host,                                 dest=self.dist.next_pp_rank,                                 tag=prev_microbatch_id)+                    torch.cuda.nvtx.range_pop()
tensorrt_llm/_torch/modules/linear.py (2)

616-631:Guard looks correct; add a quick SM cache to avoid repeated device queries.

Minor: cachesm = get_sm_version() once per call (or at init) to avoid duplicate CUDA property queries in hot path.

-        if get_sm_version() == 100 and not module.disable_deep_gemm:+        sm = get_sm_version()+        if sm == 100 and not module.disable_deep_gemm:

1598-1599:Public API grew: documentdisable_deep_gemm.

Please add a short Google-style docstring entry for this constructor param and the new attribute to align with repo guidelines.

Also applies to: 1616-1617

tensorrt_llm/_torch/attention_backend/trtllm.py (1)

603-606:OK to delegate to buffer-aware post-init; consider type annotation.

_post_init_with_buffers(self, buffers) is fine. Consider annotatingbuffers: Optional[dict[str, list[torch.Tensor]]] for clarity.

tests/integration/defs/accuracy/test_llm_api_pytorch.py (5)

1568-1572:Consolidate SM 120/121 skip and tighten condition.

Use membership test and include actual SM in the message for quicker triage.

-        if moe_backend == "TRTLLM" and (get_sm_version() == 120-                                        or get_sm_version() == 121):-            pytest.skip(-                "MOE TRTLLM backend does not support SM version 120 or 121")+        if moe_backend == "TRTLLM":+            sm = get_sm_version()+            if sm in (120, 121):+                pytest.skip(f"MOE TRTLLM backend does not support SM version {sm}")

1621-1624:Apply same SM check refactor here.

Mirror the consolidated pattern to avoid duplication and repeated device queries.

-        if moe_backend == "TRTLLM" and (get_sm_version() == 120-                                        or get_sm_version() == 121):-            pytest.skip(-                "MOE TRTLLM backend does not support SM version 120 or 121")+        if moe_backend == "TRTLLM":+            sm = get_sm_version()+            if sm in (120, 121):+                pytest.skip(f"MOE TRTLLM backend does not support SM version {sm}")

1895-1899:Same: compress SM 120/121 guard.

Keep the message informative with the detected SM.

-        if moe_backend == "TRTLLM" and (get_sm_version() == 120-                                        or get_sm_version() == 121):-            pytest.skip(-                "MOE TRTLLM backend does not support SM version 120 or 121")+        if moe_backend == "TRTLLM":+            sm = get_sm_version()+            if sm in (120, 121):+                pytest.skip(f"MOE TRTLLM backend does not support SM version {sm}")

2523-2527:Same: unify the TRTLLM MOE skip.

Minor style and efficiency improvement.

-        if moe_backend == "TRTLLM" and (get_sm_version() == 120-                                        or get_sm_version() == 121):-            pytest.skip(-                "MOE TRTLLM backend does not support SM version 120 or 121")+        if moe_backend == "TRTLLM":+            sm = get_sm_version()+            if sm in (120, 121):+                pytest.skip(f"MOE TRTLLM backend does not support SM version {sm}")

2719-2723:Same: compress condition, show SM.

Align with other sites.

-        if moe_backend == "TRTLLM" and (get_sm_version() == 120-                                        or get_sm_version() == 121):-            pytest.skip(-                "MOE TRTLLM backend does not support SM version 120 or 121")+        if moe_backend == "TRTLLM":+            sm = get_sm_version()+            if sm in (120, 121):+                pytest.skip(f"MOE TRTLLM backend does not support SM version {sm}")
tensorrt_llm/_torch/modules/gated_mlp.py (3)

32-34:New flags accepted; document behavior and defaults.

Expose a brief docstring (init param section) foruse_cute_dsl_blockscaling_mm anddisable_deep_gemm to make their interaction clear.

Apply:

 class GatedMLP(nn.Module):@@-    def __init__(self,+    def __init__(self,                  *,                  hidden_size: int,@@-                 use_cute_dsl_blockscaling_mm: bool = False,-                 disable_deep_gemm: bool = False):+                 use_cute_dsl_blockscaling_mm: bool = False,+                 disable_deep_gemm: bool = False):+        """+        Args:+            use_cute_dsl_blockscaling_mm: Enable CUTE DSL block-scaling matmul path.+            disable_deep_gemm: Force-disable deep-GEMM (e.g., on SM100) for this MLP.+        """

108-118:Add a brief docstring for_apply_activation (newhas_lora).

Clarify that FP8 is disabled when LoRA is active due to grouped_gemm limitation.

-    def _apply_activation(self, x, *, has_lora: bool = False):+    def _apply_activation(self, x, *, has_lora: bool = False):+        """Apply activation.+        Args:+            has_lora: If True, route to non-FP8 path until LoRA grouped_gemm supports FP8.+        """

146-151:Explicitness nit: passhas_lora=False for readability.

No behavior change; improves clarity when scanning callsites.

-        h2 = self._apply_activation(h1)+        h2 = self._apply_activation(h1, has_lora=False)
tests/integration/defs/disaggregated/test_disaggregated_single_gpu.py (5)

67-73:Generalize termination to detected remote size.

Avoid hard-codeddest=0/1; iterate over remote group.

 def mpi_send_termination_request(intercomm):     if intercomm is not None:-        # Send termination requests-        intercomm.send(None, dest=0, tag=MPI_REQUEST)-        intercomm.send(None, dest=1, tag=MPI_REQUEST)-        print("Sent termination requests to the workers.")+        # Send termination requests to all remote ranks+        try:+            for dest in range(intercomm.Get_remote_size()):+                intercomm.send(None, dest=dest, tag=MPI_REQUEST)+            print("Sent termination requests to the workers.", flush=True)+        except Exception as e:+            print(f"Termination request send error: {e}", flush=True)

87-96:Fix error messages: it's Lookup/Connect, not Publish.

Message text is misleading when Connect fails.

-    except MPI.Exception as e:-        print(f"Error publishing port name: {e}")-        raise e-    except Exception as e:-        print(f"Unexpected error publishing port name: {e}")-        raise e+    except MPI.Exception as e:+        print(f"Error looking up or connecting to port name: {e}", flush=True)+        raise+    except Exception as e:+        print(f"Unexpected error looking up or connecting to port name: {e}", flush=True)+        raise

381-384:Preserve exception details in MPI except block.

-        except MPI.Exception as e:-            print(f"MPI Error")-            raise e+        except MPI.Exception as e:+            print(f"MPI Error: {e}", flush=True)+            raise

485-488:Preserve exception details in MPI except block.

-        except MPI.Exception as e:-            print(f"MPI Error")-            raise e+        except MPI.Exception as e:+            print(f"MPI Error: {e}", flush=True)+            raise

25-36:Optional: unique service name per test to avoid cross-test collisions.

Derive a unique service name (e.g., using PID/UUID) instead of hard-coding'my_port'.

Example:

-MPI.Publish_name('my_port', port_name)+service_name = f"trtllm_port_{os.getpid()}_{int(asyncio.get_running_loop().time()*1e9)}"+MPI.Publish_name(service_name, port_name)

Propagateservice_name to lookup sites.

@dominicshanshan
Copy link
CollaboratorAuthor

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #17639 [ run ] triggered by Bot

NVIDIA#7159)Signed-off-by: Hui Gao <huig@nvidia.com>Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
Signed-off-by: qqiao <qqiao@nvidia.com>Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…aph_config.max_batch_size is not provided in trtllm-bench (NVIDIA#7031)Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com>Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…te (NVIDIA#7225)* Why?Some models (e.g. anything produced by Mistral) can have both shardedsafetensors and a consolidated safetensor in the same checkpointdirectory. In such cases, prefetching both to memory is a waste of time,and memory.* What?This commit skips over consolidated safetensors when they are not theonly safetensor file present in the checkpoint directory.Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…VIDIA#7265)Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…DIA#7267)Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…er (NVIDIA#7245)Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
@dominicshanshan
Copy link
CollaboratorAuthor

/bot run --skip-test

@tensorrt-cicd
Copy link
Collaborator

PR_Github #17966 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #17966 [ run ] completed with stateSUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13466(Partly Tested) completed with status: 'SUCCESS'

venkywonkaand others added6 commitsSeptember 7, 2025 22:16
… for nemotron-nas (NVIDIA#7231)Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…VIDIA#7271)Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>Co-authored-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…a_params_from_requests (NVIDIA#7203)Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…uracy issues (NVIDIA#7170)Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
@dominicshanshan
Copy link
CollaboratorAuthor

/bot skip --comment "Drop erroneous change"

@tensorrt-cicd
Copy link
Collaborator

PR_Github #17981 [ skip ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #17981 [ skip ] completed with stateSUCCESS
Skipping testing for commit74fc47c

@nv-guomingznv-guomingz merged commitc9dca69 intoNVIDIA:mainSep 8, 2025
5 checks passed
jmydurant added a commit to jmydurant/TensorRT-LLM that referenced this pull requestSep 9, 2025
Signed-off-by: Nave Assaf <nassaf@nvidia.com>Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>Signed-off-by: qqiao <qqiao@nvidia.com>Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>Signed-off-by: Bo Deng <deemod@nvidia.com>Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>Signed-off-by: Yifei Zhang <219273404+yifeizhang-c@users.noreply.github.com>Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>Signed-off-by: Pamela <179191831+pamelap-nvidia@users.noreply.github.com>Signed-off-by: Hui Gao <huig@nvidia.com>Signed-off-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com>Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>Signed-off-by: Michal Guzek <mguzek@nvidia.com>Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com>Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>Co-authored-by: Nave Assaf <55059536+Naveassaf@users.noreply.github.com>Co-authored-by: Yechan Kim <161688079+yechank-nvidia@users.noreply.github.com>Co-authored-by: brb-nv <169953907+brb-nv@users.noreply.github.com>Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>Co-authored-by: Emma Qiao <qqiao@nvidia.com>Co-authored-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>Co-authored-by: Bo Deng <deemod@nvidia.com>Co-authored-by: Jin Li <59594262+liji-nv@users.noreply.github.com>Co-authored-by: yifeizhang-c <219273404+yifeizhang-c@users.noreply.github.com>Co-authored-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>Co-authored-by: Erin <14718778+hchings@users.noreply.github.com>Co-authored-by: chenfeiz0326 <chenfeiz@nvidia.com>Co-authored-by: ChristinaZ <83400082+ChristinaZ@users.noreply.github.com>Co-authored-by: Venky <23023424+venkywonka@users.noreply.github.com>Co-authored-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>Co-authored-by: HuiGao-NV <huig@nvidia.com>Co-authored-by: milesial <milesial@users.noreply.github.com>Co-authored-by: Shi Xiaowei <39303645+Shixiaowei02@users.noreply.github.com>Co-authored-by: Michal Guzek <moraxu@users.noreply.github.com>Co-authored-by: peaceh-nv <103117813+peaceh-nv@users.noreply.github.com>Co-authored-by: Guoming Zhang <137257613+nv-guomingz@users.noreply.github.com>Co-authored-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>Co-authored-by: pcastonguay <55748270+pcastonguay@users.noreply.github.com>Co-authored-by: ruodil <200874449+ruodil@users.noreply.github.com>Co-authored-by: Linda <57756729+Linda-Stadter@users.noreply.github.com>Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>Co-authored-by: Jiagan Cheng <jiaganc@nvidia.com>Co-authored-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>Co-authored-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>Co-authored-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Wong4j pushed a commit to Wong4j/TensorRT-LLM that referenced this pull requestSep 20, 2025
Signed-off-by: Nave Assaf <nassaf@nvidia.com>Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>Signed-off-by: qqiao <qqiao@nvidia.com>Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>Signed-off-by: Bo Deng <deemod@nvidia.com>Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>Signed-off-by: Yifei Zhang <219273404+yifeizhang-c@users.noreply.github.com>Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>Signed-off-by: Pamela <179191831+pamelap-nvidia@users.noreply.github.com>Signed-off-by: Hui Gao <huig@nvidia.com>Signed-off-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com>Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>Signed-off-by: Michal Guzek <mguzek@nvidia.com>Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com>Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>Co-authored-by: Nave Assaf <55059536+Naveassaf@users.noreply.github.com>Co-authored-by: Yechan Kim <161688079+yechank-nvidia@users.noreply.github.com>Co-authored-by: brb-nv <169953907+brb-nv@users.noreply.github.com>Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>Co-authored-by: Emma Qiao <qqiao@nvidia.com>Co-authored-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>Co-authored-by: Bo Deng <deemod@nvidia.com>Co-authored-by: Jin Li <59594262+liji-nv@users.noreply.github.com>Co-authored-by: yifeizhang-c <219273404+yifeizhang-c@users.noreply.github.com>Co-authored-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>Co-authored-by: Erin <14718778+hchings@users.noreply.github.com>Co-authored-by: chenfeiz0326 <chenfeiz@nvidia.com>Co-authored-by: ChristinaZ <83400082+ChristinaZ@users.noreply.github.com>Co-authored-by: Venky <23023424+venkywonka@users.noreply.github.com>Co-authored-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>Co-authored-by: HuiGao-NV <huig@nvidia.com>Co-authored-by: milesial <milesial@users.noreply.github.com>Co-authored-by: Shi Xiaowei <39303645+Shixiaowei02@users.noreply.github.com>Co-authored-by: Michal Guzek <moraxu@users.noreply.github.com>Co-authored-by: peaceh-nv <103117813+peaceh-nv@users.noreply.github.com>Co-authored-by: Guoming Zhang <137257613+nv-guomingz@users.noreply.github.com>Co-authored-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>Co-authored-by: pcastonguay <55748270+pcastonguay@users.noreply.github.com>Co-authored-by: ruodil <200874449+ruodil@users.noreply.github.com>Co-authored-by: Linda <57756729+Linda-Stadter@users.noreply.github.com>Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>Co-authored-by: Jiagan Cheng <jiaganc@nvidia.com>Co-authored-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>Co-authored-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>Co-authored-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment

Reviewers

@coderabbitaicoderabbitai[bot]coderabbitai[bot] left review comments

@tongyuantongyutongyuantongyutongyuantongyu approved these changes

@byshiuebyshiuebyshiue approved these changes

@SuperjomnSuperjomnSuperjomn approved these changes

@chzblychchzblychchzblych approved these changes

@QiJuneQiJuneQiJune approved these changes

@hyuknhyuknhyukn approved these changes

@Wanli-JiangWanli-JiangWanli-Jiang approved these changes

@pcastonguaypcastonguaypcastonguay approved these changes

@FrankD412FrankD412FrankD412 approved these changes

@nv-guomingznv-guomingznv-guomingz approved these changes

@yuxianqyuxianqyuxianq approved these changes

@joyang-nvjoyang-nvjoyang-nv approved these changes

@chuangz0chuangz0Awaiting requested review from chuangz0chuangz0 is a code owner automatically assigned from NVIDIA/trt-llm-disagg-devs

@syuonisyuoniAwaiting requested review from syuonisyuoni is a code owner automatically assigned from NVIDIA/trt-llm-llmapi-devs

@yizhang-nvyizhang-nvAwaiting requested review from yizhang-nvyizhang-nv is a code owner automatically assigned from NVIDIA/trt-llm-torch-graph-compiler

Assignees

No one assigned

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

39 participants

@dominicshanshan@tensorrt-cicd@Superjomn@FrankD412@byshiue@tongyuantongyu@chzblych@QiJune@hyukn@Wanli-Jiang@pcastonguay@nv-guomingz@yuxianq@joyang-nv@zongfeijing@Naveassaf@yechank-nvidia@brb-nv@Tabrizian@EmmaQiaoCh@bo-nv@liji-nv@yifeizhang-c@amitz-nv@hchings@chenfeiz0326@ChristinaZ@venkywonka@pamelap-nvidia@HuiGao-NV@milesial@Shixiaowei02@moraxu@peaceh-nv@ruodil@Linda-Stadter@jiaganc@2ez4bz@DomBrown

[8]ページ先頭

©2009-2025 Movatter.jp