Contributor

coderabbitaibot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

tensorrt_llm/_torch/models/modeling_llama_min_latency.py (1)
1-1: Add the NVIDIA Apache‑2.0 header.
Source files must start with the NVIDIA Apache‑2.0 copyright header for 2025.
tensorrt_llm/_torch/models/modeling_llama.py (1)
1-1: Add the NVIDIA Apache‑2.0 header.
This source file should also carry the standard 2025 NVIDIA Apache‑2.0 header.

🧹 Nitpick comments (1)

tests/unittest/_torch/multi_gpu_modeling/test_llama3.py (1)
7-26: Gate and bound the integration test for portability and runtime.
Skip when there are insufficient GPUs and cap the generation length to reduce CI time; the similarity check should still hold.
+import pytest
+import torch
@@
-def test_llama_3_3():
+@pytest.mark.skipif(torch.cuda.device_count() < 4, reason="requires >=4 GPUs")
+def test_llama_3_3():
@@
-    outputs = llm.generate(prompts)
+    outputs = llm.generate(prompts, max_new_tokens=16)

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9d719dd and 582de7c.

📒 Files selected for processing (4)

tensorrt_llm/_torch/models/modeling_llama.py (2 hunks)
tensorrt_llm/_torch/models/modeling_llama_min_latency.py (2 hunks)
tests/integration/test_lists/test-db/l0_dgx_b200.yml (1 hunks)
tests/unittest/_torch/multi_gpu_modeling/test_llama3.py (1 hunks)

🧰 Additional context used

📓 Path-based instructions (3)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

tensorrt_llm/_torch/models/modeling_llama_min_latency.py
tensorrt_llm/_torch/models/modeling_llama.py
tests/unittest/_torch/multi_gpu_modeling/test_llama3.py

**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

tensorrt_llm/_torch/models/modeling_llama_min_latency.py
tensorrt_llm/_torch/models/modeling_llama.py
tests/unittest/_torch/multi_gpu_modeling/test_llama3.py

**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

tensorrt_llm/_torch/models/modeling_llama_min_latency.py
tensorrt_llm/_torch/models/modeling_llama.py
tests/unittest/_torch/multi_gpu_modeling/test_llama3.py

🧠 Learnings (4)

📚 Learning: 2025-09-09T09:40:45.658Z

Learnt from: fredricz-20070104
PR: NVIDIA/TensorRT-LLM#7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

Applied to files:

tests/integration/test_lists/test-db/l0_dgx_b200.yml
tests/unittest/_torch/multi_gpu_modeling/test_llama3.py

📚 Learning: 2025-07-28T17:06:08.621Z

Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

tests/integration/test_lists/test-db/l0_dgx_b200.yml
tests/unittest/_torch/multi_gpu_modeling/test_llama3.py

📚 Learning: 2025-08-26T09:49:04.956Z

Learnt from: pengbowang-nv
PR: NVIDIA/TensorRT-LLM#7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").

Applied to files:

tests/integration/test_lists/test-db/l0_dgx_b200.yml
tests/unittest/_torch/multi_gpu_modeling/test_llama3.py

📚 Learning: 2025-08-06T13:58:07.506Z

Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

tests/unittest/_torch/multi_gpu_modeling/test_llama3.py

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Pre-commit Check

🔇 Additional comments (3)

tests/integration/test_lists/test-db/l0_dgx_b200.yml (1)
18-18: LGTM: add Llama‑3.3 multi‑GPU test to B200 list.
Placement and format look correct. Scheduler de‑dupes across wildcarded lists as expected per infra behavior.
tensorrt_llm/_torch/models/modeling_llama.py (2)
561-568: Good defensive downgrade when next layer lacks input_scale.
Guarding input_scale and falling back to RESIDUAL_RMS_NORM prevents attribute errors and avoids invoking quantized fusion without a scale.
771-778: Same here: solid guard and fallback.
Consistent with the earlier block; this avoids QUANT op misuse when the next attention’s qkv_proj lacks input_scale.

tensorrt_llm/_torch/models/modeling_llama_min_latency.py (resolved)

hyukn reviewed

Sep 15, 2025

tests/unittest/_torch/multi_gpu_modeling/test_llama3.py (resolved)

tensorrt_llm/_torch/models/modeling_llama.py (outdated, resolved)

Superjomn force-pushed the fix-tp-pp-error-release branch from 582de7c to b3973a0

September 16, 2025 05:25

Collaborator, Author

Superjomn commented Sep 16, 2025

/bot run

coderabbitaibot reviewed

Contributor

coderabbitaibot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tensorrt_llm/_torch/models/modeling_llama.py (1)

561-567: Avoid mutating fusion-op state inside forward; compute a local op instead

Overwriting self.post_feed_forward_fusion_op at runtime makes the downgrade “sticky” for the layer and can silently affect later passes/configs. Use a local fusion_op derived from the availability of input_scale, and keep the member immutable after init.

Apply this diff:

-                # The next layernorm exists but it could be the last decoder layer.
-                # Adjust the scale and fusion pattern.
-                if self.next_attn is not None and (
-                        self.is_nvfp4 or self.is_fp8_quant) and hasattr(
-                            self.next_attn.qkv_proj, 'input_scale'):
-                    scale = self.next_attn.qkv_proj.input_scale
-                else:
-                    self.post_feed_forward_fusion_op = AllReduceFusionOp.RESIDUAL_RMS_NORM
-                    scale = None
+                # The next layernorm exists but it could be the last decoder layer.
+                # Adjust the scale and fusion pattern without mutating layer state.
+                fusion_op = self.post_feed_forward_fusion_op
+                scale = None
+                if self.next_attn is not None and (self.is_nvfp4 or self.is_fp8_quant):
+                    next_proj = getattr(self.next_attn, "qkv_proj", None)
+                    scale = getattr(next_proj, "input_scale", None)
+                    if scale is None:
+                        fusion_op = AllReduceFusionOp.RESIDUAL_RMS_NORM
@@
-                    allreduce_output = self.all_reduce(
+                    allreduce_output = self.all_reduce(
                         hidden_states,
                         all_reduce_params=AllReduceParams(
-                            fusion_op=self.post_feed_forward_fusion_op,
+                            fusion_op=fusion_op,
                             residual=residual,
                             norm_weight=self.next_layer_layernorm.weight,
                             scale=scale,
                             eps=self.next_layer_layernorm.variance_epsilon,
                         ))

This also covers cases where qkv_proj itself may be absent in dummy next-attn modules.

Also applies to: 588-595
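
To illustrate why the in-place overwrite is "sticky", here is a minimal, self-contained sketch. `FusionOp`, `StickyLayer`, and `LocalLayer` are hypothetical stand-ins and not TensorRT-LLM classes; only the member name `post_feed_forward_fusion_op` and the op name `RESIDUAL_RMS_NORM` come from the comment above, and the quantized op name is an assumption made for the example.

```python
from enum import Enum, auto
from typing import Optional


class FusionOp(Enum):
    """Hypothetical stand-in for AllReduceFusionOp."""
    RESIDUAL_RMS_NORM = auto()
    RESIDUAL_RMS_NORM_QUANT = auto()  # assumed name for a quantized fusion op


class StickyLayer:
    """Mutates its configured op inside forward (the pattern flagged above)."""

    def __init__(self) -> None:
        self.post_feed_forward_fusion_op = FusionOp.RESIDUAL_RMS_NORM_QUANT

    def forward(self, scale: Optional[float]) -> FusionOp:
        if scale is None:
            # The downgrade is written back to the layer and outlives this call.
            self.post_feed_forward_fusion_op = FusionOp.RESIDUAL_RMS_NORM
        return self.post_feed_forward_fusion_op


class LocalLayer:
    """Derives the op per call and leaves the configured member untouched."""

    def __init__(self) -> None:
        self.post_feed_forward_fusion_op = FusionOp.RESIDUAL_RMS_NORM_QUANT

    def forward(self, scale: Optional[float]) -> FusionOp:
        fusion_op = self.post_feed_forward_fusion_op
        if scale is None:
            fusion_op = FusionOp.RESIDUAL_RMS_NORM
        return fusion_op


sticky, local = StickyLayer(), LocalLayer()
# First call has no scale, second call has one.
sticky_ops = (sticky.forward(None), sticky.forward(0.5))
local_ops = (local.forward(None), local.forward(0.5))
```

In this sketch the second `StickyLayer.forward` call still returns the downgraded op even though a scale is available, while `LocalLayer` returns the quantized op again.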

♻️ Duplicate comments (1)

tensorrt_llm/_torch/models/modeling_llama.py (1)

771-777: Same state-mutation issue on the Llama (non‑4) path

Replicate the fix here to keep post_mlp_fusion_op immutable during forward.

Apply this diff:

-                # The next layernorm exists but it could be the last decoder layer.
-                # Adjust the scale and fusion pattern.
-                if self.next_attn is not None and (
-                        self.is_nvfp4 or self.is_fp8_quant) and hasattr(
-                            self.next_attn.qkv_proj, 'input_scale'):
-                    scale = self.next_attn.qkv_proj.input_scale
-                else:
-                    self.post_mlp_fusion_op = AllReduceFusionOp.RESIDUAL_RMS_NORM
-                    scale = None
+                # The next layernorm exists but it could be the last decoder layer.
+                # Adjust the scale and fusion pattern without mutating layer state.
+                fusion_op = self.post_mlp_fusion_op
+                scale = None
+                if self.next_attn is not None and (self.is_nvfp4 or self.is_fp8_quant):
+                    next_proj = getattr(self.next_attn, "qkv_proj", None)
+                    scale = getattr(next_proj, "input_scale", None)
+                    if scale is None:
+                        fusion_op = AllReduceFusionOp.RESIDUAL_RMS_NORM
@@
-                all_reduce_output = self.all_reduce(
+                all_reduce_output = self.all_reduce(
                     hidden_states,
                     all_reduce_params=AllReduceParams(
-                        fusion_op=self.post_mlp_fusion_op,
+                        fusion_op=fusion_op,
                         residual=residual,
                         norm_weight=self.next_layer_layernorm.weight,
                         scale=scale,
                         eps=self.next_layer_layernorm.variance_epsilon,
                     ))

Also applies to: 779-787

🧹 Nitpick comments (1)

tensorrt_llm/_torch/models/modeling_llama.py (1)
1-1: License header missing (pre‑existing)
Per repo guidelines, prepend the NVIDIA Apache‑2.0 header with the current year. Not blocking this PR; consider a follow‑up sweep.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 582de7c and b3973a0.

📒 Files selected for processing (4)

tensorrt_llm/_torch/models/modeling_llama.py (2 hunks)
tensorrt_llm/_torch/models/modeling_llama_min_latency.py (2 hunks)
tests/integration/test_lists/test-db/l0_dgx_b200.yml (1 hunks)
tests/unittest/_torch/multi_gpu_modeling/test_llama3.py (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (2)

tensorrt_llm/_torch/models/modeling_llama_min_latency.py
tests/unittest/_torch/multi_gpu_modeling/test_llama3.py

🧰 Additional context used

📓 Path-based instructions (3)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

tensorrt_llm/_torch/models/modeling_llama.py

**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

tensorrt_llm/_torch/models/modeling_llama.py

**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

tensorrt_llm/_torch/models/modeling_llama.py

🧠 Learnings (3)

📚 Learning: 2025-07-28T17:06:08.621Z

Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

tests/integration/test_lists/test-db/l0_dgx_b200.yml

📚 Learning: 2025-09-09T09:40:45.658Z

Learnt from: fredricz-20070104
PR: NVIDIA/TensorRT-LLM#7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

Applied to files:

tests/integration/test_lists/test-db/l0_dgx_b200.yml

📚 Learning: 2025-08-26T09:49:04.956Z

Learnt from: pengbowang-nv
PR: NVIDIA/TensorRT-LLM#7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").

Applied to files:

tests/integration/test_lists/test-db/l0_dgx_b200.yml

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Pre-commit Check

🔇 Additional comments (1)

tests/integration/test_lists/test-db/l0_dgx_b200.yml (1)
18-18: LGTM: adds focused Llama‑3.3 multi‑GPU unittest to the B200 pool
Placement and node-id format are consistent with existing entries; the test should run on 4 GPUs (TP2 + PP2) as intended. Please confirm the model asset path is available on B200 CI workers to avoid skips.

hyukn approved these changes

Collaborator

hyukn left a comment

LGTM.

litaotju approved these changes

Superjomn force-pushed the fix-tp-pp-error-release branch from b3973a0 to 2a48811

September 16, 2025 11:53

Collaborator, Author

Superjomn commented Sep 16, 2025

/bot run

Superjomn enabled auto-merge (squash)

September 16, 2025 11:54

Superjomn added the Release Blocker label (PRs that block the final release build or branching out the release branch)

Collaborator

tensorrt-cicd commented Sep 16, 2025

PR_Github #18791 [ run ] triggered by Bot

coderabbitaibot reviewed

Contributor

coderabbitaibot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)

tests/unittest/_torch/multi_gpu_modeling/test_llama3.py (1)
7-7: On combining with existing LLM API tests
Echoing the earlier thread: consolidating with test_llm_api_pytorch’s Llama‑3.3 suite is fine to do in a follow‑up, as agreed.
If you want to confirm current coverage overlap quickly:
#!/bin/bash
# Find any existing Llama-3.3 tests in PyTorch API suites
rg -nP 'Llama[-_ ]?3\.3|Llama3_3' -g 'tests/**/test_llm_api_pytorch*.py' -C2
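
The search pattern can be sanity-checked from Python before running it against the tree. This is a small sketch using only the standard `re` module; the sample strings are made up for illustration and do not come from the repository.

```python
import re

# Same alternation as the rg command above: "Llama", an optional
# separator (-, _ or space), then "3.3"; or the identifier form "Llama3_3".
LLAMA_33 = re.compile(r"Llama[-_ ]?3\.3|Llama3_3")

samples = [
    "Llama-3.3-70B-Instruct-FP8",  # matches via the first branch
    "class TestLlama3_3:",         # matches via the second branch
    "Llama-3.1-8B",                # no match: different minor version
    "Llama 3x3",                   # no match: the dot is escaped
]
matches = [bool(LLAMA_33.search(sample)) for sample in samples]
```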

🧹 Nitpick comments (2)

tests/unittest/_torch/multi_gpu_modeling/test_llama3.py (2)
15-17: Prompt choice risk: reduce time‑sensitive/non‑deterministic content
“The president of the United States is” can drift. Consider more stable prompts (math/facts) or keep but rely on robust assertions below.
19-26: Determinism and robustness: constrain decoding and assert on signals, not long text
Constrain decoding for reproducibility and speed; prefer short, robust assertions over fuzzy long‑text similarity.
Apply this diff:
-    outputs = llm.generate(prompts)
+    outputs = llm.generate(
+        prompts,
+        max_new_tokens=24,
+        temperature=0.0,
+        top_k=1,
+        random_seed=1234,
+    )
-    expected_outputs = [
-        " a city of romance, art, fashion, and cuisine. Paris, also known as the City of Light, is a must-visit destination for anyone interested in",
-        " the head of state and head of government of the United States. The president is also the commander-in-chief of the armed forces. The president is elected by the",
-    ]
-    for i, output in enumerate(outputs):
-        assert similar(output.outputs[0].text, expected_outputs[i])
+    for i, res in enumerate(outputs):
+        text = res.outputs[0].text.lower()
+        if i == 0:
+            assert "paris" in text
+        else:
+            assert "president" in text and "united states" in text
Additionally, if you adopt the above, drop the now-unused similar import.
-from utils.util import similar
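
The keyword-style assertion suggested above could be factored into a small helper. This is only a sketch: `contains_all` is a hypothetical name, not an existing utility in the test suite, and the sample strings are shortened from the expected outputs quoted in the diff.

```python
from typing import Iterable


def contains_all(text: str, keywords: Iterable[str]) -> bool:
    """Return True if every keyword occurs in text, ignoring case."""
    lowered = text.lower()
    return all(keyword.lower() in lowered for keyword in keywords)


# Shortened copies of the expected outputs quoted in the review.
paris_text = " a city of romance, art, fashion, and cuisine. Paris, also known as the City of Light"
president_text = " the head of state and head of government of the United States. The president is also"

paris_ok = contains_all(paris_text, ["paris"])
president_ok = contains_all(president_text, ["president", "united states"])
```

Checking for a few keywords tolerates small wording changes in the generated text, at the cost of not detecting degraded output that still happens to contain those words.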

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b3973a0 and 2a48811.

📒 Files selected for processing (4)

tensorrt_llm/_torch/models/modeling_llama.py (2 hunks)
tensorrt_llm/_torch/models/modeling_llama_min_latency.py (2 hunks)
tests/integration/test_lists/test-db/l0_dgx_b200.yml (1 hunks)
tests/unittest/_torch/multi_gpu_modeling/test_llama3.py (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (3)

tensorrt_llm/_torch/models/modeling_llama_min_latency.py
tests/integration/test_lists/test-db/l0_dgx_b200.yml
tensorrt_llm/_torch/models/modeling_llama.py

🧰 Additional context used

📓 Path-based instructions (3)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

tests/unittest/_torch/multi_gpu_modeling/test_llama3.py

**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

tests/unittest/_torch/multi_gpu_modeling/test_llama3.py

**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

tests/unittest/_torch/multi_gpu_modeling/test_llama3.py

🧠 Learnings (3)

📚 Learning: 2025-07-28T17:06:08.621Z

Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

tests/unittest/_torch/multi_gpu_modeling/test_llama3.py

📚 Learning: 2025-09-09T09:40:45.658Z

Learnt from: fredricz-20070104
PR: NVIDIA/TensorRT-LLM#7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

Applied to files:

tests/unittest/_torch/multi_gpu_modeling/test_llama3.py

📚 Learning: 2025-08-06T13:58:07.506Z

Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

tests/unittest/_torch/multi_gpu_modeling/test_llama3.py

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Pre-commit Check

🔇 Additional comments (1)

tests/unittest/_torch/multi_gpu_modeling/test_llama3.py (1)
1-26: No action required: the test is scheduled only on DGX B200
An rg search shows the test only in tests/integration/test_lists/test-db/l0_dgx_b200.yml:18.

PR_Github #18896 [ run ] triggered by Bot

Superjomn force-pushed the fix-tp-pp-error-release branch from 2a48811 to 8ee015a

September 17, 2025 05:28

coderabbitaibot reviewed

Sep 17, 2025

Contributor

coderabbitaibot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

tensorrt_llm/_torch/models/modeling_llama.py (2)

1-1: Add NVIDIA Apache-2.0 header (2025).

This source file is missing the required license header per guidelines.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
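
A header rule like this lends itself to a mechanical check. The sketch below is hypothetical and is not the repository's pre-commit hook: `requires_header` and `has_header` are illustrative names, the suffix set is copied from the path-based instruction quoted in this thread, and the exemption for files under tests/ follows the learning recorded from PR #6487.

```python
from pathlib import PurePosixPath

# First line of the header quoted above; the year is taken from the review.
HEADER_FIRST_LINE = "# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved."
SOURCE_SUFFIXES = {".h", ".hpp", ".hh", ".hxx", ".cpp", ".cxx", ".cc", ".cu", ".cuh", ".py"}


def requires_header(path: str) -> bool:
    """Source files need the header; files under a tests/ directory do not."""
    p = PurePosixPath(path)
    return p.suffix in SOURCE_SUFFIXES and "tests" not in p.parts


def has_header(source: str) -> bool:
    """Check only that the file text starts with the copyright line."""
    return source.startswith(HEADER_FIRST_LINE)
```

A real check would also need to allow for a shebang or encoding line before the header, which this sketch ignores. Note that C and C++ files would use a different comment syntax, so `has_header` as written only applies to Python-style headers.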

561-568: Downgrade fusion op when input_scale is absent to avoid QUANT op with None scale.

Currently, if next_attn.qkv_proj lacks input_scale, scale becomes None while the fusion op may remain QUANT_*, which is risky and likely incorrect.

-                if self.next_attn is not None and (self.is_nvfp4
-                                                   or self.is_fp8_quant):
-                    scale = self.next_attn.qkv_proj.input_scale if hasattr(
-                        self.next_attn.qkv_proj, 'input_scale') else None
+                if self.next_attn is not None and (self.is_nvfp4
+                                                   or self.is_fp8_quant):
+                    if hasattr(self.next_attn.qkv_proj, 'input_scale'):
+                        scale = self.next_attn.qkv_proj.input_scale
+                    else:
+                        # No per-layer input scale available for the next QKV (e.g., dummy Linear under TP+PP last PP stage)
+                        self.post_feed_forward_fusion_op = AllReduceFusionOp.RESIDUAL_RMS_NORM
+                        scale = None
                 else:
                     self.post_feed_forward_fusion_op = AllReduceFusionOp.RESIDUAL_RMS_NORM
                     scale = None

♻️ Duplicate comments (3)

tests/unittest/_torch/multi_gpu_modeling/test_llama3.py (3)
1-5: Lazy-load TRT‑LLM; add pytest/torch; avoid heavy top‑level import.
Prevents collection-time failures and aligns with import style guidance. (Previously raised.)
-from utils.llm_data import llm_models_root
-from utils.util import similar
-
-from tensorrt_llm import LLM
+import pytest
+import torch
+from utils.llm_data import llm_models_root
7-14: Guard for TRT‑LLM availability, GPUs, and model dir; then import and build.
Skip cleanly if environment cannot satisfy TP=2, PP=2 (needs 4 GPUs). (Previously raised.)
 def test_llama_3_3():
     model_dir = llm_models_root(
     ) / "llama-3.3-models" / "Llama-3.3-70B-Instruct-FP8"
     tp = 2
     pp = 2
-    llm = LLM(model_dir, tensor_parallel_size=tp, pipeline_parallel_size=pp)
+    try:
+        import tensorrt_llm as trtllm
+    except Exception as e:
+        pytest.skip(f"tensorrt-llm unavailable: {e}")
+    required = tp * pp
+    if not torch.cuda.is_available() or torch.cuda.device_count() < required:
+        found = torch.cuda.device_count() if torch.cuda.is_available() else 0
+        pytest.skip(f"requires >= {required} GPUs; found {found}")
+    if not model_dir.exists():
+        pytest.skip(f"model dir not found: {model_dir}")
+
+    llm = trtllm.LLM(
+        model_dir,
+        tensor_parallel_size=tp,
+        pipeline_parallel_size=pp,
+    )
1-26: Move TRT-LLM import into the test and add GPU/model checks.
Top-level from tensorrt_llm import LLM found at tests/unittest/_torch/multi_gpu_modeling/test_llama3.py:4; no torch.cuda/pytest skip or model_dir.exists() check. Change required:
Move the tensorrt_llm import into test_llama_3_3() (lazy import) so collection doesn't trigger heavy runtime.
At the start of the test, perform runtime guards and skip when unmet (e.g., compute required_gpus = tp * pp; if not torch.cuda.is_available() or torch.cuda.device_count() < required_gpus: pytest.skip(...); if not model_dir.exists(): pytest.skip(...)).
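
The runtime guards described above can be kept testable by separating the decision from the skip call. The following is a sketch under assumptions: `skip_reason` is a hypothetical helper, and the GPU count is passed in as a plain integer so the logic can be exercised without CUDA.

```python
from pathlib import Path
from typing import Optional


def skip_reason(tp: int, pp: int, gpu_count: int, model_dir: Path) -> Optional[str]:
    """Return a skip message if the environment cannot run the test, else None."""
    required_gpus = tp * pp  # one GPU per (tensor, pipeline) rank
    if gpu_count < required_gpus:
        return f"requires >= {required_gpus} GPUs; found {gpu_count}"
    if not model_dir.exists():
        return f"model dir not found: {model_dir}"
    return None


# TP=2, PP=2 needs 4 GPUs; the current directory stands in for a model dir.
too_few = skip_reason(2, 2, gpu_count=2, model_dir=Path("."))
runnable = skip_reason(2, 2, gpu_count=4, model_dir=Path("."))
```

Inside the test, the caller would pass something like `torch.cuda.device_count()` for `gpu_count` and call `pytest.skip(reason)` when the result is not `None`.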

🧹 Nitpick comments (3)

tensorrt_llm/_torch/models/modeling_llama.py (1)

561-565: Add brief inline context for hasattr guard.
Note that next_attn.qkv_proj can be a dummy Linear (skip_forward) on last PP rank so FP8QDQ monkey-patch doesn’t attach input_scale.

tests/unittest/_torch/multi_gpu_modeling/test_llama3.py (2)

21-26: Deflake assertions: avoid content that shifts over time; assert non‑empty outputs.

Use stable checks to reduce CI flakes in release pipeline.

-    expected_outputs = [
-        " a city of romance, art, fashion, and cuisine. Paris, also known as the City of Light, is a must-visit destination for anyone interested in",
-        " the head of state and head of government of the United States. The president is also the commander-in-chief of the armed forces. The president is elected by the",
-    ]
-    for i, output in enumerate(outputs):
-        assert similar(output.outputs[0].text, expected_outputs[i])
+    assert len(outputs) == len(prompts)
+    for out in outputs:
+        text = out.outputs[0].text
+        assert isinstance(text, str) and len(text) > 0

7-26: Optional: keep a minimal, deterministic check.

If you prefer a content check, change the second prompt to a stable fact (e.g., math) and assert a short prefix.

-    prompts = [
-        "The capital of France is",
-        "The president of the United States is",
-    ]
+    prompts = [
+        "The capital of France is",
+        "The square root of 144 is",
+    ]
@@
-    outputs = llm.generate(prompts)
+    outputs = llm.generate(prompts)
@@
-    expected_outputs = [
-        " a city of romance, art, fashion, and cuisine. Paris, also known as the City of Light, is a must-visit destination for anyone interested in",
-        " the head of state and head of government of the United States. The president is also the commander-in-chief of the armed forces. The president is elected by the",
-    ]
-    for i, output in enumerate(outputs):
-        assert similar(output.outputs[0].text, expected_outputs[i])
+    assert outputs[0].outputs[0].text.lstrip().startswith("Paris")
+    assert outputs[1].outputs[0].text.strip().startswith("12")

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2a48811 and 8ee015a.

📒 Files selected for processing (4)

tensorrt_llm/_torch/models/modeling_llama.py (2 hunks)
tensorrt_llm/_torch/models/modeling_llama_min_latency.py (2 hunks)
tests/integration/test_lists/test-db/l0_dgx_b200.yml (1 hunks)
tests/unittest/_torch/multi_gpu_modeling/test_llama3.py (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (2)

tensorrt_llm/_torch/models/modeling_llama_min_latency.py
tests/integration/test_lists/test-db/l0_dgx_b200.yml

🧰 Additional context used

📓 Path-based instructions (3)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

tensorrt_llm/_torch/models/modeling_llama.py
tests/unittest/_torch/multi_gpu_modeling/test_llama3.py

**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

tensorrt_llm/_torch/models/modeling_llama.py
tests/unittest/_torch/multi_gpu_modeling/test_llama3.py

**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

tensorrt_llm/_torch/models/modeling_llama.py
tests/unittest/_torch/multi_gpu_modeling/test_llama3.py

🧠 Learnings (6)

📚 Learning: 2025-07-28T17:06:08.621Z

Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

tests/unittest/_torch/multi_gpu_modeling/test_llama3.py

📚 Learning: 2025-09-09T09:40:45.658Z

Learnt from: fredricz-20070104
PR: NVIDIA/TensorRT-LLM#7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

Applied to files:

tests/unittest/_torch/multi_gpu_modeling/test_llama3.py

📚 Learning: 2025-08-06T13:58:07.506Z

Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

tests/unittest/_torch/multi_gpu_modeling/test_llama3.py

📚 Learning: 2025-08-29T14:07:45.863Z

Learnt from: EmmaQiaoCh
PR: NVIDIA/TensorRT-LLM#7370
File: tests/unittest/trt/model_api/test_model_quantization.py:24-27
Timestamp: 2025-08-29T14:07:45.863Z
Learning: In TensorRT-LLM's CI infrastructure, pytest skip markers (pytest.mark.skip) are properly honored even when test files have __main__ blocks that call test functions directly. The testing system correctly skips tests without requiring modifications to the __main__ block execution pattern.

Applied to files:

tests/unittest/_torch/multi_gpu_modeling/test_llama3.py

📚 Learning: 2025-08-26T09:49:04.956Z

Learnt from: pengbowang-nv
PR: NVIDIA/TensorRT-LLM#7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").

Applied to files:

tests/unittest/_torch/multi_gpu_modeling/test_llama3.py

📚 Learning: 2025-08-11T20:09:24.389Z

Learnt from: achartier
PR: NVIDIA/TensorRT-LLM#6763
File: tests/integration/defs/triton_server/conftest.py:16-22
Timestamp: 2025-08-11T20:09:24.389Z
Learning: In the TensorRT-LLM test infrastructure, the team prefers simple, direct solutions (like hard-coding directory traversal counts) over more complex but robust approaches when dealing with stable directory structures. They accept the maintenance cost of updating tests if the layout changes.

Applied to files:

tests/unittest/_torch/multi_gpu_modeling/test_llama3.py

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Pre-commit Check

tensorrt_llm/_torch/models/modeling_llama.py

Collaborator Author

Superjomn commented Sep 17, 2025

/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-DeepSeek-2"

Collaborator Author

Superjomn commented Sep 17, 2025

/bot run

Collaborator

tensorrt-cicd commented Sep 17, 2025

PR_Github #18942 [ run ] triggered by Bot

Superjomn force-pushed the fix-tp-pp-error-release branch from 8ee015a to 63e9c6b

September 17, 2025 08:31

Collaborator Author

Superjomn commented Sep 17, 2025

/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-DeepSeek-2,GB200-4_GPUs-PyTorch-1"

Collaborator

tensorrt-cicd commented Sep 17, 2025

PR_Github #18948 [ run ] triggered by Bot

Collaborator

tensorrt-cicd commented Sep 17, 2025

PR_Github #18942 [ run ] completed with state ABORTED
LLM/release-1.0/L0_MergeRequest_PR #409 (Blue Ocean) completed with status: ABORTED

coderabbitai bot reviewed

Sep 17, 2025

Contributor

coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (3)

tests/unittest/_torch/multi_gpu_modeling/test_llama3.py (3)
7-7: Follow‑up: potential consolidation with test_llm_api_pytorch
Per hyukn’s note, consider merging/parametrizing with TestLlama3_3_70BInstruct later. Fine to defer to a subsequent PR.
1-5: Lazy‑load TRT‑LLM; add pytest/torch; avoid top‑level heavy import
Importing LLM at module import time can explode before we can skip; also prefer module‑namespace import. Move TRT‑LLM import inside the test and add pytest/torch imports here.
-from utils.llm_data import llm_models_root
-from utils.util import similar
-
-from tensorrt_llm import LLM
+import pytest
+import torch
+from utils.llm_data import llm_models_root
+from utils.util import similar
7-13: Guard for TRT‑LLM availability, GPUs, and model dir; use module namespace
Skip cleanly when deps/resources are missing; import tensorrt_llm lazily; and use trtllm.LLM.
 def test_llama_3_3():
     model_dir = llm_models_root(
     ) / "llama-3.3-models" / "Llama-3.3-70B-Instruct-FP8"
     tp = 2
     pp = 2
-    llm = LLM(model_dir, tensor_parallel_size=tp, pipeline_parallel_size=pp)
+    # Lazy import so we can skip instead of erroring during collection
+    try:
+        import tensorrt_llm as trtllm
+    except Exception as e:
+        pytest.skip(f"tensorrt-llm unavailable: {e}")
+
+    required = tp * pp
+    if not torch.cuda.is_available() or torch.cuda.device_count() < required:
+        found = torch.cuda.device_count() if torch.cuda.is_available() else 0
+        pytest.skip(f"requires >= {required} GPUs; found {found}")
+    if not model_dir.exists():
+        pytest.skip(f"model dir not found: {model_dir}")
+
+    llm = trtllm.LLM(model_dir, tensor_parallel_size=tp, pipeline_parallel_size=pp)
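
The resource guard suggested above can be factored into a small pure function, which keeps the skip logic testable on machines without GPUs. This is a sketch under assumptions: `skip_reason` is a hypothetical helper, not something in the repository, and the GPU count is passed in rather than queried.

```python
from pathlib import Path
from typing import Optional


def skip_reason(required_gpus: int, available_gpus: int,
                model_dir: Path) -> Optional[str]:
    """Return a human-readable skip reason, or None when the test can run."""
    if available_gpus < required_gpus:
        return f"requires >= {required_gpus} GPUs; found {available_gpus}"
    if not model_dir.exists():
        return f"model dir not found: {model_dir}"
    return None


# tp * pp ranks are needed in total, e.g. 2 * 2 = 4 GPUs.
tp, pp = 2, 2
reason = skip_reason(tp * pp, available_gpus=1, model_dir=Path("."))
print(reason)  # requires >= 4 GPUs; found 1
```

Inside the test, the count would come from `torch.cuda.device_count()`, and a non-None result would be handed to `pytest.skip(reason)`.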

🧹 Nitpick comments (4)

tests/unittest/_torch/multi_gpu_modeling/test_llama3.py (4)
14-17: Use time‑insensitive prompts to avoid fragile assertions
“The president of the United States is” is time‑variant. Prefer stable facts.
-    prompts = [
-        "The capital of France is",
-        "The president of the United States is",
-    ]
+    prompts = [
+        "The capital of France is",
+        "The capital of Japan is",
+    ]
19-19: Make decoding deterministic and cheaper
Set a small max token budget and disable sampling (or temperature=0) to stabilize CI and reduce memory.
-    outputs = llm.generate(prompts)
+    # If API differs, use the repo's SamplingParams equivalent with temperature=0 and max_new_tokens≈16
+    outputs = llm.generate(prompts, max_new_tokens=16, temperature=0.0)
If generate() doesn’t accept these kwargs, switch to your SamplingParams/GenerationConfig and pass it in. Please confirm the exact parameter names used in this repo.
21-26: Avoid brittle full‑text matches; assert key facts with diagnostics
Long expected blobs are unstable across model/builds. Check for essential substrings and add a length check plus helpful failure messages.
-    expected_outputs = [
-        " a city of romance, art, fashion, and cuisine. Paris, also known as the City of Light, is a must-visit destination for anyone interested in",
-        " the head of state and head of government of the United States. The president is also the commander-in-chief of the armed forces. The president is elected by the",
-    ]
-    for i, output in enumerate(outputs):
-        assert similar(output.outputs[0].text, expected_outputs[i])
+    expected_contains = ["Paris", "Tokyo"]
+    assert len(outputs) == len(prompts)
+    for i, out in enumerate(outputs):
+        text = out.outputs[0].text
+        assert expected_contains[i].lower() in text.lower(), (
+            f"Output[{i}] missing '{expected_contains[i]}'. Got: {text[:160]!r}"
+        )
26-26: Release resources explicitly (optional)
Free the session aggressively to reduce GPU memory pressure on CI runners.
-        assert similar(output.outputs[0].text, expected_outputs[i])
+        ...
+    # Optional explicit cleanup
+    del llm
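
Why disabling sampling stabilizes the test can be seen in a toy token picker. This is an illustration only, not TensorRT-LLM's sampler: `pick_token` is a hypothetical function, and treating a temperature of 0 as greedy (argmax) decoding is an assumption about the convention being suggested.

```python
import math
import random


def pick_token(logits, temperature, rng):
    """Pick a token index from raw logits.

    A temperature of 0 is treated as greedy decoding (argmax); otherwise the
    logits are scaled, softmaxed, and sampled with the given RNG.
    """
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [0.1, 2.0, 0.5]
# Greedy decoding ignores the RNG, so every seed yields the same index.
greedy = {pick_token(logits, 0.0, random.Random(seed)) for seed in range(20)}
assert greedy == {1}
# With a positive temperature the result depends on the RNG state.
sampled = [pick_token(logits, 5.0, random.Random(seed)) for seed in range(20)]
print(sorted(set(sampled)))
```

With greedy decoding, the remaining nondeterminism comes from numerics (kernels, parallel layout), which is why substring checks are still preferable to exact full-text matches.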

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8ee015a and 63e9c6b.

📒 Files selected for processing (4)

tensorrt_llm/_torch/models/modeling_llama.py (2 hunks)
tensorrt_llm/_torch/models/modeling_llama_min_latency.py (2 hunks)
tests/integration/test_lists/test-db/l0_dgx_b200.yml (1 hunks)
tests/unittest/_torch/multi_gpu_modeling/test_llama3.py (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (3)

tests/integration/test_lists/test-db/l0_dgx_b200.yml
tensorrt_llm/_torch/models/modeling_llama.py
tensorrt_llm/_torch/models/modeling_llama_min_latency.py

🔇 Additional comments (1)

tests/unittest/_torch/multi_gpu_modeling/test_llama3.py (1)
7-7: Optional: annotate as multi‑GPU heavy test
Repo search returned no GPU-related pytest markers; add an existing repo marker (e.g., @pytest.mark.requires_4gpus or @pytest.mark.slow) so CI/schedulers route this to multi‑GPU nodes. I can scan the repo and propose the exact decorator.

Collaborator

tensorrt-cicd commented Sep 17, 2025

PR_Github #18948 [ run ] completed with state SUCCESS
/LLM/release-1.0/L0_MergeRequest_PR pipeline #410 (Partly Tested) completed with status: 'FAILURE'

fix

07ac2c0

Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>

add test

Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>

up

Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>

Superjomn force-pushed the fix-tp-pp-error-release branch from 63e9c6b to 07ac2c0

PR_Github #19008 [ run ] completed with state SUCCESS
/LLM/release-1.0/L0_MergeRequest_PR pipeline #414 completed with status: 'SUCCESS'

Superjomn merged commit 2f3e3ae into NVIDIA:release/1.0

Sep 17, 2025

5 checks passed

Superjomn deleted the fix-tp-pp-error-release branch

September 18, 2025 01:45

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request

Sep 23, 2025

[https://nvbugs/5516710][fix] fix Llama 3.3 TP PP case (NVIDIA#7717)

92855ed

Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request

Sep 23, 2025

[https://nvbugs/5516710][fix] fix Llama 3.3 TP PP case (NVIDIA#7717)

76c36c0

Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request

Sep 23, 2025

[https://nvbugs/5516710][fix] fix Llama 3.3 TP PP case (NVIDIA#7717)

95b37d2

Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request

[https://nvbugs/5516710][fix] fix Llama 3.3 TP PP case (NVIDIA#7717)

d62b38a

Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request

[https://nvbugs/5516710][fix] fix Llama 3.3 TP PP case (NVIDIA#7717)

f9f86f1

Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request

[https://nvbugs/5516710][fix] fix Llama 3.3 TP PP case (NVIDIA#7717)

87575af

Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request

[https://nvbugs/5516710][fix] fix Llama 3.3 TP PP case (NVIDIA#7717)

2318b3d

Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request

[https://nvbugs/5516710][fix] fix Llama 3.3 TP PP case (NVIDIA#7717)

37544bc

Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request

[https://nvbugs/5516710][fix] fix Llama 3.3 TP PP case (NVIDIA#7717)

ee4d5ae

Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request