[#6507][fix] Fix precision issue due to KV layout mismatch for split/concat kernels (#6917)
Conversation
coderabbitai bot commented Aug 14, 2025 (edited)
📝 Walkthrough: Default KV layout for FlashInfer attention changed from "NHD" to "HND". KV cache retrieval now accepts and uses a kv_layout argument.
Sequence Diagram(s):

```mermaid
sequenceDiagram
    participant Fwd as FlashInferAttention.forward_impl
    participant Star as StarAttention.forward
    participant Meta as FlashInferAttentionMetadata
    participant KV as KVCacheManager
    Note over Fwd,Star: read metadata.kv_layout and request KV buffers with layout
    Fwd->>Meta: read kv_layout
    Fwd->>KV: get_buffers(layer_idx, kv_layout)
    Star->>Meta: read kv_layout
    Star->>KV: get_buffers(layer_idx, kv_layout)
    alt kv_layout == "NHD"
        KV-->>Fwd: buffers reshaped to [max_pages, kv_factor, tokens_per_block, num_kv_heads, head_dim]
        KV-->>Star: buffers reshaped to [max_pages, kv_factor, tokens_per_block, num_kv_heads, head_dim]
    else kv_layout == "HND"
        KV-->>Fwd: buffers reshaped to [max_pages, kv_factor, num_kv_heads, tokens_per_block, head_dim]
        KV-->>Star: buffers reshaped to [max_pages, kv_factor, num_kv_heads, tokens_per_block, head_dim]
    end
    Fwd->>Fwd: continue attention compute with returned buffers
    Star->>Star: continue attention compute with returned buffers
```

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
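To make the two branches of the diagram concrete, here is a minimal stand-alone sketch of how a `get_buffers`-style helper might pick the 5-D view shape per layout. This is an illustrative function, not the actual `KVCacheManager.get_buffers` implementation; the function name and parameters are assumptions.

```python
# Hypothetical sketch (not TensorRT-LLM code): choose a KV pool view shape
# based on the requested kv_layout string.

def kv_buffer_shape(max_pages, kv_factor, tokens_per_block, num_kv_heads,
                    head_dim, kv_layout="HND"):
    """Return the 5-D view shape for one layer's KV pool.

    "NHD": tokens (N) before heads (H); "HND": heads before tokens.
    """
    if kv_layout == "NHD":
        return (max_pages, kv_factor, tokens_per_block, num_kv_heads, head_dim)
    if kv_layout == "HND":
        return (max_pages, kv_factor, num_kv_heads, tokens_per_block, head_dim)
    raise ValueError(f"Unsupported kv_layout: {kv_layout!r}")


# Both layouts cover the same number of elements; only axes 2 and 3 swap.
print(kv_buffer_shape(128, 2, 32, 8, 64, kv_layout="HND"))  # (128, 2, 8, 32, 64)
print(kv_buffer_shape(128, 2, 32, 8, 64, kv_layout="NHD"))  # (128, 2, 32, 8, 64)
```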
Pre-merge checks and finishing touches: ❌ Failed checks (1 warning), ✅ Passed checks (2 passed).
Actionable comments posted: 1
🔭 Outside diff range comments (2)
tensorrt_llm/_torch/attention_backend/flashinfer.py (1)
1-3: Add missing 2025 NVIDIA copyright header. Per repo guidelines, Python sources must include the NVIDIA copyright header.
Apply this diff at the top of the file:
```diff
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
 import math
 import os
 import weakref
```

tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
1-4: Add missing 2025 NVIDIA copyright header. Per repo guidelines, Python sources must include the NVIDIA copyright header.
Apply this diff at the top of the file:
```diff
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
 import copy
 import enum
 import math
 from abc import ABC, abstractmethod
```
🧹 Nitpick comments (1)
tensorrt_llm/_torch/attention_backend/flashinfer.py (1)
59-61: Default kv_layout to HND aligns with split/concat kernels; consider making layout a planning key. Setting kv_layout="HND" matches the kernel expectation [max_num_pages, 2, num_kv_heads, page_size, head_dim]. One caveat: kv_layout currently isn't part of the planning key (PlanParams) but affects wrapper construction. If kv_layout changes over the lifetime of the metadata, cached wrappers could be reused with the wrong layout. Either ensure kv_layout is immutable per metadata instance or include it in the plan key.
If you opt to make the plan cache layout-aware, I can draft a small patch to add kv_layout to PlanParams and its equality key.
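A layout-aware plan cache along the lines suggested above could look like this sketch. `PlanParams` here is a hypothetical stand-in dataclass, not the actual TensorRT-LLM class; all field names and the cache shape are illustrative assumptions.

```python
# Illustrative only: making kv_layout part of the plan-cache key so a cached
# wrapper can never be reused with the wrong layout.
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanParams:
    """Stand-in for a plan key; frozen=True derives __eq__/__hash__ from all fields."""
    num_heads: int
    num_kv_heads: int
    head_dim: int
    kv_layout: str = "HND"  # included in the key


_plan_cache = {}


def get_wrapper(params: PlanParams):
    # Different kv_layout values map to different cache entries.
    if params not in _plan_cache:
        _plan_cache[params] = object()  # placeholder for a real wrapper object
    return _plan_cache[params]


hnd = get_wrapper(PlanParams(32, 8, 128, "HND"))
nhd = get_wrapper(PlanParams(32, 8, 128, "NHD"))
assert hnd is not nhd                                   # layouts never share a wrapper
assert hnd is get_wrapper(PlanParams(32, 8, 128, "HND"))  # same key reuses the wrapper
```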
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
tensorrt_llm/_torch/attention_backend/flashinfer.py (2 hunks)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/pyexecutor/resource_manager.py
tensorrt_llm/_torch/attention_backend/flashinfer.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/pyexecutor/resource_manager.py
tensorrt_llm/_torch/attention_backend/flashinfer.py
🧠 Learnings (1)
📓 Common learnings
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.208Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/attention_backend/flashinfer.py (1)
498-500: KV buffers now requested with kv_layout — LGTM. Fetching KV cache with kv_layout=metadata.kv_layout is correct and resolves the previous layout mismatch against the split/concat kernels.
Note: KVCacheManager.get_buffers is annotated to return Optional[torch.Tensor] but this site assumes a tensor (uses .dtype). Either:
- tighten the return annotation to torch.Tensor (preferred; see suggested change in resource_manager.py), or
- add a guard here:
```diff
 kv_cache = metadata.kv_cache_manager.get_buffers(
     self.layer_idx, kv_layout=metadata.kv_layout)
+assert kv_cache is not None, "KV buffers are not allocated"
```
brb-nv commented Sep 30, 2025
Requesting @chuangz0's review as an expert on split/concat kernels.
liji-nv commented Nov 4, 2025
ZhangGe6 commented Nov 4, 2025
@yuxianq I am working on polishing this PR and will update it tomorrow. Thanks for your suggestions!
ZhangGe6 commented Nov 5, 2025
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
tensorrt_llm/_torch/pyexecutor/resource_manager.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
tensorrt_llm/_torch/pyexecutor/resource_manager.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
tensorrt_llm/_torch/pyexecutor/resource_manager.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
tensorrt_llm/_torch/pyexecutor/resource_manager.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (2)
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (4)
- layer_idx (221-224), layer_idx (221-221), layer_idx (226-229), layer_idx (226-226)

cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (4)
- layer_idx (217-221), layer_idx (217-217), layer_idx (223-226), layer_idx (223-223)
yuxianq commented Nov 7, 2025
/bot run --disable-fail-fast
yuxianq commented Nov 7, 2025
@ZhangGe6 It seems that the last 3 commits have not been signed off. Please sign off all commits to pass the DCO check, thanks~
tensorrt-cicd commented Nov 7, 2025
PR_Github #23783 [ run ] triggered by Bot. Commit:
tensorrt-cicd commented Nov 7, 2025
PR_Github #23783 [ run ] completed with state
…split/concat kernels

The split/concat kernels expect KV cache in "[max_num_pages, 2, num_kv_heads, page_size, head_dim]" layout. However, the actually used KV cache layout exposed by "KVCacheManager.get_buffers()" is "[max_num_pages, 2, page_size, num_kv_heads, head_dim]". This layout mismatch causes wrong indexing for split/concat kernels, leading to incorrect transferred prefill KV cache. This patch is a quick fix for the flashinfer attn_backend.

Signed-off-by: ZhangGe6 <sjtu.zg123@gmail.com>
ZhangGe6 commented Nov 8, 2025
@yuxianq Hi, I trimmed my commits and signed them off, thanks for reminding. In addition, I modified the
yuxianq commented Nov 10, 2025
/bot run --disable-fail-fast
tensorrt-cicd commented Nov 10, 2025
PR_Github #23938 [ run ] triggered by Bot. Commit:
tensorrt-cicd commented Nov 10, 2025
PR_Github #23938 [ run ] completed with state
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
yuxianq commented Nov 11, 2025
@ZhangGe6 I have fixed some CI errors for this PR. Please allow me to push to your branch, or cherry-pick the bugfix commit from https://github.com/yuxianq/TensorRT-LLM/commits/fix_pd_diff_tp yourself; I will rerun the CI. Thanks~
ZhangGe6 commented Nov 11, 2025
@yuxianq OK, you can push to my branch directly. Feel free to remind me if there is something I can/should do.
yuxianq commented Nov 12, 2025
@ZhangGe6 I get
ZhangGe6 commented Nov 12, 2025
@yuxianq Hi, I have sent you an invitation to collaborate on my forked TensorRT-LLM repo (via "Settings -> Access -> Collaborators -> Add people"). Please accept it and try again. I'm not yet familiar with GitHub operations, so let me know if I missed anything, or I can cherry-pick your bugfix commit later (maybe tonight).
yuxianq commented Nov 12, 2025
/bot run --disable-fail-fast
yuxianq commented Nov 12, 2025
@ZhangGe6 It works. I have started to rerun the CI. Thanks~
tensorrt-cicd commented Nov 12, 2025
PR_Github #24236 [ run ] triggered by Bot. Commit:
tensorrt-cicd commented Nov 12, 2025
PR_Github #24236 [ run ] completed with state
yuxianq commented Nov 12, 2025
/bot run --disable-fail-fast
tensorrt-cicd commented Nov 12, 2025
PR_Github #24301 [ run ] triggered by Bot. Commit:
tensorrt-cicd commented Nov 12, 2025
PR_Github #24301 [ run ] completed with state
yuxianq commented Nov 13, 2025
/bot run --disable-fail-fast
tensorrt-cicd commented Nov 13, 2025
PR_Github #24354 [ run ] triggered by Bot. Commit:
tensorrt-cicd commented Nov 13, 2025
PR_Github #24354 [ run ] completed with state
Merged 49df731 into NVIDIA:main
…split/concat kernels (NVIDIA#6917)

Signed-off-by: ZhangGe6 <sjtu.zg123@gmail.com>
Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
The split/concat kernels expect KV cache in [max_num_pages, 2, num_kv_heads, page_size, head_dim] layout. However, the KV cache layout actually exposed by KVCacheManager.get_buffers() is [max_num_pages, 2, page_size, num_kv_heads, head_dim]. This layout mismatch causes wrong indexing in the split/concat kernels, leading to incorrect transferred prefill KV cache. This patch is a quick fix for the flashinfer attn_backend.

Summary by CodeRabbit
New Features
Refactor
Chores
Tests
Description
Test Coverage
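The wrong-indexing failure mode this PR fixes can be illustrated with a small self-contained sketch (`flat_index` is a hypothetical helper, not project code): the same logical (page, kv, token, head, dim) element maps to different flat offsets under the two layouts, so a kernel assuming the wrong axis order reads the wrong KV element.

```python
# Pure-Python stand-in for the KV pool: compute row-major flat offsets
# under both layouts and show they disagree for the same logical element.

def flat_index(shape, idx):
    """Row-major (C-order) flat offset of a multi-dimensional index."""
    off = 0
    for dim, i in zip(shape, idx):
        off = off * dim + i
    return off


pages, kv, toks, heads, dim = 2, 2, 4, 3, 2
hnd_shape = (pages, kv, heads, toks, dim)  # layout the split/concat kernels expect
nhd_shape = (pages, kv, toks, heads, dim)  # layout get_buffers() used to return

# Same logical element (page=1, kv=0, token=2, head=1, dim=0):
hnd_off = flat_index(hnd_shape, (1, 0, 1, 2, 0))  # head axis before token axis
nhd_off = flat_index(nhd_shape, (1, 0, 2, 1, 0))  # token axis before head axis
print(hnd_off, nhd_off)  # different offsets: indexing with the wrong layout
                         # lands on a different element of the flat buffer
```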
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run

/bot run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

- --reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- --disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.
- --disable-fail-fast (OPTIONAL): Disable fail-fast on build/tests/infra failures.
- --skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: does NOT update GitHub check status.
- --stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: does NOT update GitHub check status.
- --gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: does NOT update GitHub check status.
- --test-backend "pytorch, cpp" (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: does NOT update GitHub pipeline status.
- --only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: does NOT update GitHub check status.
- --disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: does NOT update GitHub check status.
- --add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- --post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- --detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- --debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill

Kill all running builds associated with the pull request.
skip
skip --comment COMMENT

Skip testing for the latest commit on the pull request.
--comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.