[Bug]: Managed Weights fail with NVFP4: Strict Type Check & Incorrect Allocation in v1.2.0rc4 #9503

Open
Labels: Inference runtime<NV>, Low Precision, bug
@JohnTDI-cpu

Description


System Info

When using trtllm-build with --gemm_plugin nvfp4 and managed weights (the default behavior, or explicit), the runtime fails to load the model due to a strict type mismatch between the engine's expectation (DataType::kFP4) and the safetensors container (DataType::kINT8 / kUINT8).
Additionally, bypassing the type check reveals that the BufferManager sizes its allocation from the element count, treating each element as 8-bit (1 byte/elem) rather than packed 4-bit (0.5 byte/elem), causing double the VRAM usage for weights.
Reproduction Steps

  1. Quantize Qwen 2.5/3 to NVFP4 using modelopt.
  2. Build engine with managed weights:
     trtllm-build --checkpoint_dir ... --gemm_plugin nvfp4 --output_dir ...
     (Note: managed weights are generated as .safetensors with I8 or U8 dtype, since safetensors lacks FP4 support.)
  3. Run trtllm-serve.
Observed Behavior

  1. Type Mismatch Error:
    The runtime crashes immediately in tllmRuntime.cpp:
    [TensorRT-LLM][ERROR] Assertion failed: weight->dtype() == engine.getTensorDataType(name.c_str())Weight ... has dtype INT8 but engine expects FP4
  2. Double Allocation (OOM):
    If the type assertion is removed/bypassed, the model loads but consumes ~30GB VRAM for a 30B model (expected ~15GB), indicating the allocator is not handling the packed 4-bit stride correctly for managed weights.
Proposed Fix (Analysis)

The issue is located in cpp/tensorrt_llm/runtime/tllmRuntime.cpp, inside setInputTensorsImpl. The runtime needs logic to handle NVFP4 packed weights specifically:

  • Accept INT8/UINT8 input tensors if the engine expects FP4.
  • Allocate memory using the correct packed size, not num_elements * 1 byte.
Workaround used:

We successfully ran the model by patching tllmRuntime.cpp to allocate the packed buffer manually with cudaMalloc (size / 2) and wrap it via ITensor::wrap.
Environment

  • TRT-LLM Version: v1.2.0rc4 (Docker: release:1.2.0rc4)
  • GPU: RTX 5090 (Blackwell)
  • Model: Qwen 3 30B MoE (NVFP4)

Full patch and tutorial available here:
https://github.com/JohnTDI-cpu/trtllm-nvfp4-blackwell-fix

Who can help?

@Tracin

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Steps to reproduce:

  1. Quantize model to NVFP4:

python3 examples/quantization/quantize.py \
    --model_dir Qwen/Qwen3-30B-A3B-Instruct \
    --qformat nvfp4 \
    --dtype bfloat16 \
    --output_dir ./Qwen3-30B-NVFP4-Ckpt

  2. Build engine with managed weights:

trtllm-build \
    --checkpoint_dir ./Qwen3-30B-NVFP4-Ckpt \
    --gemm_plugin nvfp4 \
    --max_batch_size 1 \
    --max_seq_len 4096 \
    --output_dir ./Qwen3-30B-NVFP4-Engine

  3. Run server:

trtllm-serve serve ./Qwen3-30B-NVFP4-Engine \
    --tokenizer Qwen/Qwen3-30B-A3B-Instruct

Error:

[TensorRT-LLM][ERROR] Assertion failed: weight->dtype() == engine.getTensorDataType(name.c_str())
(../tensorrt_llm/runtime/tllmRuntime.cpp:822)

Expected behavior

Runtime should accept INT8/UINT8 packed weights when engine expects FP4 (this is the standard safetensors representation for 4-bit packed data).

BufferManager should allocate 0.5 bytes per FP4 element, not 1 byte.

Model should load with ~15GB VRAM usage for 30B NVFP4 model.

Actual behavior

  1. Type mismatch assertion fails immediately: the runtime rejects INT8 weights for FP4 engine tensors.

  2. When assertion is bypassed, BufferManager allocates 2x expected memory (30GB instead of 15GB for 30B model).

  3. Results in OOM on 32GB GPU (RTX 5090) where model should fit comfortably.

Additional notes

Environment:

  • TRT-LLM: v1.2.0rc4 (Docker: nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc4)
  • GPU: RTX 5090 32GB (Blackwell)
  • Driver: 580.xx (CUDA 13.0)

Root cause analysis:

The issue is in cpp/tensorrt_llm/runtime/tllmRuntime.cpp, in setInputTensorsImpl, around line 820.

Workaround:

I created a patch using direct cudaMalloc to bypass the faulty allocator.
Achieves 135 tok/s, 24GB VRAM usage.

Full patch: https://github.com/JohnTDI-cpu/trtllm-nvfp4-blackwell-fix

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.
