Description
System Info
When using trtllm-build with --gemm_plugin nvfp4 and managed weights (the default behavior, or enabled explicitly), the runtime fails to load the model due to a strict type mismatch between what the engine expects (DataType::kFP4) and what the safetensors container provides (DataType::kINT8 / kUINT8).
Additionally, bypassing the type check reveals that the BufferManager sizes the allocation from the element count as if each element were 8-bit (1 byte/elem) rather than packed 4-bit (0.5 byte/elem), doubling VRAM usage for the weights.
Reproduction Steps
Quantize Qwen 2.5/3 to NVFP4 using modelopt.
Build engine with managed weights:
```shell
trtllm-build --checkpoint_dir ... --gemm_plugin nvfp4 --output_dir ...
```
(Note: Managed weights are generated as .safetensors with I8 or U8 dtype since safetensors lacks FP4 support).
Run trtllm-serve.
Observed Behavior
- Type Mismatch Error: the runtime crashes immediately in tllmRuntime.cpp:

  ```
  [TensorRT-LLM][ERROR] Assertion failed: weight->dtype() == engine.getTensorDataType(name.c_str())
  Weight ... has dtype INT8 but engine expects FP4
  ```

- Double Allocation (OOM): if the type assertion is removed/bypassed, the model loads but consumes ~30GB VRAM for a 30B model (expected ~15GB), indicating the allocator is not handling the packed 4-bit stride correctly for managed weights.
Proposed Fix (Analysis)
The issue is located in cpp/tensorrt_llm/runtime/tllmRuntime.cpp, inside setInputTensorsImpl.
The runtime needs logic to handle NVFP4 packed weights specifically:
- Accept INT8/UINT8 input tensors when the engine expects FP4.
- Allocate memory using the correct packed size (0.5 byte per element), not num_elements * 1 byte.
Workaround used:
We successfully ran the model by patching tllmRuntime.cpp to manually use cudaMalloc with size / 2 and wrapping it via ITensor::wrap.
Environment
TRT-LLM Version: v1.2.0rc4 (Docker: release:1.2.0rc4)
GPU: RTX 5090 (Blackwell)
Model: Qwen 3 30B MoE (NVFP4)
Full patch and tutorial available here:
https://github.com/JohnTDI-cpu/trtllm-nvfp4-blackwell-fix
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Steps to reproduce:
- Quantize model to NVFP4:

```shell
python3 examples/quantization/quantize.py \
    --model_dir Qwen/Qwen3-30B-A3B-Instruct \
    --qformat nvfp4 \
    --dtype bfloat16 \
    --output_dir ./Qwen3-30B-NVFP4-Ckpt
```
- Build engine with managed weights:

```shell
trtllm-build \
    --checkpoint_dir ./Qwen3-30B-NVFP4-Ckpt \
    --gemm_plugin nvfp4 \
    --max_batch_size 1 \
    --max_seq_len 4096 \
    --output_dir ./Qwen3-30B-NVFP4-Engine
```
- Run server:

```shell
trtllm-serve serve ./Qwen3-30B-NVFP4-Engine \
    --tokenizer Qwen/Qwen3-30B-A3B-Instruct
```
Error:

```
[TensorRT-LLM][ERROR] Assertion failed: weight->dtype() == engine.getTensorDataType(name.c_str())
(../tensorrt_llm/runtime/tllmRuntime.cpp:822)
```
Expected behavior
Runtime should accept INT8/UINT8 packed weights when engine expects FP4 (this is the standard safetensors representation for 4-bit packed data).
BufferManager should allocate 0.5 bytes per FP4 element, not 1 byte.
Model should load with ~15GB VRAM usage for 30B NVFP4 model.
Actual behavior
Type mismatch assertion fails immediately - runtime rejects INT8 weights for FP4 engine tensors.
When assertion is bypassed, BufferManager allocates 2x expected memory (30GB instead of 15GB for 30B model).
Results in OOM on 32GB GPU (RTX 5090) where model should fit comfortably.
Additional notes
Environment:
- TRT-LLM: v1.2.0rc4 (Docker: nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc4)
- GPU: RTX 5090 32GB (Blackwell)
- Driver: 580.xx (CUDA 13.0)
Root cause analysis:
Issue is in cpp/tensorrt_llm/runtime/tllmRuntime.cpp, function setInputTensorsImpl, around line 820.
Workaround:
I created a patch using direct cudaMalloc to bypass the faulty allocator.
Achieves 135 tok/s, 24GB VRAM usage.
Full patch: https://github.com/JohnTDI-cpu/trtllm-nvfp4-blackwell-fix