Description
System Info
When using trtllm-build with --gemm_plugin nvfp4 and managed weights (the default behavior, or enabled explicitly), the runtime fails to load the model due to a strict type mismatch between what the engine expects (DataType::kFP4) and what the safetensors container provides (DataType::kINT8 / kUINT8).
Additionally, bypassing the type check reveals that the BufferManager sizes the allocation from the element count as if each element were 8-bit (1 byte/elem) rather than packed 4-bit (0.5 byte/elem), doubling VRAM usage for the weights.
Reproduction Steps
Quantize Qwen 2.5/3 to NVFP4 using modelopt.
Build engine with managed weights:
```shell
trtllm-build --checkpoint_dir ... --gemm_plugin nvfp4 --output_dir ...
```
(Note: Managed weights are generated as .safetensors with I8 or U8 dtype since safetensors lacks FP4 support).
Run trtllm-serve.
Observed Behavior
- Type Mismatch Error: the runtime crashes immediately in tllmRuntime.cpp:

  ```
  [TensorRT-LLM][ERROR] Assertion failed: weight->dtype() == engine.getTensorDataType(name.c_str())
  Weight ... has dtype INT8 but engine expects FP4
  ```

- Double Allocation (OOM): if the type assertion is removed/bypassed, the model loads but consumes ~30GB VRAM for a 30B model (expected ~15GB), indicating the allocator is not handling the packed 4-bit stride correctly for managed weights.
Proposed Fix (Analysis)
The issue is located in cpp/tensorrt_llm/runtime/tllmRuntime.cpp, inside setInputTensorsImpl.
The runtime needs logic to handle NVFP4 packed weights specifically:
- Accept INT8/UINT8 input tensors when the engine expects FP4.
- Allocate memory using the correct packed size (0.5 byte per element), not num_elements * 1 byte.
Workaround used:
We successfully ran the model by patching tllmRuntime.cpp to manually use cudaMalloc with size / 2 and wrapping it via ITensor::wrap.
Environment
TRT-LLM Version: v1.2.0rc4 (Docker: release:1.2.0rc4)
GPU: RTX 5090 (Blackwell)
Model: Qwen 3 30B MoE (NVFP4)
Full patch and tutorial available here:
https://github.com/JohnTDI-cpu/trtllm-nvfp4-blackwell-fix
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Steps to reproduce:
- Quantize model to NVFP4:

```shell
python3 examples/quantization/quantize.py \
    --model_dir Qwen/Qwen3-30B-A3B-Instruct \
    --qformat nvfp4 \
    --dtype bfloat16 \
    --output_dir ./Qwen3-30B-NVFP4-Ckpt
```
- Build engine with managed weights:

```shell
trtllm-build \
    --checkpoint_dir ./Qwen3-30B-NVFP4-Ckpt \
    --gemm_plugin nvfp4 \
    --max_batch_size 1 \
    --max_seq_len 4096 \
    --output_dir ./Qwen3-30B-NVFP4-Engine
```
- Run server:

```shell
trtllm-serve serve ./Qwen3-30B-NVFP4-Engine \
    --tokenizer Qwen/Qwen3-30B-A3B-Instruct
```
Error:

```
[TensorRT-LLM][ERROR] Assertion failed: weight->dtype() == engine.getTensorDataType(name.c_str())
(../tensorrt_llm/runtime/tllmRuntime.cpp:822)
```
Expected behavior
Runtime should accept INT8/UINT8 packed weights when engine expects FP4 (this is the standard safetensors representation for 4-bit packed data).
BufferManager should allocate 0.5 bytes per FP4 element, not 1 byte.
Model should load with ~15GB VRAM usage for 30B NVFP4 model.
Actual behavior
Type mismatch assertion fails immediately - runtime rejects INT8 weights for FP4 engine tensors.
When assertion is bypassed, BufferManager allocates 2x expected memory (30GB instead of 15GB for 30B model).
Results in OOM on 32GB GPU (RTX 5090) where model should fit comfortably.
Additional notes
Environment:
- TRT-LLM: v1.2.0rc4 (Docker: nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc4)
- GPU: RTX 5090 32GB (Blackwell)
- Driver: 580.xx (CUDA 13.0)
Root cause analysis:
Issue is in cpp/tensorrt_llm/runtime/tllmRuntime.cpp, function setInputTensorsImpl, around line 820.
Workaround:
I created a patch using direct cudaMalloc to bypass the faulty allocator.
Achieves 135 tok/s, 24GB VRAM usage.
Full patch: https://github.com/JohnTDI-cpu/trtllm-nvfp4-blackwell-fix