Supported Models#
The following is a table of supported models for the PyTorch backend:
| Architecture | Model | HuggingFace Example |
|---|---|---|
| | BERT-based | |
| | Nemotron | |
| | DeepSeek-V3 | |
| | EXAONE 4.0 | |
| | Gemma 3 | |
| | GPT-OSS | |
| | Llama 3.1, Llama 3, Llama 2, LLaMA | |
| | Llama 4 | |
| | Mistral | |
| | Mixtral | |
| | Llama 3.2 | |
| | Nemotron-3, Nemotron-4, Minitron | |
| | NemotronNAS | |
| | Phi-4 | |
| | QwQ, Qwen2 | |
| | Qwen2-based | |
| | Qwen2-based | |
| | Qwen3 | |
| | Qwen3MoE | |
| | Qwen3Next | |
Model-Feature Support Matrix (Key Models)#
Note: Support for other models may vary. Features marked "N/A" are not applicable to the model architecture.
| Model Architecture/Feature | Overlap Scheduler | CUDA Graph | Attention Data Parallelism | Disaggregated Serving | Chunked Prefill | MTP | EAGLE-3 (One Model Engine) | EAGLE-3 (Two Model Engine) | Torch Sampler | TLLM C++ Sampler | KV Cache Reuse | Sliding Window Attention | Logits Post Processor | Guided Decoding |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Yes | Yes | Yes | Yes | Yes[1] | Yes | No | No | Yes | Yes | Yes[2] | N/A | Yes | Yes |
| | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | N/A | Yes | Yes |
| | Yes | Yes | No | Untested | Yes | No | No | No | Yes | Yes | No | No | Untested | Untested |
| | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Untested | N/A | Yes | Yes |
| | Yes | Yes | Yes | Yes | No | No | Yes | No | Yes | Yes | No | N/A | Yes | Yes |
Multimodal Feature Support Matrix (PyTorch Backend)#
| Model Architecture/Feature | Overlap Scheduler | CUDA Graph | Chunked Prefill | Torch Sampler | TLLM C++ Sampler | KV Cache Reuse | Logits Post Processor | EPD Disaggregated Serving | Modality |
|---|---|---|---|---|---|---|---|---|---|
| | Yes | Yes | N/A | Yes | Yes | N/A | Yes | No | L + I |
| | Yes | Yes | No | Yes | Yes | Yes | Yes | No | L + I |
| | Yes | Yes | No | Yes | Yes | No | Yes | No | L + I + V |
| | Yes | Yes | No | Yes | Yes | No | Yes | No | L + I |
| | Yes | Yes | No | Yes | Yes | No | Yes | No | L + I |
| | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | L + I |
| | Yes | Yes | Yes | Yes | Yes | No | Yes | No | L + I + V |
| | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | L + I + A |
| | Yes | Yes | No | Yes | Yes | Yes | Yes | No | L + I + V |
| | Yes | Yes | No | Yes | Yes | Yes | Yes | No | L + I + V |
Note: L = Language, I = Image, V = Video, A = Audio.
[1] Chunked Prefill for MLA can only be enabled on SM100.
[2] KV cache reuse for MLA can only be enabled on SM90/SM100 and with a BF16/FP8 KV cache dtype.
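Several of the feature columns in the matrices above correspond to LLM API options that can be passed to `trtllm-serve` through an extra-options YAML file. The fragment below is a hedged sketch: the option names follow one recent TensorRT-LLM release and may differ in your installed version, and the values are illustrative, not recommendations. Verify each key against the LLM API reference before use.

```yaml
# Illustrative extra LLM API options for the PyTorch backend.
# All keys are assumptions from one TensorRT-LLM release; check your version.
enable_chunked_prefill: true      # "Chunked Prefill" column
cuda_graph_config:                # "CUDA Graph" column
  enable_padding: true
kv_cache_config:
  enable_block_reuse: true        # "KV Cache Reuse" column
speculative_config:               # "MTP" column (DeepSeek-style models only)
  decoding_type: MTP
  num_nextn_predict_layers: 1
guided_decoding_backend: xgrammar # "Guided Decoding" column
```

A file like this is typically supplied when launching the server, e.g. `trtllm-serve <model> --extra_llm_api_options <file>.yaml`.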