Commit 6345074

dominicshanshan, yunruis, sunnyqgg, brb-nv, and Superjomn authored

[None][chore] Weekly mass integration of release/1.1 -- rebase (#9522)

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Mike Iovine <miovine@nvidia.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
Signed-off-by: qgai <qgai@nvidia.com>
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com>
Signed-off-by: Simeng Liu <simengl@nvidia.com>
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Signed-off-by: Vincent Zhang <vinczhang@nvidia.com>
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
Signed-off-by: Michal Guzek <mguzek@nvidia.com>
Signed-off-by: Michal Guzek <moraxu@users.noreply.github.com>
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
Co-authored-by: yunruis <205571022+yunruis@users.noreply.github.com>
Co-authored-by: sunnyqgg <159101675+sunnyqgg@users.noreply.github.com>
Co-authored-by: brb-nv <169953907+brb-nv@users.noreply.github.com>
Co-authored-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Co-authored-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>
Co-authored-by: Simeng Liu <109828133+SimengLiu-nv@users.noreply.github.com>
Co-authored-by: Guoming Zhang <137257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Co-authored-by: Vincent Zhang <vcheungyi@163.com>
Co-authored-by: peaceh-nv <103117813+peaceh-nv@users.noreply.github.com>
Co-authored-by: Michal Guzek <moraxu@users.noreply.github.com>
Co-authored-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
Co-authored-by: Leslie Fang <leslief@nvidia.com>
Co-authored-by: Shunkangz <182541032+Shunkangz@users.noreply.github.com>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>

1 parent ae0124e commit 6345074

File tree

37 files changed, +582 -120 lines changed

‎README.md‎

Lines changed: 1 addition & 1 deletion

@@ -230,7 +230,7 @@ Serverless TensorRT LLM (LLaMA 3 8B) | Modal Docs [➡️ link](https://modal.co
 
 TensorRT LLM is an open-sourced library for optimizing Large Language Model (LLM) inference. It provides state-of-the-art optimizations, including custom attention kernels, inflight batching, paged KV caching, quantization (FP8, [FP4](https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/), INT4 [AWQ](https://arxiv.org/abs/2306.00978), INT8 [SmoothQuant](https://arxiv.org/abs/2211.10438), ...), speculative decoding, and much more, to perform inference efficiently on NVIDIA GPUs.
 
-[Architected on PyTorch](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/torch/arch_overview.md), TensorRT LLM provides a high-level Python [LLM API](https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html#llm-api) that supports a wide range of inference setups - from single-GPU to multi-GPU or multi-node deployments. It includes built-in support for various parallelism strategies and advanced features. The LLM API integrates seamlessly with the broader inference ecosystem, including NVIDIA [Dynamo](https://github.com/ai-dynamo/dynamo) and the [Triton Inference Server](https://github.com/triton-inference-server/server).
+[Architected on PyTorch](https://github.com/NVIDIA/TensorRT-LLM/blob/release/1.1/docs/source/developer-guide/overview.md), TensorRT LLM provides a high-level Python [LLM API](https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html#llm-api) that supports a wide range of inference setups - from single-GPU to multi-GPU or multi-node deployments. It includes built-in support for various parallelism strategies and advanced features. The LLM API integrates seamlessly with the broader inference ecosystem, including NVIDIA [Dynamo](https://github.com/ai-dynamo/dynamo) and the [Triton Inference Server](https://github.com/triton-inference-server/server).
 
 TensorRT LLM is designed to be modular and easy to modify. Its PyTorch-native architecture allows developers to experiment with the runtime or extend functionality. Several popular models are also pre-defined and can be customized using [native PyTorch code](./tensorrt_llm/_torch/models/modeling_deepseekv3.py), making it easy to adapt the system to specific needs.

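For context, the high-level Python LLM API referenced in the updated paragraph can be exercised with a few lines in the spirit of the linked quick-start guide. This is a minimal sketch; the model name and sampling settings are illustrative and are not part of this change:

    # Minimal sketch of the Python LLM API described above.
    from tensorrt_llm import LLM, SamplingParams

    def main():
        prompts = ["Hello, my name is", "The capital of France is"]
        sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

        # Load the model and initialize the PyTorch-backed runtime.
        llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

        # Batched generation; each result carries the generated text.
        for output in llm.generate(prompts, sampling_params):
            print(output.outputs[0].text)

    if __name__ == "__main__":
        main()
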
‎cpp/kernels/fmha_v2/setup.py‎

Lines changed: 10 additions & 0 deletions

@@ -6398,6 +6398,16 @@ def enumerate_kernels():
                 and kspec.cross_mha == False
                 and kspec.flash_attention == True
                 and kspec.input_layout != InputLayout.SEPARATE_Q_K_V)
+            # Gemma3 VL support.
+            or (kspec.sm == 100
+                and kspec.dtype in ['fp16', 'bf16', 'fp16_fp32', 'e4m3', 'e4m3_fp32']
+                and kspec.head_size == 72
+                and kspec.head_size_v == 0
+                and kspec.sage_block_sizes is None
+                and kspec.version == 2
+                and kspec.cross_mha == False
+                and kspec.flash_attention == True
+                and kspec.input_layout != InputLayout.SEPARATE_Q_K_V)
             # Deepseek MLA (generation 576/512 paged)
             or (kspec.sm in [90, 100, 120]
                 and kspec.dtype in ['bf16', 'e4m3_fp32']

‎cpp/tensorrt_llm/common/opUtils.cpp‎

Lines changed: 94 additions & 12 deletions

@@ -179,16 +179,24 @@ class PerCudaCtxPerThreadSingletonCreator
     PerCudaCtxPerThreadSingletonCreator(CreatorFunc creator, DeleterFunc deleter)
         : mCreator{std::move(creator)}
         , mDeleter{std::move(deleter)}
+        , mObservers{new std::unordered_map<CacheKey, std::weak_ptr<T>, hash<CacheKey>>()}
     {
     }
 
+    ~PerCudaCtxPerThreadSingletonCreator()
+    {
+        std::lock_guard<std::mutex> lk{mMutex};
+        delete mObservers;
+        mObservers = nullptr;
+    }
+
     std::shared_ptr<T> operator()()
     {
         std::lock_guard<std::mutex> lk{mMutex};
         CUcontext ctx{getCurrentCudaCtx()};
         std::thread::id thread = std::this_thread::get_id();
         auto const key = std::make_tuple(ctx, thread);
-        std::shared_ptr<T> result = mObservers[key].lock();
+        std::shared_ptr<T> result = (*mObservers)[key].lock();
         if (result == nullptr)
         {
             TLLM_LOG_TRACE("creating singleton instance for CUDA context %lu and thread %lu", ctx, thread);

@@ -202,6 +210,11 @@ class PerCudaCtxPerThreadSingletonCreator
                     }
                     mDeleter(obj);
 
+                    if (mObservers == nullptr)
+                    {
+                        return;
+                    }
+
                     // Clears observer to avoid growth of mObservers, in case users creates/destroys cuda contexts
                     // frequently.
                     std::shared_ptr<T> observedObjHolder; // Delay destroy to avoid dead lock.

@@ -210,17 +223,18 @@ class PerCudaCtxPerThreadSingletonCreator
                     // thread just before we lock mMutex. We can't infer that the observer is stale from the fact that
                     // obj is destroyed, because shared_ptr ref-count checking and observer removing are not in one
                     // atomic operation, and the observer may be changed to observe another instance.
-                    if (mObservers.find(key) == mObservers.end())
+                    auto it = mObservers->find(key);
+                    if (it == mObservers->end())
                     {
                         return;
                     }
-                    observedObjHolder = mObservers.at(key).lock();
+                    observedObjHolder = it->second.lock();
                     if (observedObjHolder == nullptr)
                     {
-                        mObservers.erase(key);
+                        mObservers->erase(it);
                     }
                 }};
-            mObservers.at(key) = result;
+            (*mObservers)[key] = result;
         }
         else
         {

@@ -235,24 +249,78 @@ class PerCudaCtxPerThreadSingletonCreator
     mutable std::mutex mMutex;
     // CUDA resources are per-context and per-thread.
     using CacheKey = std::tuple<CUcontext, std::thread::id>;
-    std::unordered_map<CacheKey, std::weak_ptr<T>, hash<CacheKey>> mObservers;
+    std::unordered_map<CacheKey, std::weak_ptr<T>, hash<CacheKey>>* mObservers;
+};
+
+// Structure to hold memory information
+struct MemoryInfo
+{
+    size_t free_mb;
+    size_t total_mb;
+    float free_percent;
 };
 
+// Helper function to get current memory information
+MemoryInfo getMemoryInfo()
+{
+    size_t free_mem = 0, total_mem = 0;
+    TLLM_CUDA_CHECK(cudaMemGetInfo(&free_mem, &total_mem));
+
+    size_t const free_mb = free_mem / (1024 * 1024);
+    size_t const total_mb = total_mem / (1024 * 1024);
+    float const free_percent = (total_mem > 0) ? (static_cast<float>(free_mem) / total_mem * 100.0f) : 0.0f;
+
+    return {free_mb, total_mb, free_percent};
+}
+
+// Helper function to log current memory usage
+void logMemoryUsage(char const* operation, CUcontext ctx)
+{
+    auto const mem = getMemoryInfo();
+    TLLM_LOG_DEBUG("%s: Context=%p, Free Memory=%zu MB (%.1f%%), Total=%zu MB", operation, ctx, mem.free_mb,
+        mem.free_percent, mem.total_mb);
+}
+
+// Helper function to throw
+void throwCublasErrorWithMemInfo(char const* operation, CUcontext ctx, cublasStatus_t status)
+{
+    auto const mem = getMemoryInfo();
+    TLLM_THROW(
+        "Failed to create %s."
+        "Status: %d, Context: %p, Free Memory: %zu MB (%.1f%%), Total: %zu MB."
+        "Consider reducing kv_cache_config.free_gpu_memory_fraction.",
+        operation, status, ctx, mem.free_mb, mem.free_percent, mem.total_mb);
+}
+
 } // namespace
 
 std::shared_ptr<cublasHandle_t> getCublasHandle()
 {
     static PerCudaCtxPerThreadSingletonCreator<cublasHandle_t> creator(
         []() -> auto
         {
-            auto handle = std::unique_ptr<cublasHandle_t>(new cublasHandle_t);
-            TLLM_CUDA_CHECK(cublasCreate(handle.get()));
+            CUcontext ctx = getCurrentCudaCtx();
+            logMemoryUsage("Creating cublas handle", ctx);
+
+            auto handle = std::make_unique<cublasHandle_t>();
+            auto status = cublasCreate(handle.get());
+
+            if (status != CUBLAS_STATUS_SUCCESS)
+            {
+                throwCublasErrorWithMemInfo("cublas handle", ctx, status);
+            }
+
             return handle;
         },
         [](cublasHandle_t* handle)
         {
-            TLLM_CUDA_CHECK(cublasDestroy(*handle));
+            auto status = cublasDestroy(*handle);
+            if (status != CUBLAS_STATUS_SUCCESS)
+            {
+                TLLM_LOG_WARNING("Failed to destroy cublas handle. Status: %d", status);
+            }
             delete handle;
+            handle = nullptr;
         });
     return creator();
 }

@@ -262,14 +330,28 @@ std::shared_ptr<cublasLtHandle_t> getCublasLtHandle()
     static PerCudaCtxPerThreadSingletonCreator<cublasLtHandle_t> creator(
         []() -> auto
         {
-            auto handle = std::unique_ptr<cublasLtHandle_t>(new cublasLtHandle_t);
-            TLLM_CUDA_CHECK(cublasLtCreate(handle.get()));
+            CUcontext ctx = getCurrentCudaCtx();
+            logMemoryUsage("Creating cublasLt handle", ctx);
+
+            auto handle = std::make_unique<cublasLtHandle_t>();
+            auto status = cublasLtCreate(handle.get());
+
+            if (status != CUBLAS_STATUS_SUCCESS)
+            {
+                throwCublasErrorWithMemInfo("cublasLt handle", ctx, status);
+            }
+
             return handle;
         },
         [](cublasLtHandle_t* handle)
         {
-            TLLM_CUDA_CHECK(cublasLtDestroy(*handle));
+            auto status = cublasLtDestroy(*handle);
+            if (status != CUBLAS_STATUS_SUCCESS)
+            {
+                TLLM_LOG_WARNING("Failed to destroy cublasLt handle. Status: %d", status);
+            }
             delete handle;
+            handle = nullptr;
         });
     return creator();
 }

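The new failure path above reports free-memory statistics and suggests lowering kv_cache_config.free_gpu_memory_fraction when cuBLAS/cublasLt handle creation fails under memory pressure. For reference, a minimal sketch of adjusting that knob from the Python LLM API follows; the tensorrt_llm.llmapi.KvCacheConfig import path, the fraction value, and the model name are assumptions for illustration, not code from this commit:

    # Sketch: lowering the KV-cache memory fraction the new error message refers to.
    from tensorrt_llm import LLM
    from tensorrt_llm.llmapi import KvCacheConfig  # import path is an assumption

    # Reserve a smaller fraction of free GPU memory for the KV cache so that
    # libraries such as cuBLAS still have room to create their handles/workspaces.
    kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.8)

    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", kv_cache_config=kv_cache_config)
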
‎cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp‎

Lines changed: 3 additions & 3 deletions

@@ -177,13 +177,13 @@ std::vector<CutlassTileConfig> get_candidate_tiles(
     {
         if (sm == 89 || sm >= 120)
         {
-            return {CutlassTileConfig::CtaShape16x256x128_WarpShape16x64x128,
-                CutlassTileConfig::CtaShape32x128x64_WarpShape32x32x64,
+            return {CutlassTileConfig::CtaShape32x128x64_WarpShape32x32x64,
                 CutlassTileConfig::CtaShape64x128x64_WarpShape64x32x64,
                 CutlassTileConfig::CtaShape64x64x128_WarpShape32x64x64,
                 CutlassTileConfig::CtaShape128x64x64_WarpShape64x32x64,
                 CutlassTileConfig::CtaShape128x256x64_WarpShape64x64x64,
-                CutlassTileConfig::CtaShape256x128x64_WarpShape64x64x64};
+                CutlassTileConfig::CtaShape256x128x64_WarpShape64x64x64,
+                CutlassTileConfig::CtaShape16x256x128_WarpShape16x64x128};
         }
         else
         {

‎cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp‎

Lines changed: 1 addition & 1 deletion

@@ -49,7 +49,7 @@ FmhaDispatcher::FmhaDispatcher(MHARunnerFixedParams fixedParams)
     // TRTLLM-GEN only supports power of 2 head sizes.
     // The exception will fall back to fmha v2.
     // Please update fmha_v2/setup.py if you want to add more supported head sizes.
-    , mUseTllmGen(tensorrt_llm::common::isSM100Family() && fixedParams.headSize != 80)
+    , mUseTllmGen(tensorrt_llm::common::isSM100Family() && fixedParams.headSize != 80 && fixedParams.headSize != 72)
 {
     if (mUseTllmGen)
     {

‎docs/source/blogs/H100vsA100.md‎

Lines changed: 1 addition & 1 deletion

@@ -28,7 +28,7 @@ TensorRT LLM evaluated on both Hopper and Ampere shows **H100 FP8 is up to 4.6x
 
 <sub>FP8 H100, FP16 A100, SXM 80GB GPUs, TP1, ISL/OSL's provided, TensorRT LLM v0.5.0., TensorRT 9.1</sub>
 
-The full data behind these charts & tables and including larger models with higher TP values can be found in TensorRT LLM's [Performance Documentation](https://nvidia.github.io/TensorRT-LLM/latest/performance/perf-overview.html)
+The full data behind these charts & tables and including larger models with higher TP values can be found in TensorRT LLM's [Performance Documentation](https://nvidia.github.io/TensorRT-LLM/0.21.0/performance/perf-overview.html)
 
 Stay tuned for a highlight on Llama coming soon!

‎docs/source/blogs/H200launch.md‎

Lines changed: 1 addition & 1 deletion

@@ -21,7 +21,7 @@ TensorRT LLM evaluation of the [new H200 GPU](https://nvidianews.nvidia.com/news
 
 <sup>*(1) Largest batch supported on given TP configuration by power of 2.*</sup> <sup>*(2) TP = Tensor Parallelism*</sup>
 
-Additional Performance data is available on the [NVIDIA Data Center Deep Learning Product Performance](https://developer.nvidia.com/deep-learning-performance-training-inference/ai-inference) page, & soon in [TensorRT LLM's Performance Documentation](https://nvidia.github.io/TensorRT-LLM/latest/performance/perf-overview.html).
+Additional Performance data is available on the [NVIDIA Data Center Deep Learning Product Performance](https://developer.nvidia.com/deep-learning-performance-training-inference/ai-inference) page, & soon in [TensorRT LLM's Performance Documentation](https://nvidia.github.io/TensorRT-LLM/0.21.0/performance/perf-overview.html).
 
 ### H200 vs H100

‎docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md‎

Lines changed: 1 addition & 1 deletion

@@ -124,7 +124,7 @@ In the Dynamo workflow, requests are initially processed by pre- and post-proces
 
 Dynamo also includes built-in support for Kubernetes deployment, monitoring, and metrics collection. The development team is actively working on enabling dynamic instance scaling, further enhancing its suitability for production environments.
 
-For more information on how to use Dynamo with TensorRT LLM, please refer to [this documentation](https://docs.nvidia.com/dynamo/latest/examples/trtllm.html).
+For more information on how to use Dynamo with TensorRT LLM, please refer to [this documentation](https://docs.nvidia.com/dynamo/latest/backends/trtllm/README.html).
 
 ### Triton Inference Server

‎docs/source/features/disagg-serving.md‎

Lines changed: 1 addition & 1 deletion

@@ -94,7 +94,7 @@ In the Dynamo workflow, requests are initially processed by pre- and post-proces
 
 Dynamo also includes built-in support for Kubernetes deployment, monitoring, and metrics collection. The development team is actively working on enabling dynamic instance scaling, further enhancing its suitability for production environments.
 
-For more information on how to use Dynamo with TensorRT-LLM, please refer to [this documentation](https://docs.nvidia.com/dynamo/latest/examples/trtllm.html).
+For more information on how to use Dynamo with TensorRT-LLM, please refer to [this documentation](https://docs.nvidia.com/dynamo/latest/backends/trtllm/README.html).
 
 ### trtllm-serve

‎docs/source/index.rst‎

Lines changed: 1 addition & 0 deletions

@@ -77,6 +77,7 @@ Welcome to TensorRT LLM's Documentation!
    features/ray-orchestrator.md
    features/torch_compile_and_piecewise_cuda_graph.md
 
+
 .. toctree::
    :maxdepth: 2
    :caption: Developer Guide

0 commit comments
