Commit 6345074

dominicshanshan, yunruis, sunnyqgg, brb-nv, and Superjomn authored

[None][chore] Weekly mass integration of release/1.1 -- rebase (#9522)

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Mike Iovine <miovine@nvidia.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
Signed-off-by: qgai <qgai@nvidia.com>
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com>
Signed-off-by: Simeng Liu <simengl@nvidia.com>
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Signed-off-by: Vincent Zhang <vinczhang@nvidia.com>
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
Signed-off-by: Michal Guzek <mguzek@nvidia.com>
Signed-off-by: Michal Guzek <moraxu@users.noreply.github.com>
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
Co-authored-by: yunruis <205571022+yunruis@users.noreply.github.com>
Co-authored-by: sunnyqgg <159101675+sunnyqgg@users.noreply.github.com>
Co-authored-by: brb-nv <169953907+brb-nv@users.noreply.github.com>
Co-authored-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Co-authored-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>
Co-authored-by: Simeng Liu <109828133+SimengLiu-nv@users.noreply.github.com>
Co-authored-by: Guoming Zhang <137257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Co-authored-by: Vincent Zhang <vcheungyi@163.com>
Co-authored-by: peaceh-nv <103117813+peaceh-nv@users.noreply.github.com>
Co-authored-by: Michal Guzek <moraxu@users.noreply.github.com>
Co-authored-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
Co-authored-by: Leslie Fang <leslief@nvidia.com>
Co-authored-by: Shunkangz <182541032+Shunkangz@users.noreply.github.com>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>

1 parent ae0124e commit 6345074

File tree

37 files changed, +582 -120 lines changed

‎README.md‎

Lines changed: 1 addition & 1 deletion

@@ -230,7 +230,7 @@ Serverless TensorRT LLM (LLaMA 3 8B) | Modal Docs [➡️ link](https://modal.co
 
 TensorRT LLM is an open-sourced library for optimizing Large Language Model (LLM) inference. It provides state-of-the-art optimizations, including custom attention kernels, inflight batching, paged KV caching, quantization (FP8, [FP4](https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/), INT4 [AWQ](https://arxiv.org/abs/2306.00978), INT8 [SmoothQuant](https://arxiv.org/abs/2211.10438), ...), speculative decoding, and much more, to perform inference efficiently on NVIDIA GPUs.
 
-[Architected on PyTorch](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/torch/arch_overview.md), TensorRT LLM provides a high-level Python [LLM API](https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html#llm-api) that supports a wide range of inference setups - from single-GPU to multi-GPU or multi-node deployments. It includes built-in support for various parallelism strategies and advanced features. The LLM API integrates seamlessly with the broader inference ecosystem, including NVIDIA [Dynamo](https://github.com/ai-dynamo/dynamo) and the [Triton Inference Server](https://github.com/triton-inference-server/server).
+[Architected on PyTorch](https://github.com/NVIDIA/TensorRT-LLM/blob/release/1.1/docs/source/developer-guide/overview.md), TensorRT LLM provides a high-level Python [LLM API](https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html#llm-api) that supports a wide range of inference setups - from single-GPU to multi-GPU or multi-node deployments. It includes built-in support for various parallelism strategies and advanced features. The LLM API integrates seamlessly with the broader inference ecosystem, including NVIDIA [Dynamo](https://github.com/ai-dynamo/dynamo) and the [Triton Inference Server](https://github.com/triton-inference-server/server).
 
 TensorRT LLM is designed to be modular and easy to modify. Its PyTorch-native architecture allows developers to experiment with the runtime or extend functionality. Several popular models are also pre-defined and can be customized using [native PyTorch code](./tensorrt_llm/_torch/models/modeling_deepseekv3.py), making it easy to adapt the system to specific needs.

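For context, the high-level Python LLM API referenced in the updated paragraph can be exercised with a few lines in the spirit of the linked quick-start guide. This is a minimal sketch; the model name and sampling settings are illustrative and are not part of this change:

    # Minimal sketch of the Python LLM API described above.
    from tensorrt_llm import LLM, SamplingParams

    def main():
        prompts = ["Hello, my name is", "The capital of France is"]
        sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

        # Load the model and initialize the PyTorch-backed runtime.
        llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

        # Batched generation; each result carries the generated text.
        for output in llm.generate(prompts, sampling_params):
            print(output.outputs[0].text)

    if __name__ == "__main__":
        main()
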
‎cpp/kernels/fmha_v2/setup.py‎

Lines changed: 10 additions & 0 deletions

@@ -6398,6 +6398,16 @@ def enumerate_kernels():
                 and kspec.cross_mha == False
                 and kspec.flash_attention == True
                 and kspec.input_layout != InputLayout.SEPARATE_Q_K_V)
+            # Gemma3 VL support.
+            or (kspec.sm == 100
+                and kspec.dtype in ['fp16', 'bf16', 'fp16_fp32', 'e4m3', 'e4m3_fp32']
+                and kspec.head_size == 72
+                and kspec.head_size_v == 0
+                and kspec.sage_block_sizes is None
+                and kspec.version == 2
+                and kspec.cross_mha == False
+                and kspec.flash_attention == True
+                and kspec.input_layout != InputLayout.SEPARATE_Q_K_V)
             # Deepseek MLA (generation 576/512 paged)
             or (kspec.sm in [90, 100, 120]
                 and kspec.dtype in ['bf16', 'e4m3_fp32']

‎cpp/tensorrt_llm/common/opUtils.cpp‎

Lines changed: 94 additions & 12 deletions

@@ -179,16 +179,24 @@ class PerCudaCtxPerThreadSingletonCreator
     PerCudaCtxPerThreadSingletonCreator(CreatorFunc creator, DeleterFunc deleter)
         : mCreator{std::move(creator)}
         , mDeleter{std::move(deleter)}
+        , mObservers{new std::unordered_map<CacheKey, std::weak_ptr<T>, hash<CacheKey>>()}
     {
     }
 
+    ~PerCudaCtxPerThreadSingletonCreator()
+    {
+        std::lock_guard<std::mutex> lk{mMutex};
+        delete mObservers;
+        mObservers = nullptr;
+    }
+
     std::shared_ptr<T> operator()()
     {
         std::lock_guard<std::mutex> lk{mMutex};
         CUcontext ctx{getCurrentCudaCtx()};
         std::thread::id thread = std::this_thread::get_id();
         auto const key = std::make_tuple(ctx, thread);
-        std::shared_ptr<T> result = mObservers[key].lock();
+        std::shared_ptr<T> result = (*mObservers)[key].lock();
         if (result == nullptr)
         {
             TLLM_LOG_TRACE("creating singleton instance for CUDA context %lu and thread %lu", ctx, thread);

@@ -202,6 +210,11 @@ class PerCudaCtxPerThreadSingletonCreator
                     }
                     mDeleter(obj);
 
+                    if (mObservers == nullptr)
+                    {
+                        return;
+                    }
+
                     // Clears observer to avoid growth of mObservers, in case users creates/destroys cuda contexts
                     // frequently.
                     std::shared_ptr<T> observedObjHolder; // Delay destroy to avoid dead lock.

@@ -210,17 +223,18 @@ class PerCudaCtxPerThreadSingletonCreator
                     // thread just before we lock mMutex. We can't infer that the observer is stale from the fact that
                     // obj is destroyed, because shared_ptr ref-count checking and observer removing are not in one
                     // atomic operation, and the observer may be changed to observe another instance.
-                    if (mObservers.find(key) == mObservers.end())
+                    auto it = mObservers->find(key);
+                    if (it == mObservers->end())
                     {
                         return;
                     }
-                    observedObjHolder = mObservers.at(key).lock();
+                    observedObjHolder = it->second.lock();
                     if (observedObjHolder == nullptr)
                     {
-                        mObservers.erase(key);
+                        mObservers->erase(it);
                     }
                 }};
-            mObservers.at(key) = result;
+            (*mObservers)[key] = result;
         }
         else
         {

@@ -235,24 +249,78 @@ class PerCudaCtxPerThreadSingletonCreator
     mutable std::mutex mMutex;
     // CUDA resources are per-context and per-thread.
     using CacheKey = std::tuple<CUcontext, std::thread::id>;
-    std::unordered_map<CacheKey, std::weak_ptr<T>, hash<CacheKey>> mObservers;
+    std::unordered_map<CacheKey, std::weak_ptr<T>, hash<CacheKey>>* mObservers;
+};
+
+// Structure to hold memory information
+struct MemoryInfo
+{
+    size_t free_mb;
+    size_t total_mb;
+    float free_percent;
 };
 
+// Helper function to get current memory information
+MemoryInfo getMemoryInfo()
+{
+    size_t free_mem = 0, total_mem = 0;
+    TLLM_CUDA_CHECK(cudaMemGetInfo(&free_mem, &total_mem));
+
+    size_t const free_mb = free_mem / (1024 * 1024);
+    size_t const total_mb = total_mem / (1024 * 1024);
+    float const free_percent = (total_mem > 0) ? (static_cast<float>(free_mem) / total_mem * 100.0f) : 0.0f;
+
+    return {free_mb, total_mb, free_percent};
+}
+
+// Helper function to log current memory usage
+void logMemoryUsage(char const* operation, CUcontext ctx)
+{
+    auto const mem = getMemoryInfo();
+    TLLM_LOG_DEBUG("%s: Context=%p, Free Memory=%zu MB (%.1f%%), Total=%zu MB", operation, ctx, mem.free_mb,
+        mem.free_percent, mem.total_mb);
+}
+
+// Helper function to throw
+void throwCublasErrorWithMemInfo(char const* operation, CUcontext ctx, cublasStatus_t status)
+{
+    auto const mem = getMemoryInfo();
+    TLLM_THROW(
+        "Failed to create %s."
+        "Status: %d, Context: %p, Free Memory: %zu MB (%.1f%%), Total: %zu MB."
+        "Consider reducing kv_cache_config.free_gpu_memory_fraction.",
+        operation, status, ctx, mem.free_mb, mem.free_percent, mem.total_mb);
+}
+
 } // namespace
 
 std::shared_ptr<cublasHandle_t> getCublasHandle()
 {
     static PerCudaCtxPerThreadSingletonCreator<cublasHandle_t> creator(
         []() -> auto
         {
-            auto handle = std::unique_ptr<cublasHandle_t>(new cublasHandle_t);
-            TLLM_CUDA_CHECK(cublasCreate(handle.get()));
+            CUcontext ctx = getCurrentCudaCtx();
+            logMemoryUsage("Creating cublas handle", ctx);
+
+            auto handle = std::make_unique<cublasHandle_t>();
+            auto status = cublasCreate(handle.get());
+
+            if (status != CUBLAS_STATUS_SUCCESS)
+            {
+                throwCublasErrorWithMemInfo("cublas handle", ctx, status);
+            }
+
             return handle;
         },
         [](cublasHandle_t* handle)
         {
-            TLLM_CUDA_CHECK(cublasDestroy(*handle));
+            auto status = cublasDestroy(*handle);
+            if (status != CUBLAS_STATUS_SUCCESS)
+            {
+                TLLM_LOG_WARNING("Failed to destroy cublas handle. Status: %d", status);
+            }
             delete handle;
+            handle = nullptr;
         });
     return creator();
 }

@@ -262,14 +330,28 @@ std::shared_ptr<cublasLtHandle_t> getCublasLtHandle()
     static PerCudaCtxPerThreadSingletonCreator<cublasLtHandle_t> creator(
         []() -> auto
         {
-            auto handle = std::unique_ptr<cublasLtHandle_t>(new cublasLtHandle_t);
-            TLLM_CUDA_CHECK(cublasLtCreate(handle.get()));
+            CUcontext ctx = getCurrentCudaCtx();
+            logMemoryUsage("Creating cublasLt handle", ctx);
+
+            auto handle = std::make_unique<cublasLtHandle_t>();
+            auto status = cublasLtCreate(handle.get());
+
+            if (status != CUBLAS_STATUS_SUCCESS)
+            {
+                throwCublasErrorWithMemInfo("cublasLt handle", ctx, status);
+            }
+
             return handle;
         },
         [](cublasLtHandle_t* handle)
         {
-            TLLM_CUDA_CHECK(cublasLtDestroy(*handle));
+            auto status = cublasLtDestroy(*handle);
+            if (status != CUBLAS_STATUS_SUCCESS)
+            {
+                TLLM_LOG_WARNING("Failed to destroy cublasLt handle. Status: %d", status);
+            }
             delete handle;
+            handle = nullptr;
         });
     return creator();
 }

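The new failure path above reports free-memory statistics and suggests lowering kv_cache_config.free_gpu_memory_fraction when cuBLAS/cublasLt handle creation fails under memory pressure. For reference, a minimal sketch of adjusting that knob from the Python LLM API follows; the tensorrt_llm.llmapi.KvCacheConfig import path, the fraction value, and the model name are assumptions for illustration, not code from this commit:

    # Sketch: lowering the KV-cache memory fraction the new error message refers to.
    from tensorrt_llm import LLM
    from tensorrt_llm.llmapi import KvCacheConfig  # import path is an assumption

    # Reserve a smaller fraction of free GPU memory for the KV cache so that
    # libraries such as cuBLAS still have room to create their handles/workspaces.
    kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.8)

    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", kv_cache_config=kv_cache_config)
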
‎cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp‎

Lines changed: 3 additions & 3 deletions

@@ -177,13 +177,13 @@ std::vector<CutlassTileConfig> get_candidate_tiles(
     {
         if (sm == 89 || sm >= 120)
         {
-            return {CutlassTileConfig::CtaShape16x256x128_WarpShape16x64x128,
-                CutlassTileConfig::CtaShape32x128x64_WarpShape32x32x64,
+            return {CutlassTileConfig::CtaShape32x128x64_WarpShape32x32x64,
                 CutlassTileConfig::CtaShape64x128x64_WarpShape64x32x64,
                 CutlassTileConfig::CtaShape64x64x128_WarpShape32x64x64,
                 CutlassTileConfig::CtaShape128x64x64_WarpShape64x32x64,
                 CutlassTileConfig::CtaShape128x256x64_WarpShape64x64x64,
-                CutlassTileConfig::CtaShape256x128x64_WarpShape64x64x64};
+                CutlassTileConfig::CtaShape256x128x64_WarpShape64x64x64,
+                CutlassTileConfig::CtaShape16x256x128_WarpShape16x64x128};
         }
         else
         {

‎cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp‎

Lines changed: 1 addition & 1 deletion

@@ -49,7 +49,7 @@ FmhaDispatcher::FmhaDispatcher(MHARunnerFixedParams fixedParams)
     // TRTLLM-GEN only supports power of 2 head sizes.
     // The exception will fall back to fmha v2.
     // Please update fmha_v2/setup.py if you want to add more supported head sizes.
-    , mUseTllmGen(tensorrt_llm::common::isSM100Family() && fixedParams.headSize != 80)
+    , mUseTllmGen(tensorrt_llm::common::isSM100Family() && fixedParams.headSize != 80 && fixedParams.headSize != 72)
 {
     if (mUseTllmGen)
     {

‎docs/source/blogs/H100vsA100.md‎

Lines changed: 1 addition & 1 deletion

@@ -28,7 +28,7 @@ TensorRT LLM evaluated on both Hopper and Ampere shows **H100 FP8 is up to 4.6x
 
 <sub>FP8 H100, FP16 A100, SXM 80GB GPUs, TP1, ISL/OSL's provided, TensorRT LLM v0.5.0., TensorRT 9.1</sub>
 
-The full data behind these charts & tables and including larger models with higher TP values can be found in TensorRT LLM's [Performance Documentation](https://nvidia.github.io/TensorRT-LLM/latest/performance/perf-overview.html)
+The full data behind these charts & tables and including larger models with higher TP values can be found in TensorRT LLM's [Performance Documentation](https://nvidia.github.io/TensorRT-LLM/0.21.0/performance/perf-overview.html)
 
 Stay tuned for a highlight on Llama coming soon!

‎docs/source/blogs/H200launch.md‎

Lines changed: 1 addition & 1 deletion

@@ -21,7 +21,7 @@ TensorRT LLM evaluation of the [new H200 GPU](https://nvidianews.nvidia.com/news
 
 <sup>*(1) Largest batch supported on given TP configuration by power of 2.*</sup> <sup>*(2) TP = Tensor Parallelism*</sup>
 
-Additional Performance data is available on the [NVIDIA Data Center Deep Learning Product Performance](https://developer.nvidia.com/deep-learning-performance-training-inference/ai-inference) page, & soon in [TensorRT LLM's Performance Documentation](https://nvidia.github.io/TensorRT-LLM/latest/performance/perf-overview.html).
+Additional Performance data is available on the [NVIDIA Data Center Deep Learning Product Performance](https://developer.nvidia.com/deep-learning-performance-training-inference/ai-inference) page, & soon in [TensorRT LLM's Performance Documentation](https://nvidia.github.io/TensorRT-LLM/0.21.0/performance/perf-overview.html).
 
 ### H200 vs H100

‎docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md‎

Lines changed: 1 addition & 1 deletion

@@ -124,7 +124,7 @@ In the Dynamo workflow, requests are initially processed by pre- and post-proces
 
 Dynamo also includes built-in support for Kubernetes deployment, monitoring, and metrics collection. The development team is actively working on enabling dynamic instance scaling, further enhancing its suitability for production environments.
 
-For more information on how to use Dynamo with TensorRT LLM, please refer to [this documentation](https://docs.nvidia.com/dynamo/latest/examples/trtllm.html).
+For more information on how to use Dynamo with TensorRT LLM, please refer to [this documentation](https://docs.nvidia.com/dynamo/latest/backends/trtllm/README.html).
 
 ### Triton Inference Server

‎docs/source/features/disagg-serving.md‎

Lines changed: 1 addition & 1 deletion

@@ -94,7 +94,7 @@ In the Dynamo workflow, requests are initially processed by pre- and post-proces
 
 Dynamo also includes built-in support for Kubernetes deployment, monitoring, and metrics collection. The development team is actively working on enabling dynamic instance scaling, further enhancing its suitability for production environments.
 
-For more information on how to use Dynamo with TensorRT-LLM, please refer to [this documentation](https://docs.nvidia.com/dynamo/latest/examples/trtllm.html).
+For more information on how to use Dynamo with TensorRT-LLM, please refer to [this documentation](https://docs.nvidia.com/dynamo/latest/backends/trtllm/README.html).
 
 ### trtllm-serve

‎docs/source/index.rst‎

Lines changed: 1 addition & 0 deletions

@@ -77,6 +77,7 @@ Welcome to TensorRT LLM's Documentation!
    features/ray-orchestrator.md
    features/torch_compile_and_piecewise_cuda_graph.md
 
+
 .. toctree::
    :maxdepth: 2
    :caption: Developer Guide

0 commit comments
