
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.


Zars19/TensorRT-LLM

 
 


A TensorRT Toolbox for Optimized Large Language Model Inference




Latest News

  • [2024/10/07] 🚀🚀🚀 Optimizing Microsoft Bing Visual Search with NVIDIA Accelerated Libraries ➡️ link

  • [2024/09/29] 🌟 AI at Meta PyTorch + TensorRT v2.4 🌟 ⚡ TensorRT 10.1 ⚡ PyTorch 2.4 ⚡ CUDA 12.4 ⚡ Python 3.12 ➡️ link

  • [2024/09/17] ✨ NVIDIA TensorRT-LLM Meetup➡️ link

  • [2024/09/17] ✨ Accelerating LLM Inference at Databricks with TensorRT-LLM➡️ link

  • [2024/09/17] ✨ TensorRT-LLM @ Baseten➡️ link

  • [2024/09/04] 🏎️🏎️🏎️ Best Practices for Tuning TensorRT-LLM for Optimal Serving with BentoML➡️ link

  • [2024/08/20] 🏎️SDXL with #TensorRT Model Optimizer ⏱️⚡ 🏁 cache diffusion 🏁 quantization aware training 🏁 QLoRA 🏁 #Python 3.12➡️ link

  • [2024/08/13] 🐍 DIY Code Completion with #Mamba ⚡ #TensorRT #LLM for speed 🤖 NIM for ease ☁️ deploy anywhere➡️ link

  • [2024/08/06] 🗫 Multilingual Challenge Accepted 🗫🤖 #TensorRT #LLM boosts low-resource languages like Hebrew, Indonesian and Vietnamese ⚡➡️ link

  • [2024/07/30] Introducing 🍊 @SliceXAI ELM Turbo 🤖 train ELM once ⚡ #TensorRT #LLM optimize ☁️ deploy anywhere ➡️ link

  • [2024/07/23] 👀 @AIatMeta Llama 3.1 405B trained on 16K NVIDIA H100s - inference is #TensorRT #LLM optimized ⚡ 🦙 400 tok/s - per node 🦙 37 tok/s - per user 🦙 1 node inference ➡️ link

  • [2024/07/09] Checklist to maximize multi-language performance of @meta #Llama3 with #TensorRT #LLM inference: ✅ MultiLingual ✅ NIM ✅ LoRA tuned adaptors ➡️ Tech blog

  • [2024/07/02] Let the @MistralAI MoE tokens fly 📈 🚀 #Mixtral 8x7B with NVIDIA #TensorRT #LLM on #H100.➡️ Tech blog


TensorRT-LLM Overview

TensorRT-LLM is a library for optimizing Large Language Model (LLM) inference. It provides state-of-the-art optimizations, including custom attention kernels, inflight batching, paged KV caching, and quantization (FP8, INT4 AWQ, INT8 SmoothQuant, and more), to perform inference efficiently on NVIDIA GPUs.
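To make the idea behind paged KV caching more concrete, here is a minimal pure-Python sketch. It is not TensorRT-LLM code: the `PagedKVCache` class, its methods, and the `blocks_in_use` helper are hypothetical names chosen for illustration. It only shows the general technique, in which sequences borrow fixed-size blocks from a shared pool instead of each reserving a contiguous buffer sized for the maximum length.

```python
from dataclasses import dataclass, field


@dataclass
class PagedKVCache:
    """Toy block allocator (hypothetical, not the TensorRT-LLM implementation)."""
    num_blocks: int   # total blocks in the shared pool
    block_size: int   # tokens that fit in one block
    free_blocks: list = field(init=False, default_factory=list)
    block_tables: dict = field(init=False, default_factory=dict)  # seq_id -> block ids
    seq_lens: dict = field(init=False, default_factory=dict)      # seq_id -> token count

    def __post_init__(self):
        # Every block starts out free.
        self.free_blocks = list(range(self.num_blocks))

    def append_token(self, seq_id):
        """Reserve room for one more token; allocate a block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:  # no blocks yet, or all blocks are full
            if not self.free_blocks:
                raise MemoryError("KV cache pool exhausted")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


def blocks_in_use(cache):
    """Number of blocks currently owned by some sequence."""
    return cache.num_blocks - len(cache.free_blocks)


if __name__ == "__main__":
    cache = PagedKVCache(num_blocks=4, block_size=2)
    for _ in range(3):
        cache.append_token("a")   # 3 tokens need 2 blocks of size 2
    cache.append_token("b")       # 1 token needs 1 block
    print(blocks_in_use(cache))
    cache.release("a")            # blocks from "a" become reusable
    print(blocks_in_use(cache))
```

Because blocks are handed out on demand and returned when a sequence finishes, memory freed by one request can be reused by another immediately, which is what makes this scheme a natural fit for inflight batching.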

TensorRT-LLM provides a Python API to build LLMs into optimized TensorRT engines. It contains runtimes in Python (bindings) and C++ to execute those TensorRT engines. It also includes a backend for integration with the NVIDIA Triton Inference Server. Models built with TensorRT-LLM can be executed on a wide range of configurations, from a single GPU to multiple nodes with multiple GPUs (using Tensor Parallelism and/or Pipeline Parallelism).
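As a rough illustration of how Tensor Parallelism and Pipeline Parallelism combine, the sketch below enumerates a rank layout for a given pair of sizes. The function `rank_layout` is a hypothetical helper, not part of the TensorRT-LLM API. It assumes one rank per GPU, a world size equal to the product of the two sizes, and tensor-parallel ranks numbered contiguously within each pipeline stage; the library's actual rank assignment may differ.

```python
def rank_layout(tp_size, pp_size):
    """Map each global rank to a (pipeline stage, tensor-parallel rank) pair.

    Assumption: world_size = tp_size * pp_size, with tensor-parallel ranks
    contiguous inside each pipeline stage.
    """
    if tp_size < 1 or pp_size < 1:
        raise ValueError("tp_size and pp_size must be positive")
    world_size = tp_size * pp_size
    return [
        {"rank": rank, "pp_rank": rank // tp_size, "tp_rank": rank % tp_size}
        for rank in range(world_size)
    ]


if __name__ == "__main__":
    # Example: 2-way tensor parallelism inside each of 2 pipeline stages (4 GPUs).
    for entry in rank_layout(tp_size=2, pp_size=2):
        print(entry)
```

Under these assumptions, ranks that share a `pp_rank` hold different slices of the same layers, while ranks that share a `tp_rank` hold consecutive groups of layers.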

TensorRT-LLM comes with several popular models pre-defined. They can easily be modified and extended to fit custom needs via a PyTorch-like Python API. Refer to the Support Matrix for a list of supported models.

TensorRT-LLM is built on top of the TensorRT Deep Learning Inference library. It leverages many of TensorRT's deep learning optimizations and adds LLM-specific optimizations on top, as described above. TensorRT is an ahead-of-time compiler; it builds "Engines", which are optimized representations of the compiled model containing the entire execution graph. These engines are optimized for a specific GPU architecture and can be validated, benchmarked, and serialized for later deployment in a production environment.

Getting Started

To get started with TensorRT-LLM, visit our documentation:

Community

  • Model zoo (generated by TRT-LLM rel 0.9 a9356d4b7610330e89c1010f342a9ac644215c52)



