Zars19/TensorRT-LLM


TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.

nvidia.github.io/TensorRT-LLM

License

Apache-2.0 license

Folders and files

.github
3rdparty
benchmarks
cpp
docker
docs
examples
scripts
tensorrt_llm
tests
windows
.clang-format
.dockerignore
.gitattributes
.gitignore
.gitmodules
.pre-commit-config.yaml
LICENSE
README.md
requirements-dev-windows.txt
requirements-dev.txt
requirements-windows.txt
requirements.txt
setup.cfg
setup.py


TensorRT-LLM

A TensorRT Toolbox for Optimized Large Language Model Inference

Architecture | Results | Examples | Documentation

Latest News

[2024/10/07] 🚀🚀🚀Optimizing Microsoft Bing Visual Search with NVIDIA Accelerated Libraries➡️ link

[2024/09/29] 🌟 AI at Meta PyTorch + TensorRT v2.4 🌟 ⚡TensorRT 10.1 ⚡PyTorch 2.4 ⚡CUDA 12.4 ⚡Python 3.12➡️ link
[2024/09/17] ✨ NVIDIA TensorRT-LLM Meetup➡️ link
[2024/09/17] ✨ Accelerating LLM Inference at Databricks with TensorRT-LLM➡️ link
[2024/09/17] ✨ TensorRT-LLM @ Baseten➡️ link
[2024/09/04] 🏎️🏎️🏎️ Best Practices for Tuning TensorRT-LLM for Optimal Serving with BentoML➡️ link
[2024/08/20] 🏎️SDXL with #TensorRT Model Optimizer ⏱️⚡ 🏁 cache diffusion 🏁 quantization aware training 🏁 QLoRA 🏁 #Python 3.12➡️ link
[2024/08/13] 🐍 DIY Code Completion with #Mamba ⚡ #TensorRT #LLM for speed 🤖 NIM for ease ☁️ deploy anywhere➡️ link
[2024/08/06] 🗫 Multilingual Challenge Accepted 🗫🤖 #TensorRT #LLM boosts low-resource languages like Hebrew, Indonesian and Vietnamese ⚡➡️ link
[2024/07/30] Introducing🍊 @SliceXAI ELM Turbo 🤖 train ELM once ⚡ #TensorRT #LLM optimize ☁️ deploy anywhere➡️ link
[2024/07/23] 👀 @AIatMeta Llama 3.1 405B trained on 16K NVIDIA H100s - inference is #TensorRT #LLM optimized ⚡🦙 400 tok/s - per node🦙 37 tok/s - per user🦙 1 node inference➡️ link
[2024/07/09] Checklist to maximize multi-language performance of @meta #Llama3 with #TensorRT #LLM inference:✅ MultiLingual✅ NIM✅ LoRA tuned adaptors➡️ Tech blog
[2024/07/02] Let the @MistralAI MoE tokens fly 📈 🚀 #Mixtral 8x7B with NVIDIA #TensorRT #LLM on #H100.➡️ Tech blog

Previous News

[2024/06/24] Enhanced with NVIDIA #TensorRT #LLM, @upstage.ai’s solar-10.7B-instruct is ready to power your developer projects through our API catalog 🏎️. ✨➡️ link
[2024/06/18] CYMI: 🤩 Stable Diffusion 3 dropped last week 🎊 🏎️ Speed up your SD3 with #TensorRT INT8 Quantization➡️ link
[2024/06/18] 🧰Deploying ComfyUI with TensorRT? Here’s your setup guide➡️ link
[2024/06/11] ✨#TensorRT Weight-Stripped Engines ✨Technical Deep Dive for serious coders ✅+99% compression ✅1 set of weights → ** GPUs ✅0 performance loss ✅** models…LLM, CNN, etc.➡️ link
[2024/06/04] ✨ #TensorRT and GeForce #RTX unlock ComfyUI SD superhero powers 🦸⚡ 🎥 Demo:➡️ link📗 DIY notebook:➡️ link
[2024/05/28] ✨#TensorRT weight stripping for ResNet-50 ✨ ✅+99% compression ✅1 set of weights → ** GPUs ✅0 performance loss ✅** models…LLM, CNN, etc👀 📚 DIY➡️ link
[2024/05/21] ✨@modal_labs has the codes for serverless @AIatMeta Llama 3 on #TensorRT #LLM ✨👀 📚 Marvelous Modal Manual:Serverless TensorRT-LLM (LLaMA 3 8B) | Modal Docs➡️ link
[2024/05/08] NVIDIA TensorRT Model Optimizer -- the newest member of the #TensorRT ecosystem is a library of post-training and training-in-the-loop model optimization techniques ✅quantization ✅sparsity ✅QAT➡️ blog
[2024/05/07] 🦙🦙🦙 24,000 tokens per second 🛫Meta Llama 3 takes off with #TensorRT #LLM 📚➡️ link
[2024/02/06] 🚀 Speed up inference with SOTA quantization techniques in TRT-LLM
[2024/01/30] New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget
[2023/12/04] Falcon-180B on a single H200 GPU with INT4 AWQ, and 6.7x faster Llama-70B over A100
[2023/11/27] SageMaker LMI now supports TensorRT-LLM - improves throughput by 60%, compared to previous version
[2023/11/13] H200 achieves nearly 12,000 tok/sec on Llama2-13B
[2023/10/22] 🚀 RAG on Windows using TensorRT-LLM and LlamaIndex 🦙
[2023/10/19] Getting Started Guide - Optimizing Inference on Large Language Models with NVIDIA TensorRT-LLM, Now Publicly Available
[2023/10/17] Large Language Models up to 4x Faster on RTX With TensorRT-LLM for Windows

TensorRT-LLM Overview

TensorRT-LLM is a library for optimizing Large Language Model (LLM) inference. It provides state-of-the-art optimizations, including custom attention kernels, inflight batching, paged KV caching, quantization (FP8, INT4 AWQ, INT8 SmoothQuant, and others), and more, to perform inference efficiently on NVIDIA GPUs.
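To give a feel for one of the optimizations named above, here is a minimal, pure-Python sketch of the idea behind paged KV caching: cache memory is divided into fixed-size blocks, and each sequence is given blocks only as it grows. This is a conceptual toy, not TensorRT-LLM's implementation or API; the class `PagedKVCache` and its methods are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PagedKVCache:
    """Toy block allocator (hypothetical): each sequence owns a list of fixed-size blocks."""
    num_blocks: int
    block_size: int  # tokens per block
    free_blocks: list = field(init=False)
    block_tables: dict = field(default_factory=dict)  # seq_id -> [block ids]
    seq_lens: dict = field(default_factory=dict)      # seq_id -> tokens stored

    def __post_init__(self):
        self.free_blocks = list(range(self.num_blocks))

    def append_token(self, seq_id):
        """Reserve room for one more token, allocating a block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length == len(table) * self.block_size:  # all owned blocks are full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        # Physical location (block id, offset within block) of the new token
        return table[length // self.block_size], length % self.block_size

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token("a")   # 3 tokens need 2 blocks
cache.append_token("b")       # 1 token needs 1 block
used = cache.num_blocks - len(cache.free_blocks)
cache.release("a")            # blocks of "a" become reusable
```

Because blocks are handed out on demand and returned when a sequence finishes, memory is not reserved up front for the maximum sequence length, which is the property that makes paging attractive when many requests of different lengths share one GPU.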

TensorRT-LLM provides a Python API to build LLMs into optimized TensorRT engines. It contains runtimes in Python (bindings) and C++ to execute those TensorRT engines. It also includes a backend for integration with the NVIDIA Triton Inference Server. Models built with TensorRT-LLM can be executed on a wide range of configurations from a single GPU to multiple nodes with multiple GPUs (using Tensor Parallelism and/or Pipeline Parallelism).
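As a rough illustration of the Tensor Parallelism mentioned above, the sketch below splits a weight matrix column-wise into shards, lets each simulated "GPU" compute its slice of the output, and concatenates the slices. It uses plain Python lists and no TensorRT-LLM calls; the helper names (`matmul`, `split_columns`, `tensor_parallel_matmul`) are made up for this example, and real systems also need inter-GPU communication that is omitted here.

```python
def matmul(x, w):
    """Multiply matrix x (m x k) by w (k x n), both given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, parts):
    """Split w column-wise into `parts` equal shards, one per simulated GPU."""
    n = len(w[0])
    assert n % parts == 0, "column count must divide evenly across shards"
    step = n // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def tensor_parallel_matmul(x, w, parts):
    """Each shard computes its slice of the output; slices are then concatenated."""
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
full = matmul(x, w)                               # single-device reference
sharded = tensor_parallel_matmul(x, w, parts=2)   # two simulated devices
```

Both paths produce the same result, which is the point: each device holds only a fraction of the weights, yet the combined output matches the unsharded computation.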

TensorRT-LLM comes with several popular models pre-defined. They can easily be modified and extended to fit custom needs via a PyTorch-like Python API. Refer to the Support Matrix for a list of supported models.

TensorRT-LLM is built on top of the TensorRT Deep Learning Inference library. It leverages much of TensorRT's deep learning optimizations and adds LLM-specific optimizations on top, as described above. TensorRT is an ahead-of-time compiler; it builds "Engines", which are optimized representations of the compiled model containing the entire execution graph. These engines are optimized for a specific GPU architecture, and can be validated, benchmarked, and serialized for later deployment in a production environment.

Getting Started

To get started with TensorRT-LLM, visit our documentation at nvidia.github.io/TensorRT-LLM.

Community

Model zoo (generated by TRT-LLM rel 0.9 a9356d4b7610330e89c1010f342a9ac644215c52)

