tensorrt-llm
Here are 30 public repositories matching this topic...
📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
- Updated Aug 19, 2025 - Python
A nearly-live implementation of OpenAI's Whisper.
- Updated Sep 25, 2025 - Python
An optimized speech-to-text pipeline for the Whisper model, supporting multiple inference engines
- Updated Aug 27, 2024 - Jupyter Notebook
🚀🚀🚀 This repository lists some awesome public CUDA, cuda-python, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, PTX and High Performance Computing (HPC) projects.
- Updated Aug 2, 2025
🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of Optimum's hardware optimizations & quantization schemes.
- Updated Sep 25, 2025 - Python
OpenAI-compatible API for the TensorRT-LLM Triton backend
- Updated Aug 1, 2024 - Rust
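As a hedged illustration of what an OpenAI-compatible API like the one above typically accepts, the sketch below builds a chat-completions request. The address, port, and model name are assumptions for illustration; the actual server's defaults may differ.

```python
# Sketch: constructing an OpenAI-style chat-completions request against
# an assumed local endpoint. Host, port, and model name are hypothetical.
import json
import urllib.request

payload = {
    "model": "ensemble",  # assumed model name for a Triton ensemble
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # assumed address
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment to actually send the request to a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the endpoint follows the OpenAI wire format, existing OpenAI client libraries can usually be pointed at it by overriding the base URL.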
Deep learning deployment framework: supports tf/torch/trt/trtllm/vllm and other NN frameworks, with dynamic batching and streaming modes. Dual-language compatible with Python and C++, offering scalability, extensibility, and high performance. It helps users quickly deploy models and serve them through HTTP/RPC interfaces.
- Updated May 8, 2025 - C++
A pure C++ high-performance OpenAI-compatible LLM service, faster than `vllm serve`, implemented with GRPS + TensorRT-LLM + Tokenizers.cpp. Supports chat and function calling, AI agents, distributed multi-GPU inference, multimodal capabilities, and a Gradio chat interface.
- Updated May 14, 2025 - Python
Chat With RTX Python API
- Updated May 11, 2025 - Python
TensorRT-LLM server with Structured Outputs (JSON) built with Rust
- Updated Apr 25, 2025 - Rust
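To illustrate what a structured-outputs request to a server like the one above might look like, the sketch below builds an OpenAI-style payload that constrains generation to a JSON schema. The model name, schema, and the `response_format` shape are assumptions for illustration; the server's actual structured-output parameters may differ.

```python
# Sketch: an OpenAI-style structured-output request payload.
# Model name and schema are hypothetical examples.
import json

payload = {
    "model": "llama-3",  # assumed model name
    "messages": [
        {"role": "user", "content": "Extract the city from: 'I live in Paris.'"}
    ],
    # Constrain the completion to match a JSON schema, so the response
    # is guaranteed to be parseable JSON of this shape.
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "city_extraction",
            "schema": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    },
}
print(json.dumps(payload, indent=2))
```

Server-side schema enforcement is what makes the output safe to feed directly into downstream code without retry-and-reparse loops.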
A tool for benchmarking LLMs on Modal
- Updated Aug 29, 2025 - Python
Cortex.Tensorrt-LLM is a C++ inference library that can be loaded by any server at runtime. It includes NVIDIA's TensorRT-LLM as a submodule for GPU-accelerated inference on NVIDIA GPUs.
- Updated Sep 26, 2024 - C++
Add-in for the new Outlook that adds LLM-powered features (composition, summarizing, Q&A). It uses a local LLM via NVIDIA TensorRT-LLM.
- Updated Jun 5, 2025 - Python
Getting started with TensorRT-LLM using BLOOM as a case study
- Updated Mar 7, 2024 - Jupyter Notebook
Whisper in TensorRT-LLM
- Updated Sep 21, 2023 - C++
LLM tutorial materials including, but not limited to, NVIDIA NeMo, TensorRT-LLM, Triton Inference Server, and NeMo Guardrails.
- Updated Jun 26, 2025 - Python
MiniMax-01 is a simple implementation of the MiniMax algorithm, a widely used strategy for decision-making in two-player turn-based games like Tic-Tac-Toe. The algorithm aims to minimize the maximum possible loss for the player, making it a popular choice for developing AI opponents in various game scenarios.
- Updated Oct 13, 2025
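The minimax strategy described in the entry above can be sketched in a few lines. The recursive function below is a minimal illustration over a hypothetical game tree, where leaves are payoffs for the maximizing player; a real implementation (for Tic-Tac-Toe, say) would generate moves from a board state instead.

```python
# Minimal minimax sketch: leaves are numeric payoffs for the maximizer,
# internal nodes are lists of child subtrees. Players alternate levels.
def minimax(node, maximizing):
    """Return the best payoff the maximizing player can guarantee."""
    if isinstance(node, (int, float)):  # leaf: terminal payoff
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)

# Maximizer chooses a branch, then the minimizer replies:
# left branch guarantees min(3, 5) = 3, right guarantees min(2, 9) = 2.
tree = [[3, 5], [2, 9]]
print(minimax(tree, True))  # → 3
```

The same skeleton extends to alpha-beta pruning by passing bounds down the recursion and cutting off branches that cannot affect the result.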