tensorrt-llm
Here are 30 public repositories matching this topic...
📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
- Updated Aug 19, 2025 - Python
A nearly-live implementation of OpenAI's Whisper.
- Updated Sep 25, 2025 - Python
An optimized speech-to-text pipeline for the Whisper model, supporting multiple inference engines
- Updated Aug 27, 2024 - Jupyter Notebook
🚀🚀🚀 This repository lists some awesome public CUDA, cuda-python, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, PTX and High Performance Computing (HPC) projects.
- Updated Aug 2, 2025
🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of Optimum's hardware optimizations & quantization schemes.
- Updated Sep 25, 2025 - Python
OpenAI-compatible API for the TensorRT-LLM Triton backend
- Updated Aug 1, 2024 - Rust
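As a hedged illustration of what an OpenAI-compatible API like the one above typically accepts, the sketch below builds a chat-completions request. The address, port, and model name are assumptions for illustration; the actual server's defaults may differ.

```python
# Sketch: constructing an OpenAI-style chat-completions request against
# an assumed local endpoint. Host, port, and model name are hypothetical.
import json
import urllib.request

payload = {
    "model": "ensemble",  # assumed model name for a Triton ensemble
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # assumed address
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment to actually send the request to a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the endpoint follows the OpenAI wire format, existing OpenAI client libraries can usually be pointed at it by overriding the base URL.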
Deep learning deployment framework: supports tf/torch/trt/trtllm/vllm and other NN frameworks, with dynamic batching and streaming modes. Dual-language compatible with Python and C++, offering scalability, extensibility, and high performance. It helps users quickly deploy models and serve them through HTTP/RPC interfaces.
- Updated May 8, 2025 - C++
A pure C++ high-performance OpenAI-compatible LLM service, faster than `vllm serve`, implemented with GRPS + TensorRT-LLM + Tokenizers.cpp. Supports chat and function calling, AI agents, distributed multi-GPU inference, multimodal capabilities, and a Gradio chat interface.
- Updated May 14, 2025 - Python
Chat With RTX Python API
- Updated May 11, 2025 - Python
TensorRT-LLM server with Structured Outputs (JSON) built with Rust
- Updated Apr 25, 2025 - Rust
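To illustrate what a structured-outputs request to a server like the one above might look like, the sketch below builds an OpenAI-style payload that constrains generation to a JSON schema. The model name, schema, and the `response_format` shape are assumptions for illustration; the server's actual structured-output parameters may differ.

```python
# Sketch: an OpenAI-style structured-output request payload.
# Model name and schema are hypothetical examples.
import json

payload = {
    "model": "llama-3",  # assumed model name
    "messages": [
        {"role": "user", "content": "Extract the city from: 'I live in Paris.'"}
    ],
    # Constrain the completion to match a JSON schema, so the response
    # is guaranteed to be parseable JSON of this shape.
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "city_extraction",
            "schema": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    },
}
print(json.dumps(payload, indent=2))
```

Server-side schema enforcement is what makes the output safe to feed directly into downstream code without retry-and-reparse loops.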
A tool for benchmarking LLMs on Modal
- Updated Aug 29, 2025 - Python
Cortex.Tensorrt-LLM is a C++ inference library that can be loaded by any server at runtime. It includes NVIDIA's TensorRT-LLM as a submodule for GPU-accelerated inference on NVIDIA GPUs.
- Updated Sep 26, 2024 - C++
Add-in for the new Outlook that adds LLM-powered features (composition, summarizing, Q&A). It uses a local LLM via NVIDIA TensorRT-LLM.
- Updated Jun 5, 2025 - Python
Getting started with TensorRT-LLM using BLOOM as a case study
- Updated Mar 7, 2024 - Jupyter Notebook
Whisper in TensorRT-LLM
- Updated Sep 21, 2023 - C++
LLM tutorial materials including, but not limited to, NVIDIA NeMo, TensorRT-LLM, Triton Inference Server, and NeMo Guardrails.
- Updated Jun 26, 2025 - Python
MiniMax-01 is a simple implementation of the MiniMax algorithm, a widely used strategy for decision-making in two-player turn-based games like Tic-Tac-Toe. The algorithm aims to minimize the maximum possible loss for the player, making it a popular choice for developing AI opponents in various game scenarios.
- Updated Oct 13, 2025
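The minimax strategy described in the entry above can be sketched in a few lines. The recursive function below is a minimal illustration over a hypothetical game tree, where leaves are payoffs for the maximizing player; a real implementation (for Tic-Tac-Toe, say) would generate moves from a board state instead.

```python
# Minimal minimax sketch: leaves are numeric payoffs for the maximizer,
# internal nodes are lists of child subtrees. Players alternate levels.
def minimax(node, maximizing):
    """Return the best payoff the maximizing player can guarantee."""
    if isinstance(node, (int, float)):  # leaf: terminal payoff
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)

# Maximizer chooses a branch, then the minimizer replies:
# left branch guarantees min(3, 5) = 3, right guarantees min(2, 9) = 2.
tree = [[3, 5], [2, 9]]
print(minimax(tree, True))  # → 3
```

The same skeleton extends to alpha-beta pruning by passing bounds down the recursion and cutting off branches that cannot affect the result.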