llm-serving
Here are 109 public repositories matching this topic...
A high-throughput and memory-efficient inference and serving engine for LLMs
Updated Jul 18, 2025 - Python
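This is vLLM's tagline. For context, a minimal sketch of its offline batch-inference Python API follows; the model id and sampling settings are illustrative choices, not part of the listing.

```python
# Minimal sketch of offline batch inference with vLLM's Python API.
# Model id and sampling settings are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # loads the model weights
params = SamplingParams(temperature=0.8, max_tokens=64)

# generate() batches the prompts and schedules them with continuous batching
outputs = llm.generate(["What is PagedAttention?"], params)
for out in outputs:
    print(out.outputs[0].text)
```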
Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
Updated Jul 18, 2025 - Python
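Since Ray's value here is its distributed runtime, a minimal sketch of its core remote-task API follows; the toy workload is an illustrative assumption.

```python
# Minimal sketch of Ray's remote-task API with a toy workload.
import ray

ray.init()  # start (or connect to) a local Ray runtime

@ray.remote
def square(x):
    return x * x

# Tasks run in parallel across the cluster; ray.get blocks on the futures.
futures = [square.remote(i) for i in range(8)]
print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49]
```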
This project shares the technical principles behind large language models along with hands-on experience (LLM engineering and real-world LLM application deployment).
Updated Jul 10, 2025 - HTML
SGLang is a fast serving framework for large language models and vision language models.
Updated Jul 18, 2025 - Python
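A sketch of SGLang's frontend DSL talking to a locally launched server; the model id, port, and prompt are illustrative assumptions.

```python
# Sketch of SGLang's frontend DSL; assumes a server was started separately, e.g.:
#   python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000
import sglang as sgl

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def qa(s, question):
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=64))

state = qa.run(question="What is RadixAttention?")
print(state["answer"])
```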
Run any open-source LLM, such as DeepSeek and Llama, as an OpenAI-compatible API endpoint in the cloud.
Updated Jul 14, 2025 - Python
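This describes OpenLLM. Because the server speaks the OpenAI protocol, any OpenAI client can call it; a minimal sketch follows, with the serve command, model id, and port as illustrative assumptions.

```python
# Sketch of calling an OpenLLM server through its OpenAI-compatible endpoint,
# e.g. after something like `openllm serve llama3.2:1b` (model id and port
# are illustrative assumptions).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:3000/v1", api_key="na")
resp = client.chat.completions.create(
    model="llama3.2:1b",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```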
TensorRT-LLM provides an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations for efficient inference on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that orchestrate inference execution in a performant way.
Updated Jul 18, 2025 - C++
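A minimal sketch of TensorRT-LLM's high-level Python LLM API, assuming a recent tensorrt_llm build and an NVIDIA GPU; the model id and sampling settings are illustrative.

```python
# Sketch of TensorRT-LLM's high-level LLM API; the engine is compiled on
# first load. Model id and sampling settings are illustrative assumptions.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
outputs = llm.generate(
    ["What does kernel fusion buy you?"],
    SamplingParams(max_tokens=64, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```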
SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 16+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
Updated Jul 18, 2025 - Python
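A sketch of launching a GPU job with SkyPilot's Python API; the accelerator type, cluster name, and commands are illustrative assumptions.

```python
# Sketch of SkyPilot's Python API; accelerator, cluster name, and the
# serving command are illustrative assumptions.
import sky

task = sky.Task(
    setup="pip install vllm",
    run="python -m vllm.entrypoints.openai.api_server --model $MODEL",
    envs={"MODEL": "meta-llama/Llama-3.1-8B-Instruct"},
)
task.set_resources(sky.Resources(accelerators="A100:1"))

# SkyPilot picks a cloud/region with availability and provisions the cluster.
sky.launch(task, cluster_name="llm-serve")
```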
The easiest way to serve AI apps and models: build model inference APIs, job queues, LLM apps, multi-model pipelines, and more!
Updated Jul 18, 2025 - Python
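This is BentoML's tagline. A minimal sketch of a service definition in the BentoML 1.2+ style follows; the model and endpoint shape are illustrative assumptions.

```python
# Sketch of a BentoML (1.2+) service; model and endpoint are illustrative.
# Serve locally with: bentoml serve service:Summarizer
import bentoml
from transformers import pipeline

@bentoml.service
class Summarizer:
    def __init__(self) -> None:
        self.pipe = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

    @bentoml.api
    def summarize(self, text: str) -> str:
        return self.pipe(text)[0]["summary_text"]
```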
Superduper: End-to-end framework for building custom AI applications and agents.
Updated Jul 16, 2025 - Python
High-performance Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle
Updated Jul 18, 2025 - Python
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
Updated May 21, 2025 - Python
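This matches LoRAX. A sketch of selecting a LoRA adapter per request over its TGI-style REST API follows; the URL, adapter id, and payload fields are assumptions based on its generate endpoint.

```python
# Sketch of a per-request adapter selection against a multi-LoRA server in
# the style of LoRAX; URL, adapter id, and payload fields are assumptions.
import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Classify the sentiment: I loved this movie.",
        "parameters": {"max_new_tokens": 32, "adapter_id": "my-org/sst2-lora"},
    },
)
print(resp.json()["generated_text"])
```

Serving many adapters on one base model is the point: the server keeps the base weights resident and swaps lightweight LoRA deltas per request.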
Simple, scalable AI model deployment on GPU clusters
Updated Jul 18, 2025 - Python
AICI: Prompts as (Wasm) Programs
Updated Jan 22, 2025 - Rust
MoBA: Mixture of Block Attention for Long-Context LLMs
Updated Apr 3, 2025 - Python
High-performance inference framework for large language models, focusing on efficiency, flexibility, and availability.
Updated Jul 15, 2025 - Python
A highly optimized LLM inference acceleration engine for Llama and its variants.
Updated Jul 10, 2025 - C++
Community-maintained hardware plugin for vLLM on Ascend
Updated Jul 18, 2025 - Python
A high-performance ML model serving framework that offers dynamic batching and CPU/GPU pipelines to fully exploit your compute resources
Updated Jul 11, 2025 - Python
A throughput-oriented high-performance serving framework for LLMs
Updated Jul 9, 2025 - Jupyter Notebook