# llm-serving

Here are 127 public repositories matching this topic...

A high-throughput and memory-efficient inference and serving engine for LLMs

  • Updated Dec 17, 2025
  • Python

本项目旨在分享大模型相关技术原理以及实战经验(大模型工程化、大模型应用落地)

  • Updated Dec 3, 2025
  • HTML

SGLang is a fast serving framework for large language models and vision language models.

  • Updated Dec 17, 2025
  • Python

TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in a performant way.

  • Updated Dec 17, 2025
  • Python

Run any open-source LLM, such as DeepSeek or Llama, as an OpenAI-compatible API endpoint in the cloud.

  • Updated Dec 15, 2025
  • Python
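Several of the engines on this page expose an OpenAI-compatible HTTP API, so existing OpenAI client code can target a self-hosted deployment just by changing the base URL. A minimal stdlib-only sketch of building and sending such a request (the base URL and model name below are placeholders, not values from this page):

```python
import json
from urllib import request


def chat_completion_payload(model: str, user_message: str,
                            temperature: float = 0.7) -> dict:
    """Build a request body for the OpenAI-compatible
    /v1/chat/completions route."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
    }


def post_chat(base_url: str, payload: dict) -> bytes:
    """POST the payload to a running OpenAI-compatible server
    (e.g. one started locally by a serving engine from this list)."""
    req = request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return resp.read()


# Build a request; actually sending it requires a live server, e.g.:
# post_chat("http://localhost:8000", payload)
payload = chat_completion_payload("my-model", "Hello!")
```

Because the request and response schemas match OpenAI's, the official `openai` client also works against these servers by setting its `base_url` to the local endpoint.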

Run, manage, and scale AI workloads on any AI infrastructure. Use one system to access & manage all AI compute (Kubernetes, 20+ clouds, or on-prem).

  • Updated Dec 17, 2025
  • Python
BentoML

The easiest way to serve AI apps and models: build model inference APIs, job queues, LLM apps, multi-model pipelines, and more!

  • Updated Dec 15, 2025
  • Python
superduper

High-performance Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle

  • Updated Dec 17, 2025
  • Python

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

  • Updated May 21, 2025
  • Python

MoBA: Mixture of Block Attention for Long-Context LLMs

  • Updated Apr 3, 2025
  • Python

Community maintained hardware plugin for vLLM on Ascend

  • Updated Dec 17, 2025
  • Python

High-performance inference framework for large language models, focusing on efficiency, flexibility, and availability.

  • Updated Dec 17, 2025
  • Python

RayLLM - LLMs on Ray (archived). See the README for more information.

  • Updated Mar 13, 2025

Parallax is a distributed model serving framework that lets you build your own AI cluster anywhere.

  • Updated Dec 17, 2025
  • Python

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.

  • Updated Dec 17, 2025
  • C++

A throughput-oriented high-performance serving framework for LLMs

  • Updated Oct 29, 2025
  • Jupyter Notebook


