⚠️ Notice: Limited Maintenance

This project is no longer actively maintained. While existing releases remain available, there are no planned updates, bug fixes, new features, or security patches. Users should be aware that vulnerabilities may not be addressed.

TorchServe now ships with token authorization enabled and model API control disabled by default. These security features are intended to address the risk of unauthorized API calls and to prevent potentially malicious code from being introduced to the model server. Refer to the following documentation for more information: Token Authorization, Model API control.
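A hedged sketch of what these defaults mean in practice: on recent releases, TorchServe writes per-instance keys to a `key_file.json` in its working directory at startup, and clients must send the matching key as a bearer token. The file name, header format, and the `bert` model below are taken from or assumed per the Token Authorization docs; verify them for your release.

```bash
# Start TorchServe with the secure defaults (token auth on, model API control off).
# Token auth can be explicitly opted out of with --disable-token-auth.
torchserve --start --model-store model_store --models bert.mar

# Startup writes inference/management keys to key_file.json in the working
# directory; pass the inference key as a bearer token on each request.
curl -H "Authorization: Bearer <inference key from key_file.json>" \
  http://127.0.0.1:8080/predictions/bert -T input.txt
```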

TorchServe


TorchServe is a flexible and easy-to-use tool for serving and scaling PyTorch models in production.

Requires Python >= 3.8.

With the server running and a model registered under the name `bert` (see the quick start below), inference is a single HTTP call:

```bash
curl http://127.0.0.1:8080/predictions/bert -T input.txt
```

🚀 Quick start with TorchServe

```bash
# Install dependencies
python ./ts_scripts/install_dependencies.py

# Include dependencies for accelerator support with the relevant optional flags
python ./ts_scripts/install_dependencies.py --rocm=rocm61
python ./ts_scripts/install_dependencies.py --cuda=cu121

# Latest release
pip install torchserve torch-model-archiver torch-workflow-archiver

# Nightly build
pip install torchserve-nightly torch-model-archiver-nightly torch-workflow-archiver-nightly
```
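Once the packages are installed, serving a model takes two steps: package it into a `.mar` archive with `torch-model-archiver`, then point `torchserve` at a model store containing the archive. A minimal sketch, where `bert.pt` and the `text_classifier` handler are placeholder choices for illustration; substitute your own serialized model and handler:

```bash
# Package a serialized model into a model archive (.mar).
# bert.pt is an assumed checkpoint; text_classifier is one of
# TorchServe's built-in handlers.
mkdir -p model_store
torch-model-archiver --model-name bert \
  --version 1.0 \
  --serialized-file bert.pt \
  --handler text_classifier \
  --export-path model_store

# Start the server and register the archive from the model store.
torchserve --start --model-store model_store --models bert=bert.mar
```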

🚀 Quick start with TorchServe (conda)

```bash
# Install dependencies
python ./ts_scripts/install_dependencies.py

# Include dependencies for accelerator support with the relevant optional flags
python ./ts_scripts/install_dependencies.py --rocm=rocm61
python ./ts_scripts/install_dependencies.py --cuda=cu121

# Latest release
conda install -c pytorch torchserve torch-model-archiver torch-workflow-archiver

# Nightly build
conda install -c pytorch-nightly torchserve torch-model-archiver torch-workflow-archiver
```

Getting started guide

🐳 Quick Start with Docker

```bash
# Latest release
docker pull pytorch/torchserve

# Nightly build
docker pull pytorch/torchserve-nightly
```
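To run the pulled image, a sketch along these lines should work, assuming the official image's defaults: inference on port 8080, management on 8081, and a model store expected at `/home/model-server/model-store` inside the container:

```bash
# Expose the default inference (8080) and management (8081) ports and
# mount a local model store into the image's expected location.
docker run --rm -it \
  -p 8080:8080 -p 8081:8081 \
  -v $(pwd)/model_store:/home/model-server/model-store \
  pytorch/torchserve
```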

Refer to torchserve docker for details.

🤖 Quick Start LLM Deployment

VLLM Engine

```bash
# Make sure to install torchserve with pip or conda as described above
# and login with `huggingface-cli login`
python -m ts.llm_launcher --model_id meta-llama/Llama-3.2-3B-Instruct --disable_token_auth

# Try it out
curl -X POST -d '{"model":"meta-llama/Llama-3.2-3B-Instruct", "prompt":"Hello, my name is", "max_tokens": 200}' \
  --header "Content-Type: application/json" \
  "http://localhost:8080/predictions/model/1.0/v1/completions"
```

TRT-LLM Engine

```bash
# Make sure to install torchserve with python venv as described above
# and login with `huggingface-cli login`
# pip install -U --use-deprecated=legacy-resolver -r requirements/trt_llm.txt
python -m ts.llm_launcher --model_id meta-llama/Meta-Llama-3.1-8B-Instruct --engine trt_llm --disable_token_auth

# Try it out
curl -X POST -d '{"prompt":"count from 1 to 9 in french ", "max_tokens": 100}' \
  --header "Content-Type: application/json" \
  "http://localhost:8080/predictions/model"
```

🚢 Quick Start LLM Deployment with Docker

```bash
# export token=<HUGGINGFACE_HUB_TOKEN>
docker build --pull . -f docker/Dockerfile.vllm -t ts/vllm

docker run --rm -ti --shm-size 10g --gpus all \
  -e HUGGING_FACE_HUB_TOKEN=$token \
  -p 8080:8080 -v data:/data ts/vllm \
  --model_id meta-llama/Meta-Llama-3-8B-Instruct --disable_token_auth

# Try it out
curl -X POST -d '{"model":"meta-llama/Meta-Llama-3-8B-Instruct", "prompt":"Hello, my name is", "max_tokens": 200}' \
  --header "Content-Type: application/json" \
  "http://localhost:8080/predictions/model/1.0/v1/completions"
```

Refer to LLM deployment for details and other methods.

⚡ Why TorchServe

🤔 How does TorchServe work

🏆 Highlighted Examples

For more examples

🛡️ TorchServe Security Policy

SECURITY.md

🤓 Learn More

https://pytorch.org/serve

🫂 Contributing

We welcome all contributions!

To learn more about how to contribute, see the contributor guide here.

📰 News

💖 All Contributors

Made with contrib.rocks.

⚖️ Disclaimer

This repository is jointly operated and maintained by Amazon, Meta, and a number of individual contributors listed in the CONTRIBUTORS file. For questions directed at Meta, please send an email to opensource@fb.com. For questions directed at Amazon, please send an email to torchserve@amazon.com. For all other questions, please open up an issue in this repository here.

TorchServe acknowledges the Multi Model Server (MMS) project from which it was derived.

