
The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!

bentoml.com

License

Apache-2.0 license


Unified Model Serving Framework

🍱 Build model inference APIs and multi-model serving systems with any open-source or custom AI models. 👉 Join our Slack community!

What is BentoML?

BentoML is a Python library for building online serving systems optimized for AI apps and model inference.

🍱 Easily build APIs for Any AI/ML Model. Turn any model inference script into a REST API server with just a few lines of code and standard Python type hints.
🐳 Docker Containers made simple. No more dependency hell! Manage your environments, dependencies and model versions with a simple config file. BentoML automatically generates Docker images, ensures reproducibility, and simplifies how you deploy to different environments.
🧭 Maximize CPU/GPU utilization. Build high-performance inference APIs leveraging built-in serving optimization features like dynamic batching, model parallelism, multi-stage pipelines, and multi-model inference-graph orchestration.
👩‍💻 Fully customizable. Easily implement your own APIs or task queues, with custom business logic, model inference and multi-model composition. Supports any ML framework, modality, and inference runtime.
🚀 Ready for Production. Develop, run, and debug locally. Seamlessly deploy to production with Docker containers or BentoCloud.
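Dynamic batching groups requests that arrive close together into a single model call, trading a short wait for higher throughput. The sketch below is a generic asyncio illustration of that idea, not BentoML's implementation; `MicroBatcher`, `max_batch_size`, and `max_latency_s` are hypothetical names (in BentoML itself you opt in per API, for example with `batchable=True`).

```python
import asyncio
from typing import Callable


class MicroBatcher:
    """Hypothetical dynamic batcher: groups concurrent requests into one call."""

    def __init__(
        self,
        batch_fn: Callable[[list[str]], list[str]],
        max_batch_size: int = 8,
        max_latency_s: float = 0.01,
    ) -> None:
        self.batch_fn = batch_fn            # processes a whole batch at once
        self.max_batch_size = max_batch_size
        self.max_latency_s = max_latency_s  # how long to wait for more requests
        self.queue: "asyncio.Queue[tuple[str, asyncio.Future]]" = asyncio.Queue()
        self.batch_sizes: list[int] = []    # recorded for inspection only

    async def submit(self, item: str) -> str:
        # Each caller enqueues one input and awaits its own result.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def run(self) -> None:
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]  # block until a request arrives
            deadline = loop.time() + self.max_latency_s
            # Keep collecting until the batch is full or the wait budget is spent.
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            self.batch_sizes.append(len(batch))
            # One call for the whole batch (a blocking call here, for simplicity).
            outputs = self.batch_fn([item for item, _ in batch])
            # Route each output back to the request it came from, in order.
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)


async def demo() -> tuple[list[str], list[int]]:
    batcher = MicroBatcher(lambda xs: [x.upper() for x in xs], max_batch_size=4, max_latency_s=0.05)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(w) for w in ["a", "b", "c", "d", "e"]))
    worker.cancel()
    try:
        await worker
    except asyncio.CancelledError:
        pass
    return list(results), batcher.batch_sizes
```

Running `asyncio.run(demo())` returns `(['A', 'B', 'C', 'D', 'E'], [4, 1])`: the first four requests share one batch, and the fifth is flushed on its own once the 50 ms wait expires. The sketch assumes `batch_fn` returns exactly one output per input; a production batcher would also need error propagation and non-blocking model calls.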

Getting started

Install BentoML:

```bash
# Requires Python≥3.9
pip install -U bentoml
```

Define APIs in a service.py file.

```python
import bentoml


@bentoml.service(
    image=bentoml.images.Image(python_version="3.11").python_packages("torch", "transformers"),
)
class Summarization:
    def __init__(self) -> None:
        import torch
        from transformers import pipeline

        # Use a GPU when one is available, otherwise fall back to CPU.
        device = "cuda" if torch.cuda.is_available() else "cpu"
        self.pipeline = pipeline('summarization', device=device)

    @bentoml.api(batchable=True)
    def summarize(self, texts: list[str]) -> list[str]:
        results = self.pipeline(texts)
        return [item['summary_text'] for item in results]
```
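With `batchable=True`, BentoML may combine inputs from several concurrent requests into one `summarize` call and split the returned list back among them, which is why the method takes and returns lists. To exercise that logic without downloading a model, you could inject a stub in place of the Transformers pipeline. A minimal sketch; `FakeSummarizationPipeline` and the standalone `summarize` function are hypothetical test helpers that only mimic the `[{'summary_text': ...}]` shape the service code relies on.

```python
class FakeSummarizationPipeline:
    """Hypothetical stub mimicking a summarization pipeline's output shape."""

    def __call__(self, texts: list[str]) -> list[dict[str, str]]:
        # One {"summary_text": ...} dict per input; here, just the first sentence.
        return [{"summary_text": t.split(".")[0] + "."} for t in texts]


def summarize(pipeline, texts: list[str]) -> list[str]:
    # Same logic as Summarization.summarize, with the pipeline passed in.
    results = pipeline(texts)
    summaries = [item["summary_text"] for item in results]
    # A batchable API must return exactly one output per input, in input order,
    # so each result can be routed back to the request it came from.
    if len(summaries) != len(texts):
        raise ValueError(f"expected {len(texts)} outputs, got {len(summaries)}")
    return summaries


print(summarize(FakeSummarizationPipeline(), ["BentoML serves models. It also builds images.", "Short one."]))
# ['BentoML serves models.', 'Short one.']
```

Injecting the pipeline keeps the batching contract testable in isolation; the real service still constructs the pipeline in `__init__`.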

💻 Run locally

Install the PyTorch and Transformers packages into your Python virtual environment.

```bash
pip install torch transformers  # additional dependencies for local run
```

Run the service code locally (serving at http://localhost:3000 by default):

```bash
bentoml serve
```

You should see output similar to the following:

```
[INFO] [cli] Starting production HTTP BentoServer from "service:Summarization" listening on http://localhost:3000 (Press CTRL+C to quit)
[INFO] [entry_service:Summarization:1] Service Summarization initialized
```
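The `service:Summarization` in the log is a `module:attribute` target: the `Summarization` class defined in `service.py`. You can also pass such a target explicitly, as in `bentoml serve service:Summarization`. As a rough illustration of how a string of this form maps to a Python object, here is a generic `importlib` sketch; `load_target` is a hypothetical helper, not BentoML's own loader.

```python
import importlib


def load_target(target: str):
    """Resolve a 'module:attribute' string such as 'service:Summarization'."""
    module_name, sep, attr_path = target.partition(":")
    if not sep or not module_name or not attr_path:
        raise ValueError(f"expected 'module:attribute', got {target!r}")
    obj = importlib.import_module(module_name)
    # Walk dotted attribute paths, e.g. 'pkg.mod:Outer.Inner'.
    for attr in attr_path.split("."):
        obj = getattr(obj, attr)
    return obj
```

For example, `load_target("json:JSONDecoder")` returns the `json.JSONDecoder` class; run from the project directory, `load_target("service:Summarization")` would import `service.py` and return the decorated service.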

Now you can run inference from your browser at http://localhost:3000 or with a Python script:

```python
import bentoml

with bentoml.SyncHTTPClient('http://localhost:3000') as client:
    summarized_text: str = client.summarize([bentoml.__doc__])[0]
    print(f"Result: {summarized_text}")
```
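`SyncHTTPClient` talks to the server over plain HTTP, so any HTTP client can call the same API. A standard-library sketch, assuming the `summarize` API is served at `POST /summarize`, that the JSON request body maps parameter names to values (`{"texts": [...]}`), and that the response is a JSON array of strings; check these against your running service before relying on them. `build_summarize_request` and `summarize_over_http` are hypothetical helpers.

```python
import json
import urllib.request


def build_summarize_request(base_url: str, texts: list[str]) -> urllib.request.Request:
    # Assumed wire format: JSON object keyed by the API's parameter name.
    body = json.dumps({"texts": texts}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/summarize",  # assumed route: /<api name>
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize_over_http(base_url: str, texts: list[str], timeout: float = 60.0) -> list[str]:
    # Sends the request and decodes the (assumed) JSON array response.
    req = build_summarize_request(base_url, texts)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server from the previous step running, `summarize_over_http("http://localhost:3000", ["Some long text..."])` would return a one-element list of summaries.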

🐳 Deploy using Docker

Run `bentoml build` to package the necessary code, models, and dependency configs into a Bento, the standardized deployable artifact in BentoML:

```bash
bentoml build
```

Ensure Docker is running, then generate a Docker container image for deployment:

```bash
bentoml containerize summarization:latest
```

Run the generated image:

```bash
docker run --rm -p 3000:3000 summarization:latest
```
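The service loads its model in `__init__` at startup, so the container may need some time before it accepts requests. A minimal polling sketch using only the standard library, assuming the server exposes a readiness endpoint at `/readyz` (verify the health-check paths against the BentoML documentation for your version); `wait_until_ready` is a hypothetical helper.

```python
import http.client
import time
import urllib.request


def wait_until_ready(url: str, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll `url` until it answers HTTP 200, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            # Connection refused, timeouts and HTTP error statuses (URLError and
            # HTTPError are OSError subclasses) all mean "not ready yet".
            pass
        time.sleep(interval)
    return False
```

For example, `wait_until_ready("http://localhost:3000/readyz", timeout=120)` returns `True` once the endpoint answers with HTTP 200, which is a convenient gate before sending traffic in scripts or CI jobs.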

☁️ Deploy on BentoCloud

BentoCloud provides compute infrastructure for rapid and reliable GenAI adoption. It speeds up your BentoML development process by leveraging cloud compute resources, and simplifies how you deploy, scale, and operate BentoML in production.

```bash
# After signup, run the following command to create an API token:
bentoml cloud login
# Deploy from current directory:
bentoml deploy
```

For detailed explanations, read the Hello World example.

Examples

LLMs: Llama 3.2, Mistral, DeepSeek Distil, and more.
Image Generation: Stable Diffusion 3 Medium, Stable Video Diffusion, Stable Diffusion XL Turbo, ControlNet, and LCM LoRAs.
Embeddings: SentenceTransformers and ColPali
Audio: ChatTTS, XTTS, WhisperX, Bark
Computer Vision: YOLO and ResNet
Advanced examples: Function calling, LangGraph, CrewAI

Check out the full list for more sample code and usage.

Advanced topics

See the Documentation for more tutorials and guides.

Community

Get involved and join our Community Slack 💬, where thousands of AI/ML engineers help each other, contribute to the project, and talk about building AI products.

To report a bug or request a feature, use GitHub Issues.

Contributing

There are many ways to contribute to the project:

Report bugs and give a "Thumbs up" to issues that are relevant to you.
Investigate issues and review other developers' pull requests.
Contribute code or documentation to the project by submitting a GitHub pull request.
Check out the Contributing Guide and Development Guide to learn more.
Share your feedback and discuss roadmap plans in the #bentoml-contributors channel.

Thanks to all of our amazing contributors!

Usage tracking and feedback

The BentoML framework collects anonymous usage data that helps our community improve the product. Only BentoML's internal API calls are reported. This excludes any sensitive information, such as user code, model data, model names, or stack traces. Here's the code used for usage tracking. You can opt out of usage tracking with the `--do-not-track` CLI option.