bentoml/OpenLLM

Run any open-source LLM, such as DeepSeek and Llama, as an OpenAI-compatible API endpoint in the cloud.
OpenLLM allows developers to run any open-source LLM (Llama 3.3, Qwen2.5, Phi3 and more) or custom models as OpenAI-compatible APIs with a single command. It features a built-in chat UI, state-of-the-art inference backends, and a simplified workflow for creating enterprise-grade cloud deployments with Docker, Kubernetes, and BentoCloud.
Understand the design philosophy of OpenLLM.
Run the following commands to install OpenLLM and explore it interactively.

```bash
pip install openllm  # or pip3 install openllm
openllm hello
```
OpenLLM supports a wide range of state-of-the-art open-source LLMs. You can also add a model repository to run custom models with OpenLLM.
Model | Parameters | Required GPU | Start a Server |
---|---|---|---|
deepseek | r1-671b | 80Gx16 | openllm serve deepseek:r1-671b |
gemma2 | 2b | 12G | openllm serve gemma2:2b |
gemma3 | 3b | 12G | openllm serve gemma3:3b |
jamba1.5 | mini-ff0a | 80Gx2 | openllm serve jamba1.5:mini-ff0a |
llama3.1 | 8b | 24G | openllm serve llama3.1:8b |
llama3.2 | 1b | 24G | openllm serve llama3.2:1b |
llama3.3 | 70b | 80Gx2 | openllm serve llama3.3:70b |
llama4 | 17b16e | 80Gx8 | openllm serve llama4:17b16e |
mistral | 8b-2410 | 24G | openllm serve mistral:8b-2410 |
mistral-large | 123b-2407 | 80Gx4 | openllm serve mistral-large:123b-2407 |
phi4 | 14b | 80G | openllm serve phi4:14b |
pixtral | 12b-2409 | 80G | openllm serve pixtral:12b-2409 |
qwen2.5 | 7b | 24G | openllm serve qwen2.5:7b |
qwen2.5-coder | 3b | 24G | openllm serve qwen2.5-coder:3b |
qwq | 32b | 80G | openllm serve qwq:32b |
For the full model list, see the OpenLLM models repository.
To start an LLM server locally, use the `openllm serve` command and specify the model version.
Note
OpenLLM does not store model weights. A Hugging Face token (HF_TOKEN) is required for gated models.
- Create your Hugging Face token here.
- Request access to the gated model, such as meta-llama/Llama-3.2-1B-Instruct.
- Set your token as an environment variable by running:

```bash
export HF_TOKEN=<your token>
```

```bash
openllm serve llama3.2:1b
```
The server will be accessible at http://localhost:3000, providing OpenAI-compatible APIs for interaction. You can call the endpoints with different frameworks and tools that support OpenAI-compatible APIs. Typically, you may need to specify the following:
- The API host address: By default, the LLM is hosted at http://localhost:3000.
- The model name: The name can be different depending on the tool you use.
- The API key: The key used for client authentication. This is optional.
Here are some examples:
OpenAI Python client

```python
from openai import OpenAI

client = OpenAI(base_url='http://localhost:3000/v1', api_key='na')

# Use the following to get the available models
# model_list = client.models.list()
# print(model_list)

chat_completion = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Explain superconductors like I'm five years old"
        }
    ],
    stream=True,
)

for chunk in chat_completion:
    print(chunk.choices[0].delta.content or "", end="")
```
LlamaIndex

```python
from llama_index.llms.openai import OpenAI

llm = OpenAI(
    api_base="http://localhost:3000/v1",
    model="meta-llama/Llama-3.2-1B-Instruct",
    api_key="dummy",
)
...
```
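Because the server exposes standard OpenAI-compatible HTTP endpoints, you can also call it without any client library. The sketch below is an illustration only: it assumes a server running at the default address http://localhost:3000, and `build_chat_request` is a hypothetical helper (not part of OpenLLM) that shows the shape of a chat completion request.

```python
import json
from urllib import request  # used only by the commented-out network call below

# Hypothetical helper: assemble the URL and JSON payload for a chat completion
# request against an OpenAI-compatible server (default: http://localhost:3000).
def build_chat_request(model: str, prompt: str, base_url: str = "http://localhost:3000"):
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # set True to receive a server-sent event stream
    }
    return f"{base_url}/v1/chat/completions", payload

url, payload = build_chat_request(
    "meta-llama/Llama-3.2-1B-Instruct",
    "Explain superconductors like I'm five years old",
)

# Uncomment to send the request against a running server:
# req = request.Request(url, data=json.dumps(payload).encode(),
#                       headers={"Content-Type": "application/json"})
# with request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

Any HTTP client (curl, httpx, etc.) can send the same payload; only the base URL and model name need to match your server.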
OpenLLM provides a chat UI at the `/chat` endpoint for the launched LLM server at http://localhost:3000/chat.

To start a chat conversation in the CLI, use the `openllm run` command and specify the model version.

```bash
openllm run llama3:8b
```
A model repository in OpenLLM represents a catalog of available LLMs that you can run. OpenLLM provides a default model repository that includes the latest open-source LLMs like Llama 3, Mistral, and Qwen2, hosted at this GitHub repository. To see all available models from the default and any added repository, use:

```bash
openllm model list
```
To ensure your local list of models is synchronized with the latest updates from all connected repositories, run:
```bash
openllm repo update
```
To review a model’s information, run:
```bash
openllm model get llama3.2:1b
```
You can contribute to the default model repository by adding new models that others can use. This involves creating and submitting a Bento of the LLM. For more information, check out this example pull request.
You can add your own repository to OpenLLM with custom models. To do so, follow the format in the default OpenLLM model repository with a `bentos` directory to store custom LLMs. You need to build your Bentos with BentoML and submit them to your model repository.

First, prepare your custom models in a `bentos` directory following the guidelines provided by BentoML to build Bentos. Check out the default model repository for an example and read the Developer Guide for details.
Then, register your custom model repository with OpenLLM:
```bash
openllm repo add <repo-name> <repo-url>
```
Note: Currently, OpenLLM only supports adding public repositories.
OpenLLM supports LLM cloud deployment via BentoML, the unified model serving framework, and BentoCloud, an AI inference platform for enterprise AI teams. BentoCloud provides fully managed infrastructure optimized for LLM inference with autoscaling, model orchestration, observability, and more, allowing you to run any AI model in the cloud.
Sign up for BentoCloud for free and log in. Then, run `openllm deploy` to deploy a model to BentoCloud:

```bash
openllm deploy llama3.2:1b --env HF_TOKEN
```
Note
If you are deploying a gated model, make sure to set HF_TOKEN as an environment variable.
Once the deployment is complete, you can run model inference on the BentoCloud console:

OpenLLM is actively maintained by the BentoML team. Feel free to reach out and join us in our pursuit to make LLMs more accessible and easy to use 👉 Join our Slack community!
As an open-source project, we welcome contributions of all kinds, such as new features, bug fixes, and documentation. Here are some of the ways to contribute:
- Report a bug by creating a GitHub issue.
- Submit a pull request or help review other developers’ pull requests.
- Add an LLM to the OpenLLM default model repository so that other users can run your model. See the pull request template.
- Check out the Developer Guide to learn more.
This project uses the following open-source projects:
- bentoml/bentoml for production-level model serving
- vllm-project/vllm for the production-level LLM inference backend
- blrchen/chatgpt-lite for a fancy web chat UI
- astral-sh/uv for blazing-fast installation of model requirements
We are grateful to the developers and contributors of these projects for their hard work and dedication.