This repository was archived by the owner on Jul 4, 2025. It is now read-only.

feat: vLLM backend #2010

Draft

gau-nernst wants to merge 93 commits into dev from thien/python_engine

Conversation

@gau-nernst (Contributor) commented on Feb 21, 2025 (edited)

Describe Your Changes

High-level design

  • vLLM is an inference engine designed for large-scale serving (many GPUs)
  • cortex will spawn a vLLM subprocess and route requests to it (see the sketch below)
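
A minimal sketch of the routing idea, assuming the spawned `vllm serve` process exposes its OpenAI-compatible API on localhost (vLLM defaults to port 8000 and serves `/v1/chat/completions`); the helper name and the model name are illustrative only, not code from this PR:

```python
import json
import urllib.request

# Assumed address of the spawned `vllm serve` process (vLLM's default port is 8000).
VLLM_URL = "http://127.0.0.1:8000/v1/chat/completions"

def route_chat_completion(body: dict) -> dict:
    """Forward an OpenAI-style chat completion body to vLLM and return its JSON reply."""
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

if __name__ == "__main__":
    reply = route_chat_completion({
        "model": "Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model name
        "messages": [{"role": "user", "content": "Hello"}],
    })
    print(reply["choices"][0]["message"]["content"])
```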

cortex engines install vllm

  • Download uv to cortexcpp/python_engines/bin/uv if uv is not installed
  • (via uv) Set up a venv at cortexcpp/python_engines/envs/vllm/<version>/.venv
  • (via uv) Download vllm and its deps (see the sketch after the note below)
  • Known issues:
    • Progress streaming is not supported (since the download is done via uv instead of DownloadService).
    • It's not async, since we need to wait for the subprocess to finish (perhaps we will need a new SubprocessService in the future which handles async WaitProcess()).
    • Hence, stopping and resuming the download also does not work.

Note:

  • All cached Python packages are stored in cortexcpp/python_engines/cache/uv. The purpose is that when we remove the python_engines folder, we are sure we don't leave anything behind.
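
A sketch of this install flow under the paths above, using uv's `venv` and `pip install` subcommands and its `UV_CACHE_DIR` environment variable. The vLLM version is a placeholder, the download of the uv binary itself is elided, and the blocking `subprocess.run` calls mirror the "not async" limitation noted above:

```python
import os
import subprocess
from pathlib import Path

DATA_DIR = Path("cortexcpp/python_engines")
UV_BIN = DATA_DIR / "bin" / "uv"        # uv binary downloaded here if not already installed
CACHE_DIR = DATA_DIR / "cache" / "uv"   # all cached wheels stay inside python_engines
VLLM_VERSION = "0.7.3"                  # placeholder version
VENV_DIR = DATA_DIR / "envs" / "vllm" / VLLM_VERSION / ".venv"

def install_vllm() -> None:
    # Keep uv's package cache inside python_engines so removing that folder leaves nothing behind.
    env = {**os.environ, "UV_CACHE_DIR": str(CACHE_DIR)}
    # Create the per-version virtual environment.
    subprocess.run([str(UV_BIN), "venv", str(VENV_DIR)], env=env, check=True)
    # Install vllm and its dependencies into that venv (blocking call, no progress streaming).
    subprocess.run(
        [str(UV_BIN), "pip", "install", f"vllm=={VLLM_VERSION}",
         "--python", str(VENV_DIR / "bin" / "python")],
        env=env,
        check=True,
    )
```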

cortex models start <model>

  • Spawn vllm serve (see the sketch below)
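
A sketch of the spawn step, reusing the per-version venv layout from the install section. `vllm serve <model> --port <port>` is vLLM's own CLI; the version, port, and process bookkeeping here are assumptions:

```python
import subprocess
from pathlib import Path

# Placeholder version; the real path comes from whichever engine version was installed.
VENV_DIR = Path("cortexcpp/python_engines/envs/vllm/0.7.3/.venv")

def start_model(model: str, port: int = 8000) -> subprocess.Popen:
    """Spawn `vllm serve` from the engine's venv and return the process handle."""
    vllm_bin = VENV_DIR / "bin" / "vllm"
    # The caller keeps the handle so `cortex models stop <model>` can terminate the process later.
    return subprocess.Popen([str(vllm_bin), "serve", model, "--port", str(port)])
```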

TODO:

  • cortex engines install vllm (TODO: async install in a separate thread)
  • Set default engine variant
  • cortex engines load vllm
  • cortex engines list
  • cortex engines uninstall vllm: delete cortexcpp/python_engines/envs/vllm/<version>
  • cortex pull <model>
  • cortex models list
  • cortex models start <model>: spawn vllm serve
  • cortex models stop <model>
  • cortex ps
  • Chat completion
    • Non-streaming
    • Streaming (see the sketch after this list)
  • Embeddings
  • cortex run
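
For the streaming item above, vLLM's OpenAI-compatible server emits Server-Sent Events (`data: {...}` lines, terminated by `data: [DONE]`) when `"stream": true` is set, so the route can simply relay those lines to the client. A minimal sketch, with the URL and helper name assumed as in the earlier routing example:

```python
import json
import urllib.request
from typing import Iterator

VLLM_URL = "http://127.0.0.1:8000/v1/chat/completions"  # assumed local vLLM address

def stream_chat_completion(body: dict) -> Iterator[str]:
    """Yield each SSE line from vLLM so the caller can forward them unchanged."""
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps({**body, "stream": True}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:               # HTTPResponse iterates line by line
            line = raw.decode().strip()
            if line:                   # relay "data: {...}" lines; the last one is "data: [DONE]"
                yield line
```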

Fixes Issues

Self Checklist

  • Added relevant comments, esp in complex areas
  • Updated docs (for bug fixes / features)
  • Created issues for follow-up changes or refactoring needed

@gau-nernst moved this from Icebox to In Progress in Menlo on Mar 20, 2025
@gau-nernst mentioned this pull request on Mar 22, 2025
Reviewers: No reviews
Assignees: @gau-nernst
Labels: None yet
Projects: Status: In Progress
Development: Successfully merging this pull request may close these issues: vLLM backend for Cortex
3 participants: @gau-nernst, @ramonpzg, @vansangpfiev
