
server: add auto-sleep after N seconds of idle #18228


Open

ngxson wants to merge 5 commits into ggml-org:master from ngxson:xsn/server_sleep

Conversation

@ngxson (Collaborator) commented Dec 20, 2025 (edited)

Sleeping on Idle

The server supports an automatic sleep mode that activates after a specified period of inactivity (no incoming tasks). This feature, introduced in PR #18228, can be enabled using the --sleep-idle-seconds command-line argument. It works seamlessly in both single-model and multi-model configurations.

When the server enters sleep mode, the model and its associated memory (including the KV cache) are unloaded from RAM to conserve resources. Any new incoming task will automatically trigger the model to reload.

Note that the following endpoints are not counted as incoming tasks: they do not trigger a model reload and do not reset the idle timer:

  • GET /health
  • GET /props
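
As an illustration of that exemption, here is a minimal sketch in C++ (a hypothetical helper, not the actual llama.cpp code) of how a request could be classified before it is allowed to wake the server:

```cpp
// A minimal sketch, assuming a hypothetical helper name: only requests that
// count as tasks wake the server and reset the idle timer; GET /health and
// GET /props do neither.
#include <set>
#include <string>

static bool counts_as_task(const std::string & method, const std::string & path) {
    static const std::set<std::string> exempt = {"/health", "/props"};
    return !(method == "GET" && exempt.count(path) > 0);
}
```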

Implementation

The implementation of this feature consists of 3 main parts:

  • server_queue sleeping state
  • server_context sleeping state
  • server_res_generator hook

The main loop inside server_queue acts as a watchdog timer, so we avoid spawning a dedicated thread just for the watchdog. When the idle timeout elapses, it signals server_context to unload the model.
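
A minimal sketch of that watchdog idea, assuming hypothetical member names and a plain condition variable rather than the real server_queue internals:

```cpp
// A minimal sketch, not the actual server_queue implementation: the task
// loop itself doubles as the watchdog by waiting on the queue with a
// timeout instead of blocking forever.
#include <chrono>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>

struct queue_sketch {
    std::deque<int>         tasks;             // pending task ids
    std::mutex              mtx;
    std::condition_variable cv;
    std::chrono::seconds    idle_timeout{60};  // e.g. --sleep-idle-seconds 60
    bool                    sleeping = false;

    // Wait for the next task; if nothing arrives within idle_timeout,
    // enter the sleeping state and ask the context to unload the model.
    int next_task(const std::function<void()> & on_sleep) {
        std::unique_lock<std::mutex> lk(mtx);
        while (tasks.empty()) {
            const bool got_task = cv.wait_for(lk, idle_timeout, [&] { return !tasks.empty(); });
            if (!got_task && !sleeping) {
                sleeping = true;
                on_sleep();  // hypothetical hook: unload model + KV cache
            }
        }
        int id = tasks.front();
        tasks.pop_front();
        return id;
    }
};
```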

server_res_generator hooks into every incoming request and asks server_queue to resume if it is in the sleeping state. Note that some requests, such as /health, bypass this check (they can only access read-only data of server_context).

When asked to resume, server_queue signals server_context to reload the models, then unblocks server_res_generator so it can proceed with the rest of the request.
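
The request-side half of that handshake could look roughly like the sketch below (again with hypothetical names; the actual code coordinates server_res_generator, server_queue and server_context):

```cpp
// A minimal sketch, assuming hypothetical names: a non-exempt request first
// wakes the queue, then blocks until the context reports the model as
// reloaded before the task is actually processed.
#include <condition_variable>
#include <mutex>

struct sleep_state_sketch {
    std::mutex              mtx;
    std::condition_variable cv;
    bool sleeping     = false;  // set by the queue's watchdog
    bool model_loaded = true;   // cleared on unload, set again after reload
};

void wake_if_sleeping(sleep_state_sketch & st) {
    std::unique_lock<std::mutex> lk(st.mtx);
    if (st.sleeping) {
        st.sleeping = false;
        st.cv.notify_all();  // ask the queue/context to start reloading
    }
    // block this request until the model is back in memory
    st.cv.wait(lk, [&] { return st.model_loaded; });
}
```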

@ServeurpersoCom (Collaborator)

Another cool feature! Rebased it on my testing-branch+master to test it out!

@ServeurpersoCom (Collaborator) commented Dec 20, 2025 (edited)

Minimal test as a global:

; llama-server --port 8082 --models-max 1 --models-preset backend.ini --webui-config-file frontend.json

[*]
fit = off  ; Disable automatic memory fitting
ngl = 999  ; Full GPU offload
ctk = q8_0 ; KV cache key quantization
ctv = q8_0 ; KV cache value quantization
fa = on    ; Enable flash attention
mlock = on ; Lock model in RAM
np = 4     ; Parallel request batching
kvu = on   ; Unified KV cache buffer
sleep_idle_seconds = 60 ; Testing

[Dense-Devstral-Small-2-24B-Instruct-2512]
m = unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF/Devstral-Small-2-24B-Instruct-2512-UD-Q6_K_XL.gguf
c = 131072
load-on-startup = 1

[my-other-models]
...

Seems to have no effect: the model stays loaded continuously. Perhaps 'load-on-startup = 1' combined with 'sleep_idle_seconds' doesn't work yet? Same for any models I load manually.

This feature will be useful for my real use case: unloading large MoE models that spill over from VRAM into system RAM.

[MoE-Uncensored-GLM-4.5-Air-Derestricted-106B]
m = bartowski/ArliAI_GLM-4.5-Air-Derestricted-GGUF/ArliAI_GLM-4.5-Air-Derestricted-Q4_K_M-00001-of-00002.gguf
n-cpu-moe = 30
c = 32768
sleep_idle_seconds = 60

I'll try this as well.

@ngxson (Collaborator, Author) commented Dec 20, 2025 (edited)

> Seems to have no effect: the model stays loaded continuously. Perhaps 'load-on-startup = 1' combined with 'sleep_idle_seconds' doesn't work yet? Same for any models I load manually.

this feature does not unload the model instance; it is independent from router mode

instead, monitor your log and you will see log lines like this:

que    start_loop: entering sleeping state
srv  handle_sleep: server is entering sleeping state

we don't unload the whole process because of #18189 (comment)


@ServeurpersoCom (Collaborator) commented Dec 20, 2025 (edited)

> > Seems to have no effect: the model stays loaded continuously. Perhaps 'load-on-startup = 1' combined with 'sleep_idle_seconds' doesn't work yet? Same for any models I load manually.
>
> this feature does not unload the model instance; it is independent from router mode
>
> instead, monitor your log and you will see log lines like this:
>
> que    start_loop: entering sleeping state
> srv  handle_sleep: server is entering sleeping state
>
> we don't unload the whole process because of #18189 (comment)

Got it! I was expecting the child process to be killed, but it's an internal model unload within the process itself.
I'll monitor RSS and look for the "entering/exiting sleeping state" log lines.
Testing now with standalone mode first to validate the feature, then I'll integrate it with router mode. Thanks for the clarification!

The internal sleep approach (keeping process alive) is much cleaner than kill/respawn.
Looking forward to the future "sleep levels" feature to fine-tune which components get unloaded!



Reviewers

@ggerganov: awaiting requested review (code owner)

@ServeurpersoCom: awaiting requested review

At least 1 approving review is required to merge this pull request.

Assignees

No one assigned

Labels

examples, python (python script changes), server

Projects

None yet

Milestone

No milestone


2 participants: @ngxson, @ServeurpersoCom
