
server: add auto-sleep after N seconds of idle #18228


Open

ngxson wants to merge 5 commits into ggml-org:master from ngxson:xsn/server_sleep

Conversation

@ngxson (Collaborator) commented Dec 20, 2025 (edited)

Sleeping on Idle

The server supports an automatic sleep mode that activates after a specified period of inactivity (no incoming tasks). This feature, introduced in PR #18228, can be enabled using the --sleep-idle-seconds command-line argument. It works seamlessly in both single-model and multi-model configurations.

When the server enters sleep mode, the model and its associated memory (including the KV cache) are unloaded from RAM to conserve resources. Any new incoming task will automatically trigger the model to reload.

Note that the following endpoints are not counted as incoming tasks: they do not trigger a model reload and do not reset the idle timer:

  • GET /health
  • GET /props
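
As an illustration of that exemption, here is a minimal sketch in C++ (a hypothetical helper, not the actual llama.cpp code) of how a request could be classified before it is allowed to wake the server:

```cpp
// A minimal sketch, assuming a hypothetical helper name: only requests that
// count as tasks wake the server and reset the idle timer; GET /health and
// GET /props do neither.
#include <set>
#include <string>

static bool counts_as_task(const std::string & method, const std::string & path) {
    static const std::set<std::string> exempt = {"/health", "/props"};
    return !(method == "GET" && exempt.count(path) > 0);
}
```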

Implementation

The implementation of this feature consists of 3 main parts:

  • server_queue sleeping state
  • server_context sleeping state
  • server_res_generator hook

The main loop inside server_queue acts as a watchdog timer, so we avoid spawning a dedicated thread just for the watchdog. When the idle timeout elapses, it signals server_context to unload the model.
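
A minimal sketch of that watchdog idea, assuming hypothetical member names and a plain condition variable rather than the real server_queue internals:

```cpp
// A minimal sketch, not the actual server_queue implementation: the task
// loop itself doubles as the watchdog by waiting on the queue with a
// timeout instead of blocking forever.
#include <chrono>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>

struct queue_sketch {
    std::deque<int>         tasks;             // pending task ids
    std::mutex              mtx;
    std::condition_variable cv;
    std::chrono::seconds    idle_timeout{60};  // e.g. --sleep-idle-seconds 60
    bool                    sleeping = false;

    // Wait for the next task; if nothing arrives within idle_timeout,
    // enter the sleeping state and ask the context to unload the model.
    int next_task(const std::function<void()> & on_sleep) {
        std::unique_lock<std::mutex> lk(mtx);
        while (tasks.empty()) {
            const bool got_task = cv.wait_for(lk, idle_timeout, [&] { return !tasks.empty(); });
            if (!got_task && !sleeping) {
                sleeping = true;
                on_sleep();  // hypothetical hook: unload model + KV cache
            }
        }
        int id = tasks.front();
        tasks.pop_front();
        return id;
    }
};
```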

server_res_generator hooks into every incoming request and asks server_queue to resume if it is in the sleeping state. Note that some requests, such as /health, bypass this check (they can only access read-only data of server_context).

When asked to resume, server_queue signals server_context to reload the models, then unblocks server_res_generator so it can proceed with the rest of the request.
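
The request-side half of that handshake could look roughly like the sketch below (again with hypothetical names; the actual code coordinates server_res_generator, server_queue and server_context):

```cpp
// A minimal sketch, assuming hypothetical names: a non-exempt request first
// wakes the queue, then blocks until the context reports the model as
// reloaded before the task is actually processed.
#include <condition_variable>
#include <mutex>

struct sleep_state_sketch {
    std::mutex              mtx;
    std::condition_variable cv;
    bool sleeping     = false;  // set by the queue's watchdog
    bool model_loaded = true;   // cleared on unload, set again after reload
};

void wake_if_sleeping(sleep_state_sketch & st) {
    std::unique_lock<std::mutex> lk(st.mtx);
    if (st.sleeping) {
        st.sleeping = false;
        st.cv.notify_all();  // ask the queue/context to start reloading
    }
    // block this request until the model is back in memory
    st.cv.wait(lk, [&] { return st.model_loaded; });
}
```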

@ServeurpersoCom (Collaborator)

Another cool feature! Rebased it on my testing-branch+master to test it out!

@ServeurpersoCom (Collaborator) commented Dec 20, 2025 (edited)

Minimal test as a global:

; llama-server --port 8082 --models-max 1 --models-preset backend.ini --webui-config-file frontend.json

[*]
fit = off  ; Disable automatic memory fitting
ngl = 999  ; Full GPU offload
ctk = q8_0 ; KV cache key quantization
ctv = q8_0 ; KV cache value quantization
fa = on    ; Enable flash attention
mlock = on ; Lock model in RAM
np = 4     ; Parallel request batching
kvu = on   ; Unified KV cache buffer
sleep_idle_seconds = 60 ; Testing

[Dense-Devstral-Small-2-24B-Instruct-2512]
m = unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF/Devstral-Small-2-24B-Instruct-2512-UD-Q6_K_XL.gguf
c = 131072
load-on-startup = 1

[my-other-models]
...

Seems to have no effect: the model stays loaded continuously. Perhaps 'load-on-startup = 1' combined with 'sleep_idle_seconds' doesn't work yet? Same for any models I load manually.

This feature will be useful for my real use case: unloading large MoE models that spill over from VRAM into system RAM.

[MoE-Uncensored-GLM-4.5-Air-Derestricted-106B]
m = bartowski/ArliAI_GLM-4.5-Air-Derestricted-GGUF/ArliAI_GLM-4.5-Air-Derestricted-Q4_K_M-00001-of-00002.gguf
n-cpu-moe = 30
c = 32768
sleep_idle_seconds = 60

I'll try this as well.

@ngxson (Collaborator, Author) commented Dec 20, 2025 (edited)

> Seems to have no effect: the model stays loaded continuously. Perhaps 'load-on-startup = 1' combined with 'sleep_idle_seconds' doesn't work yet? Same for any models I load manually.

this feature does not unload the model instance; it is independent from router mode

instead, monitor your log and you will see log lines like this:

que    start_loop: entering sleeping state
srv  handle_sleep: server is entering sleeping state

we don't unload the whole process because of #18189 (comment)


@ServeurpersoCom (Collaborator) commented Dec 20, 2025 (edited)

> > Seems to have no effect: the model stays loaded continuously. Perhaps 'load-on-startup = 1' combined with 'sleep_idle_seconds' doesn't work yet? Same for any models I load manually.
>
> this feature does not unload the model instance; it is independent from router mode
>
> instead, monitor your log and you will see log lines like this:
>
> que    start_loop: entering sleeping state
> srv  handle_sleep: server is entering sleeping state
>
> we don't unload the whole process because of #18189 (comment)

Got it! I was expecting the child process to be killed, but it's an internal model unload within the process itself.
I'll monitor RSS and look for the "entering/exiting sleeping state" log lines.
Testing now with standalone mode first to validate the feature, then I'll integrate it with router mode. Thanks for the clarification!

The internal sleep approach (keeping process alive) is much cleaner than kill/respawn.
Looking forward to the future "sleep levels" feature to fine-tune which components get unloaded!



Reviewers

@ggerganov: awaiting requested review (code owner)

@ServeurpersoCom: awaiting requested review

At least 1 approving review is required to merge this pull request.

Assignees

No one assigned

Labels

examples, python (python script changes), server

Projects

None yet

Milestone

No milestone


2 participants: @ngxson, @ServeurpersoCom
