server: add auto-sleep after N seconds of idle #18228
ServeurpersoCom commented Dec 20, 2025
Another cool feature! Rebased it on my testing-branch+master to test it out!
ServeurpersoCom commented Dec 20, 2025 • edited
Minimal test as a global: Seems to have no effect: the model stays loaded continuously. Perhaps 'load-on-startup = 1' combined with 'sleep_idle_seconds' doesn't work yet? Same for any models I load manually. This feature will be useful for my real use case: unloading large MoE models that spill over from VRAM into system RAM. I will try this too.
ngxson commented Dec 20, 2025 • edited
This feature does not unload the model instance; it is independent of router mode. Instead, monitor your log and you will see log lines like this:
We don't unload the whole process because of #18189 (comment)
ServeurpersoCom commented Dec 20, 2025 • edited
Got it! I was expecting the child process to be killed, but it's an internal model unload within the process itself. The internal sleep approach (keeping the process alive) is much cleaner than kill/respawn.
Sleeping on Idle
The server supports an automatic sleep mode that activates after a specified period of inactivity (no incoming tasks). This feature, introduced in PR #18228, can be enabled using the --sleep-idle-seconds command-line argument. It works seamlessly in both single-model and multi-model configurations.

When the server enters sleep mode, the model and its associated memory (including the KV cache) are unloaded from RAM to conserve resources. Any new incoming task will automatically trigger the model to reload.
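
The sleep/wake behaviour described above can be pictured as a small state machine. The following is only a minimal sketch under stated assumptions, not the code from the PR: the type idle_sleeper, its members, and the alias sleep_clock are hypothetical names, and timestamps are passed in explicitly so the logic can be followed without real waiting.

```cpp
#include <chrono>

// Hypothetical names throughout; this is not the actual llama.cpp server code.
using sleep_clock = std::chrono::steady_clock;

struct idle_sleeper {
    std::chrono::seconds    idle_limit; // the value given to --sleep-idle-seconds
    sleep_clock::time_point last_task;  // time of the most recent incoming task
    bool sleeping = false;
    int  loads    = 1;                  // the model is assumed loaded at startup

    idle_sleeper(std::chrono::seconds limit, sleep_clock::time_point now)
        : idle_limit(limit), last_task(now) {}

    // Called periodically: enter sleep once the idle limit has elapsed.
    void tick(sleep_clock::time_point now) {
        if (!sleeping && now - last_task >= idle_limit) {
            sleeping = true; // the model and KV cache would be freed here
        }
    }

    // Called for every incoming task: reload first if asleep, then reset the timer.
    void on_task(sleep_clock::time_point now) {
        if (sleeping) {
            sleeping = false; // the model would be reloaded here
            ++loads;
        }
        last_task = now;
    }
};
```

In this sketch the idle timer is reset only by on_task, which matches the description that inactivity means no incoming tasks.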
Note that the following endpoints are exempt from being considered as incoming tasks. They do not trigger model reloading and do not reset the idle timer:
- GET /health
- GET /props

Implementation
The implementation of this feature consists of 3 main parts:
- server_queue sleeping state
- server_context sleeping state
- server_res_generator hook

The main loop inside server_queue acts as a watchdog timer (so we can avoid spawning a dedicated thread just for the watchdog). Once the idle timeout has elapsed, it signals server_context to unload the model.

server_res_generator hooks into every incoming request and asks server_queue to resume if it is in the sleeping state. Note that some requests, like /health, bypass this check (they can only access read-only data of server_context).

When asked to resume, server_queue signals server_context to reload the models, then unblocks server_res_generator so it can proceed with the rest of the request.