fix: set restart limits to 0 to prevent being marked as failed #1952
base: develop
samrose left a comment
We'll need to create a testing AMI to thoroughly test these changes. I will ask @LGUG2Z to perform these tests, as he's also going to be helping us find ways to automate these testing approaches.
samrose left a comment
When we ultimately merge this, we should bump the versions in ansible/vars.yml to create a release for these changes. This way, it will be a distinct change instead of bundled with other changes.
cstockton commented Dec 8, 2025
Hi @samrose - I've just updated the branch. Any updates on this?
e097bf1 to 3ef31ba
The systemd default is 10s / 5 for these values with a DefaultRestartUSec of 100ms. Most services set a RestartSec limit of 3, under most circumstances it takes 15s to restart 5 times so the limit of 10s is not exceeded. However if other system processes (salt, cloud init) restart it explicitly, or recovering system services within the --before chain trigger a restart the limit can be exceeded causing it to be marked as failed. Since no services mark gotrue.service as required it will remain offline until the next explicit restart is issued.
Setting these values to 0 with Restart=always and RestartSec=3 will prevent gotrue from being marked as failed.
I've noticed all !oneshot services set a `RestartSec` of `3s` and we use the systemd defaults of `StartLimitBurst=5` and `StartLimitInterval=10s`. Together this forms a property that under typical conditions a service will be restarted indefinitely until it comes back up due to `(3s * 5) > 10s`, but it is still possible for a service to enter a failed state under some scenarios. This change defensively sets them to 0/0 to keep them in restart loops.
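To make the `(3s * 5) > 10s` reasoning concrete, here is a rough simulation. It is only a sketch: the function `first_blocked_start` is hypothetical, the fixed-window limiter is a simplified model rather than systemd's actual code, and the timings (an instant crash, extra explicit restarts at 1s and 2s) are made-up inputs for illustration.

```python
def first_blocked_start(start_times, interval=10.0, burst=5):
    """Return the time of the first start attempt that a simple fixed-window
    limiter would refuse, or None if every attempt is allowed.

    Simplified model (an assumption, not systemd's implementation): a window
    opens at the first start, at most `burst` starts are allowed per window,
    and a new window opens once more than `interval` seconds have elapsed.
    Setting interval or burst to 0 disables the limit in this model.
    """
    if interval == 0 or burst == 0:
        return None
    window_begin = None
    count = 0
    for t in sorted(start_times):
        if window_begin is None or t - window_begin > interval:
            # Window expired (or first start): open a fresh window.
            window_begin, count = t, 1
        elif count < burst:
            count += 1
        else:
            # Burst already used up inside the current window.
            return t
    return None


# Automatic restarts only: one start every 3 s (RestartSec=3, instant crash assumed).
auto_restarts = [3.0 * i for i in range(20)]

# The same loop plus two hypothetical explicit restarts from other tooling.
with_explicit = auto_restarts + [1.0, 2.0]

print(first_blocked_start(auto_restarts))                        # None
print(first_blocked_start(with_explicit))                        # 9.0
print(first_blocked_start(with_explicit, interval=0, burst=0))   # None
```

In this model the 3s restart loop alone never uses up the burst of 5 inside a 10s window, two extra starts are enough to get a later start refused, and the 0/0 setting removes the limit entirely.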
3ef31ba to c89c805
samrose left a comment
Just needs a rebase
samrose left a comment
I would like infra data @Crispy1975 or @delgado3d to review when they have some time. I'm just being defensive about changes that could impact stability, and we need more eyes on these changes.