WIP feat: support retries/hard failure limit for prebuilds in reconcile prior to provisioner #21326
The prebuilds reconcile loop can currently spam workspace creation attempts that are always going to fail, such as when required dynamic parameters are missing or unresolved.

This change introduces an in-memory per-preset backoff mechanism that tracks consecutive creation failures and delays subsequent creation attempts using linear backoff:
- First failure: backoff for 1x interval (default 1 minute)
- Second consecutive failure: backoff for 2x interval (2 minutes)
- Third consecutive failure: backoff for 3x interval (3 minutes)
- And so on...

When a creation succeeds, the failure tracking is cleared and any subsequent failure starts backoff from 1x interval again.

This complements the existing database-based backoff system by preventing immediate retry spam when creation fails quickly (e.g., due to missing parameters), while still allowing periodic retries and recovery when issues are fixed.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
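The linear backoff schedule described in this commit can be sketched as a small pure function. This is a minimal illustration, not the code from the PR: the names `linearBackoff` and `defaultInterval` are hypothetical, and only the "N consecutive failures means N x interval" rule and the 1 minute default come from the commit message.

```go
package main

import "time"

// defaultInterval mirrors the "default 1 minute" base interval from the
// commit message. The constant name is an assumption for this sketch.
const defaultInterval = time.Minute

// linearBackoff returns how long creation should be delayed after the given
// number of consecutive failures: failures x interval.
// A count of zero or less means there is nothing to back off from.
func linearBackoff(failures int, interval time.Duration) time.Duration {
	if failures <= 0 {
		return 0
	}
	return time.Duration(failures) * interval
}
```

With the default interval, `linearBackoff(1, defaultInterval)` is one minute and `linearBackoff(3, defaultInterval)` is three minutes, matching the list above. After a success the count resets, so the next failure starts from 1x again.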
Encapsulates the failure tracking logic into a dedicated presetFailuresTracker struct with its own methods, improving code organization and separation of concerns.

Changes:
- Created presetFailuresTracker struct with failures map and mutex
- Moved RecordFailure, RecordSuccess, and ShouldBackoff methods to the tracker
- Updated StoreReconciler to hold a failureTracker instance
- Updated all call sites to use the new tracker methods

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
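For readers who have not opened the diff, a tracker of this shape might look roughly like the sketch below. Only the struct name, the failures map, the mutex, and the three method names come from the commit message. The field layout, the method signatures, the string preset IDs, the constructor, and `defaultBackoffInterval` are all assumptions made to keep the example self-contained.

```go
package main

import (
	"sync"
	"time"
)

// defaultBackoffInterval is a hypothetical name for the 1 minute base interval.
const defaultBackoffInterval = time.Minute

// presetFailure is a hypothetical record of one preset's failure streak.
type presetFailure struct {
	count       int       // consecutive creation failures
	lastFailure time.Time // when the most recent failure was recorded
}

// presetFailuresTracker tracks consecutive creation failures per preset.
// Preset IDs are plain strings here; the real code may use another ID type.
type presetFailuresTracker struct {
	mu       sync.Mutex
	interval time.Duration
	failures map[string]presetFailure
}

// newPresetFailuresTracker is an assumed constructor for this sketch.
func newPresetFailuresTracker(interval time.Duration) *presetFailuresTracker {
	return &presetFailuresTracker{
		interval: interval,
		failures: make(map[string]presetFailure),
	}
}

// RecordFailure extends the preset's failure streak by one.
func (t *presetFailuresTracker) RecordFailure(presetID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	f := t.failures[presetID]
	f.count++
	f.lastFailure = time.Now()
	t.failures[presetID] = f
}

// RecordSuccess clears the streak so the next failure starts at 1x interval.
func (t *presetFailuresTracker) RecordSuccess(presetID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.failures, presetID)
}

// ShouldBackoff reports whether the preset is still inside its linear
// backoff window (count x interval since the last failure).
func (t *presetFailuresTracker) ShouldBackoff(presetID string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	f, ok := t.failures[presetID]
	if !ok {
		return false
	}
	wait := time.Duration(f.count) * t.interval
	return time.Since(f.lastFailure) < wait
}
```

In this sketch the reconciler would call `ShouldBackoff` before attempting creation, then `RecordFailure` or `RecordSuccess` depending on the outcome. State lives only in memory, so a restart clears all streaks.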
…hit hard limit

Previously, the failure tracking applied backoff to all creation errors regardless of type. This meant that permanent configuration errors (400s) would be retried with backoff, delaying the hard failure limit detection.

Now we distinguish between error types:
- Transient errors (500-level): Database failures, network issues, etc. These trigger the backoff mechanism to reduce spam during outages.
- Config errors (400-level): Missing parameters, validation failures, etc. These skip backoff and fail immediately, counting toward the hard limit.

This allows presets with permanent issues (e.g., missing dynamic parameters) to hit the hard failure limit (default 3 consecutive failures) and get marked as PrebuildStatusHardLimited, while transient infrastructure issues get incremental backoff without blocking the hard limit detection.

The hard limit system (GetPresetsAtFailureLimit) tracks the last N builds in the database. When all N fail with job_status='failed', the preset is marked hard-limited and creation is blocked entirely until the issue is resolved.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
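The 400-level versus 500-level split could be expressed along the lines of the sketch below. It assumes the build error carries an HTTP-style status code. The `buildError` type, the `errorClass` values, and `classifyBuildError` are hypothetical stand-ins defined here, not the types used in the PR, and treating everything else as "unknown" is a choice made for this example only.

```go
package main

import "errors"

// buildError is a hypothetical stand-in for an error that carries an
// HTTP-style status code, as the commit message implies.
type buildError struct {
	Status  int
	Message string
}

func (e *buildError) Error() string { return e.Message }

// errorClass is an assumed classification used only in this sketch.
type errorClass int

const (
	classUnknown   errorClass = iota // no status code available
	classTransient                   // 500-level: back off in memory
	classConfig                      // 400-level: count toward the hard limit
)

// classifyBuildError maps an error to a class based on its status code.
// errors.As also finds a *buildError wrapped inside another error.
func classifyBuildError(err error) errorClass {
	var be *buildError
	if !errors.As(err, &be) {
		return classUnknown
	}
	switch {
	case be.Status >= 500 && be.Status < 600:
		return classTransient
	case be.Status >= 400 && be.Status < 500:
		return classConfig
	default:
		return classUnknown
	}
}
```

How errors without a status code should be handled is not stated in the commit message, so the sketch leaves that decision to the caller by returning `classUnknown`.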
Previously, when wsbuilder.Build() failed with config errors (400-level), no workspace or build records were created because the transaction rolled back. This meant config errors never counted toward the hard failure limit, which queries workspace_latest_builds.job_status in the database.

Now when Build() fails with a non-transient error (non-500):
1. The workspace is created
2. A provisioner job is created and immediately marked as failed
3. A workspace build is created linking to the failed job
4. The transaction commits, preserving the failure record

This allows the hard limit detection (GetPresetsAtFailureLimit) to see these failures and mark presets as PrebuildStatusHardLimited after N consecutive config errors (default 3).

The flow is now:
- Transient errors (500s): Trigger in-memory backoff, no DB record
- Config errors (400s): Create failed DB record, count toward hard limit
- After N config failures: Preset marked hard-limited, creation blocked

Implementation details:
- createFailedBuildRecord() mimics successful build creation but marks job as failed immediately using UpdateProvisionerJobWithCompleteWithStartedAtByID
- Uses dbauthz.AsProvisionerd(ctx) to authorize the job completion
- job_status is a generated column: becomes 'failed' when completed_at is set and error is non-empty

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
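To make the hard limit rule concrete, here is an in-memory analogue of "the last N builds all have job_status='failed'". It is not the GetPresetsAtFailureLimit SQL query. The function name `atFailureLimit`, the newest-first ordering of the slice, and the handling of short histories and non-positive limits are assumptions for illustration.

```go
package main

// jobStatusFailed is the job_status value the hard limit check looks for.
const jobStatusFailed = "failed"

// atFailureLimit reports whether the most recent `limit` builds all failed.
// recentStatuses is assumed to be ordered newest first. A preset with fewer
// than `limit` builds, or a non-positive limit, is never considered
// hard-limited in this sketch.
func atFailureLimit(recentStatuses []string, limit int) bool {
	if limit <= 0 || len(recentStatuses) < limit {
		return false
	}
	for _, status := range recentStatuses[:limit] {
		if status != jobStatusFailed {
			return false
		}
	}
	return true
}
```

With the default limit of 3, three failed config-error builds in a row satisfy the check, while a single successful build among the last three does not. This is why the failed build records have to be committed: without them there is nothing for the check to count.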
This test verifies that when createPrebuiltWorkspace encounters a config error (HTTP 400-level error from wsbuilder.Build), it creates a failed build record in the database so the error counts toward the hard failure limit.

The test (TestConfigErrorCreatesFailedBuildRecord):
- Creates a template with a required mutable parameter
- Creates a preset without providing the required parameter
- Runs reconciliation which triggers wsbuilder.Build failure
- Verifies workspace and failed build record are created
- Verifies provisioner job has job_status='failed'
- Verifies the failure appears in GetPresetsAtFailureLimit query

Note: This test requires postgres (via dbtestutil.NewDB) to run the full reconciliation flow including complex SQL queries. The test will run successfully in CI where postgres is available.

This completes the config error handling implementation from commit 4, ensuring that permanent configuration issues properly count toward the hard failure limit and trigger preset suspension after N failures.
I have read the CLA Document and I hereby sign the CLA. Callum Styan seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
This PR is meant to address the recent issues we saw with heavy CPU/memory usage in profiling related to prebuilds, specifically endless retrying of prebuilds for presets that have invalid config and thus will never succeed.
wsbuilder.Build is where this error can occur, though IIUC there can also be transient errors there (for example, network connection issues or DB load), so the changes here are twofold:
GetPresetsAtFailureLimit query