WIP feat: support retries/hard failure limit for prebuilds in reconcile prior to provisioner #21326


Open
cstyan wants to merge 5 commits into main from callum/prebuild-fail-interval

@cstyan (Contributor)

This PR is meant to address the heavy CPU/memory usage we recently saw in profiling related to prebuilds, specifically the endless retrying of prebuilds for presets that have invalid config and thus will never succeed.

wsbuilder.Build is where this error can occur, though IIUC transient errors can also occur there (for example, network connection issues or DB load), so the changes here are twofold:

  • a backoff retry in the reconcile loop for prebuilds that hit 500-level error codes (transient issues)
  • a DB entry written for 400-level errors (config-related issues), which will never succeed until the template is updated; these are the same records that the reconcile loop should already be looking for as part of the GetPresetsAtFailureLimit query

Callum Styan and others added 5 commits December 16, 2025 23:51
The prebuilds reconcile loop can currently spam workspace creation attempts that are always going to fail, such as when required dynamic parameters are missing or unresolved.

This change introduces an in-memory per-preset backoff mechanism that tracks consecutive creation failures and delays subsequent creation attempts using linear backoff:
- First failure: backoff for 1x interval (default 1 minute)
- Second consecutive failure: backoff for 2x interval (2 minutes)
- Third consecutive failure: backoff for 3x interval (3 minutes)
- And so on...

When a creation succeeds, the failure tracking is cleared and any subsequent failure starts backoff from 1x interval again.

This complements the existing database-based backoff system by preventing immediate retry spam when creation fails quickly (e.g., due to missing parameters), while still allowing periodic retries and recovery when issues are fixed.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

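
The linear schedule described in this commit message can be sketched as follows. This is a minimal illustration, not the PR's code: the function name backoffDuration and its signature are assumptions, and only the 1x/2x/3x schedule and the 1 minute default interval come from the message.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDuration is a hypothetical helper (not taken from the PR).
// It returns the linear backoff for a number of consecutive failures:
// n failures wait n * interval, and zero failures wait nothing.
func backoffDuration(consecutiveFailures int, interval time.Duration) time.Duration {
	if consecutiveFailures <= 0 {
		return 0
	}
	return time.Duration(consecutiveFailures) * interval
}

func main() {
	interval := time.Minute // default interval named in the commit message
	for n := 1; n <= 3; n++ {
		fmt.Printf("failure %d: back off for %s\n", n, backoffDuration(n, interval))
	}
}
```

Linear growth keeps retries frequent enough to recover soon after a fix, at the cost of never capping the delay; whether the PR applies a cap is not stated in the message.
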
Encapsulates the failure tracking logic into a dedicated presetFailuresTracker struct with its own methods, improving code organization and separation of concerns.

Changes:
- Created presetFailuresTracker struct with failures map and mutex
- Moved RecordFailure, RecordSuccess, and ShouldBackoff methods to the tracker
- Updated StoreReconciler to hold a failureTracker instance
- Updated all call sites to use the new tracker methods

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

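
A tracker of the shape this commit describes might look roughly like the sketch below. The struct name and the three method names come from the commit message; everything else is an assumption made for illustration, including the string preset ID, the failureState type, the injectable now clock, the constructor, and the exact method signatures.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// failureState is a hypothetical per-preset record (not from the PR).
type failureState struct {
	consecutive int
	lastFailure time.Time
}

// presetFailuresTracker holds a failures map guarded by a mutex, as the
// commit message describes. Field layout and key type are assumptions.
type presetFailuresTracker struct {
	mu       sync.Mutex
	failures map[string]failureState
	interval time.Duration
	now      func() time.Time // injectable clock, assumed for testability
}

func newPresetFailuresTracker(interval time.Duration) *presetFailuresTracker {
	return &presetFailuresTracker{
		failures: make(map[string]failureState),
		interval: interval,
		now:      time.Now,
	}
}

// RecordFailure increments the consecutive failure count for a preset.
func (t *presetFailuresTracker) RecordFailure(presetID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	s := t.failures[presetID]
	s.consecutive++
	s.lastFailure = t.now()
	t.failures[presetID] = s
}

// RecordSuccess clears tracking so the next failure starts at 1x interval.
func (t *presetFailuresTracker) RecordSuccess(presetID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.failures, presetID)
}

// ShouldBackoff reports whether the linear backoff window is still open.
func (t *presetFailuresTracker) ShouldBackoff(presetID string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	s, ok := t.failures[presetID]
	if !ok {
		return false
	}
	wait := time.Duration(s.consecutive) * t.interval
	return t.now().Before(s.lastFailure.Add(wait))
}

func main() {
	tracker := newPresetFailuresTracker(time.Minute)
	tracker.RecordFailure("preset-a")
	fmt.Println("backoff after failure:", tracker.ShouldBackoff("preset-a"))
	tracker.RecordSuccess("preset-a")
	fmt.Println("backoff after success:", tracker.ShouldBackoff("preset-a"))
}
```

Because the state lives only in memory, it would be lost on restart; the commit messages describe the database-backed hard limit as the durable counterpart.
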
…hit hard limit

Previously, the failure tracking applied backoff to all creation errors regardless of type. This meant that permanent configuration errors (400s) would be retried with backoff, delaying the hard failure limit detection.

Now we distinguish between error types:
- Transient errors (500-level): Database failures, network issues, etc. These trigger the backoff mechanism to reduce spam during outages.
- Config errors (400-level): Missing parameters, validation failures, etc. These skip backoff and fail immediately, counting toward the hard limit.

This allows presets with permanent issues (e.g., missing dynamic parameters) to hit the hard failure limit (default 3 consecutive failures) and get marked as PrebuildStatusHardLimited, while transient infrastructure issues get incremental backoff without blocking the hard limit detection.

The hard limit system (GetPresetsAtFailureLimit) tracks the last N builds in the database. When all N fail with job_status='failed', the preset is marked hard-limited and creation is blocked entirely until the issue is resolved.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

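
The transient versus config split described here could be expressed along these lines. This is a sketch under stated assumptions: statusError is a hypothetical stand-in for whatever error type carries the HTTP status out of wsbuilder.Build, isTransient is an invented name, and treating errors without a status as transient is a choice made for this example only.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// statusError is a hypothetical error type carrying an HTTP status code.
// It is not the type used by wsbuilder.
type statusError struct {
	Status int
	Msg    string
}

func (e *statusError) Error() string {
	return fmt.Sprintf("%d: %s", e.Status, e.Msg)
}

// isTransient reports whether an error should trigger in-memory backoff
// (500-level) rather than count toward the hard failure limit (400-level).
func isTransient(err error) bool {
	var se *statusError
	if errors.As(err, &se) {
		return se.Status >= http.StatusInternalServerError
	}
	// Assumption: errors with no status are treated as transient.
	return true
}

func main() {
	dbErr := fmt.Errorf("create prebuild: %w", &statusError{Status: 500, Msg: "database unavailable"})
	cfgErr := fmt.Errorf("create prebuild: %w", &statusError{Status: 400, Msg: "missing required parameter"})

	fmt.Println("db error transient:", isTransient(dbErr))
	fmt.Println("config error transient:", isTransient(cfgErr))
}
```

Using errors.As means the classification still works when the status-carrying error has been wrapped with additional context on its way up to the reconcile loop.
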
Previously, when wsbuilder.Build() failed with config errors (400-level), no workspace or build records were created because the transaction rolled back. This meant config errors never counted toward the hard failure limit, which queries workspace_latest_builds.job_status in the database.

Now when Build() fails with a non-transient error (non-500):
1. The workspace is created
2. A provisioner job is created and immediately marked as failed
3. A workspace build is created linking to the failed job
4. The transaction commits, preserving the failure record

This allows the hard limit detection (GetPresetsAtFailureLimit) to see these failures and mark presets as PrebuildStatusHardLimited after N consecutive config errors (default 3).

The flow is now:
- Transient errors (500s): Trigger in-memory backoff, no DB record
- Config errors (400s): Create failed DB record, count toward hard limit
- After N config failures: Preset marked hard-limited, creation blocked

Implementation details:
- createFailedBuildRecord() mimics successful build creation but marks job as failed immediately using UpdateProvisionerJobWithCompleteWithStartedAtByID
- Uses dbauthz.AsProvisionerd(ctx) to authorize the job completion
- job_status is a generated column: becomes 'failed' when completed_at is set and error is non-empty

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

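
The hard-limit rule these commit messages describe (the last N builds for a preset all have job_status='failed') can be illustrated with a small sketch. It is not the GetPresetsAtFailureLimit query, whose SQL is not shown in this conversation. The function atFailureLimit, the newest-first slice of status strings, and the decision to return false when fewer than N builds exist are all assumptions for illustration.

```go
package main

import "fmt"

// atFailureLimit is a hypothetical in-memory version of the hard-limit check.
// statuses holds a preset's build job statuses ordered newest first.
// It reports whether the most recent `limit` builds all failed.
func atFailureLimit(statuses []string, limit int) bool {
	// Assumption: with fewer than `limit` builds there is not yet
	// enough history to hard-limit the preset.
	if limit <= 0 || len(statuses) < limit {
		return false
	}
	for _, s := range statuses[:limit] {
		if s != "failed" {
			return false
		}
	}
	return true
}

func main() {
	const limit = 3 // default hard failure limit named in the commit message

	fmt.Println(atFailureLimit([]string{"failed", "failed", "failed"}, limit))
	fmt.Println(atFailureLimit([]string{"failed", "succeeded", "failed"}, limit))
	fmt.Println(atFailureLimit([]string{"failed", "failed"}, limit))
}
```

This also shows why the failed build record matters: if a config error rolls the transaction back, nothing is added to the status history and the limit can never be reached.
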
This test verifies that when createPrebuiltWorkspace encounters a config error (HTTP 400-level error from wsbuilder.Build), it creates a failed build record in the database so the error counts toward the hard failure limit.

The test (TestConfigErrorCreatesFailedBuildRecord):
- Creates a template with a required mutable parameter
- Creates a preset without providing the required parameter
- Runs reconciliation which triggers wsbuilder.Build failure
- Verifies workspace and failed build record are created
- Verifies provisioner job has job_status='failed'
- Verifies the failure appears in GetPresetsAtFailureLimit query

Note: This test requires postgres (via dbtestutil.NewDB) to run the full reconciliation flow including complex SQL queries. The test will run successfully in CI where postgres is available.

This completes the config error handling implementation from commit 4, ensuring that permanent configuration issues properly count toward the hard failure limit and trigger preset suspension after N failures.

@github-actions

Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by posting a Pull Request comment in the format below.

I have read the CLA Document and I hereby sign the CLA

Callum Styan seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.


Reviewers

@ssncferreira (awaiting requested review)

At least 1 approving review is required to merge this pull request.

Assignees

@cstyan
