- Notifications
You must be signed in to change notification settings - Fork1.1k
fix: send prebuild job notification after job build db commit#20693
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
base:main
Are you sure you want to change the base?
Conversation
c31db3d to2c77682Compare2c77682 tof161c82Compare| // Publish provisioner job event to notify the acquirer that a new job was posted | ||
| c.publishProvisionerJob(ctx,provisionerJob,prebuiltWorkspaceID) | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Should we only do this ifprovisionerJob is not nil?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
provisionerJob should never benil, we already check it inprovision and return an error in that case:
coder/enterprise/coderd/prebuilds/reconcile.go
Lines 930 to 934 inf161c82
| ifprovisionerJob==nil { | |
| // This should not happen, builder.Build() should either return a job or an error. | |
| // Returning an error to fail fast if we hit this unexpected case. | |
| returnnil,xerrors.Errorf("provision succeeded but returned no job") | |
| } |
Nevertheless, I'm also checking inpublishProvisionerJob if the job isnil, just in case:
coder/enterprise/coderd/prebuilds/reconcile.go
Lines 947 to 949 inf161c82
| ifprovisionerJob==nil { | |
| return | |
| } |
| // Publish provisioner job events to notify the acquirer that new jobs were posted | ||
| forworkspaceID,job:=rangeprovisionerJobs { | ||
| c.publishProvisionerJob(ctx,job,workspaceID) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
If the same channel name is signaled multiple times with identical payload strings within the same transaction, only one instance of the notification event is delivered to listeners.
We are now publishing these job events outside of the transaction, so we don't get this automatic de-duping. I'm not sure if this will cause issues with larger deployments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
AFAIK, we were never publishing the events within the transaction. Previously, we would send the job to theprovisionNotifyCh channel, which would then publish the job outside the transactionhttps://github.com/coder/coder/blob/main/enterprise/coderd/prebuilds/reconcile.go#L164
Therefore, this deduplication was not happening in the previous implementation as well. The only difference here is that we guarantee that we only publish the job after the transaction commits, ensuring provisioners can see the job when they query the database.
Also, there seems to be a rule to avoid calling publish within a transaction:https://github.com/coder/coder/blob/8274251f/scripts/rules.go#L143
Nevertheless, this is a good point about deduping. Because it seems the job is published as (link):
msg, err := json.Marshal(JobPosting{ OrganizationID: job.OrganizationID, ProvisionerType: job.Provisioner, Tags: job.Tags,})Meaning that all of these canceled jobs would have the same payload. So theoretically, we only need to send one job event, right? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Therefore, this deduplication was not happening in the previous implementation as well.
Fair enough 👍 Just wanted to make sure this was called out.
Also, there seems to be a rule to avoid calling publish within a transaction:https://github.com/coder/coder/blob/8274251f/scripts/rules.go#L143
I completely forgot about that 😁 good callout!
Meaning that all of these canceled jobs would have the same payload. So theoretically, we only need to send one job event, right? 🤔
That's how I understand it, yes!provisionerdserver.Acquirer listens to allprovisioner_job_posted events and 'wakes up' its provisioner if the tags match.
EDIT: now that I think about it more, I'm not sure that pubsub events actually impact canceled jobs.
Uh oh!
There was an error while loading.Please reload this page.
Problem
Fix race condition in prebuilds reconciler. Previously, a job notification event was sent inside the provisioning database transaction. This means that if a provisioner daemon received the notification and queried for the job before the transaction commits, it won't find the job in the database.
This manifested as a flake failure in
TestReinitializeAgent, where provisioners would occasionally miss newly created jobs.The solution consists in moving the job notification outside the database transaction to ensure it's only sent after the transaction commits.
Changes
provision()andprovisionDelete()functions now return the provisioner job instead of sending notifications internally.publishProvisionerJob()helper centralizes the notification logic and is called after each transaction completes.Closes:coder/internal#963