- Notifications
You must be signed in to change notification settings - Fork1k
experiment: provisionerdaemon - investigate intermittent job wait failure#146
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
experiment: provisionerdaemon - investigate intermittent job wait failure#146
Uh oh!
There was an error while loading.Please reload this page.
Conversation
codecovbot commentedFeb 2, 2022 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
Codecov Report
@@ Coverage Diff @@## provisionerdaemon #146 +/- ##=====================================================- Coverage 67.35% 67.33% -0.03%===================================================== Files 101 101 Lines 5098 5100 +2 Branches 68 68 ===================================================== Hits 3434 3434+ Misses 1357 1354 -3- Partials 307 312 +5
Continue to review full report at Codecov.
|
Uh oh!
There was an error while loading.Please reload this page.
Investigating failures like this:https://github.com/coder/coder/runs/5043435263?check_suite_focus=true#step:7:32
where it looks like the job is completed, but the completion condition is never satisfied
Runs
Issue 1: Data race in
p.acquiredJobDone
context (provisioners.d
)Failure trace:https://github.com/coder/coder/runs/5044320845?check_suite_focus=true#step:7:84
There is a race in the
p.acquiredJobDone
chan - in particular, there can be a case where we're waiting on the channel to finish (in close) with<-p.acquiredJobDone
, but in parallel, anacquireJob
could've been started, which would create a new channel forp.acquiredJobDone
.The fix I tried was to also grab the
acquiredJobMutex
in theClose
function. This, at first, caused a deadlock - because there was another case where the mutexes could be grabbed in reverse order (acquiredJobMutex
-> thencloseMutex
). That other place, though, was storing a bool in an atomic, so actually didn't need the mutex guard.Attempted fix here:42ce721
Still hit a related race:https://github.com/coder/coder/runs/5044320845?check_suite_focus=true#step:7:84
So tried a second fix here:84dd68a
The second fix didn't work, trying to switch from chan -> wait group:a8725cd