- Notifications
You must be signed in to change notification settings - Fork1k
chore: run macOS, windows, and race tests with Postgres in CI#15520
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Uh oh!
There was an error while loading.Please reload this page.
Conversation
0733229
tob1c3974
CompareThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
LGTM but It'd be good to hear from Dean who has a lot more experience working with our CI
.github/workflows/ci.yaml Outdated
api-key:${{ secrets.DATADOG_API_KEY }} | ||
# NOTE: this could instead be defined as a matrix strategy, but we want to | ||
# temporarily allow windows tests to fail. Using a matrix strategy here makes |
ethanndicksonNov 18, 2024 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I see the Windows tests are broken, I'm not sure if we want to merge a CI suite that often fails, even if it's not required (could slightly obscure real issues + not great optics). Perhaps it should just get skipped/commented out until we're able to improve the reliability?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Since we do have the Windows tests covered by in memory for now, maybe these should be disabled for now to avoid ❌ appearing everywhere. Are there any issues tracking the windows tests breaking in postgres?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Are there any issues tracking the windows tests breaking in postgres?
I just added#15560.
for now, maybe these should be disabled for now to avoid ❌ appearing everywhere
Makes sense. I'll postpone merging this PR until Windows tests are fixed. I'll try to get it done this week.
.github/workflows/ci.yaml Outdated
runs-on:${{ github.repository_owner == 'coder' && 'depot-ubuntu-22.04-8' || 'ubuntu-latest' }} | ||
needs: | ||
-changes | ||
runs-on:${{ matrix.os == 'ubuntu-latest' && github.repository_owner == 'coder' && 'depot-ubuntu-22.04-4' || matrix.os == 'macos-latest' && github.repository_owner == 'coder' && 'macos-latest-xlarge' || matrix.os }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Holy mother of github actions interpolation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
We also have this ongo-test
😭
// We execute these queries instead of using the embeddedpostgres | ||
// StartParams because it doesn't work on Windows. The library | ||
// seems to have a bug where it sends malformed parameters to | ||
// pg_ctl. It encloses each parameter in single quotes, which | ||
// Windows can't handle. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Can we open a PR upstream to fix this bug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Sure. I openedfergusstrange/embedded-postgres#145 and I could submit a PR later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Can you add the issue tracking link to this comment?
Uh oh!
There was an error while loading.Please reload this page.
We're going to postpone merging this PR until Windows tests are fixed.#15560 |
Patches tests that caused Windows Postgres CI in#15520 to consistently fail.I tested this by temporarily adding Postgres Windows CI to this PR.However, I reverted those changes to merge them with#15520. For reference, here's [apassing CIrun](https://github.com/coder/coder/actions/runs/11918816662/job/33219786238)from an earlier commit.**Note:** Although Windows tests now pass, they remain quite flaky. Irecommend running Postgres Windows CI to gather data on these flakes,but I don’t think it should be a required job just yet.
b3e3fac
to2739223
Comparef1394a8
to160e77c
Comparehugodutka commentedDec 2, 2024 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
@deansheather@ethanndickson The Windows Postgres tests seem stable now. I made one major change in this PR to fix them: Windows now sets up a RAM disk for Postgres. I found that the flakiness was due to timing-sensitive tests failing because the suite ran slowly on Windows. The root cause wasIO performance on GitHub-hosted Windows runners. Using a RAM disk bypasses this issue, and test times are back to normal. I’ve also made the Windows PG tests required since they’re no longer that flaky - I see an odd flake every couple of runs. I’d appreciate a second look before merging this into main. Thanks! |
curl -L -o files.cab https://github.com/coder/imdisk-artifacts/raw/92a17839ebc0ee3e69be019f66b3e9b5d2de4482/files.cab | ||
curl -L -o install.bat https://github.com/coder/imdisk-artifacts/raw/92a17839ebc0ee3e69be019f66b3e9b5d2de4482/install.bat |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Sucks that people still use sourceforge lol
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
That's Windows for you
// We execute these queries instead of using the embeddedpostgres | ||
// StartParams because it doesn't work on Windows. The library | ||
// seems to have a bug where it sends malformed parameters to | ||
// pg_ctl. It encloses each parameter in single quotes, which | ||
// Windows can't handle. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Can you add the issue tracking link to this comment?
c7c35ef
intomainUh oh!
There was an error while loading.Please reload this page.
We have an effort underway to replace `dbmem` (#15109), and consequentlywe've begun running our full test-suite (with Postgres) on all supportedOSs - Windows, MacOS, and Linux, since#15520.Since this change, we've seen a marked decrease in the success rate ofour builds on `main` (note how the Windows/MacOS failures account forthe vast majority of failed builds):We're still investigating why these OSs are a lot less reliable. It'slikely that the VMs on which the builds are run have differentcharacteristics from our Ubuntu runners such as disk I/O, networklatency, or something else.**In the meantime, we need to start trusting CI failures in `main`again, as the current failures are too noisy / vague for us tocorrect.**We've also considered hosting our own runners where possible so we canget OS-level observability to rule out some possibilities.See the [meetingnotes](https://www.notion.so/coderhq/CI-Investigation-Call-Notes-17dd579be59280d8897cc9fe4bb46695?pvs=6&utm_content=17dd579b-e592-80d8-897c-c9fe4bb46695&utm_campaign=T1ZPT2FL0&n=slack&n=slack_link_unfurl)where we linked into this for more detail.This PR introduces several changes:1. Moves the full test-suite with Postgres on Windows/MacOS to the`nightly-gauntlet` workflowtradeoff: this means that any regressions may be more difficult todiscover since we merge to main several times a day2. Run only the CLI test-suite on each PR / merge to `main` onWindows/MacOS3. `test-go` is still running the full test-suite against all OSs(including the CLI ones), but will soon be removed once#15109 iscompleted since it uses `dbmem`4. Changes `nightly-gauntlet` to run at 4AM: we've seen severalinstances of the runner being stopped externally, and we're _guessing_this may have something to do with the midnight UTC execution time, whenother cron jobs may run5. Removes the existing `nightly-gauntlet` jobs since they haven'tpassed in a long time, indicating that nobody cares enough to fix themand they don't provide diagnostic value; we can restore them later ifnecessaryI've manually run both these new workflows successfully:- `ci`:https://github.com/coder/coder/actions/runs/12825874176/job/35764724907- `nightly-gauntlet`:https://github.com/coder/coder/actions/runs/12825539092---------Signed-off-by: Danny Kopping <danny@coder.com>Co-authored-by: Muhammad Atif Ali <atif@coder.com>
We have an effort underway to replace `dbmem` (#15109), and consequentlywe've begun running our full test-suite (with Postgres) on all supportedOSs - Windows, MacOS, and Linux, since#15520.Since this change, we've seen a marked decrease in the success rate ofour builds on `main` (note how the Windows/MacOS failures account forthe vast majority of failed builds):We're still investigating why these OSs are a lot less reliable. It'slikely that the VMs on which the builds are run have differentcharacteristics from our Ubuntu runners such as disk I/O, networklatency, or something else.**In the meantime, we need to start trusting CI failures in `main`again, as the current failures are too noisy / vague for us tocorrect.**We've also considered hosting our own runners where possible so we canget OS-level observability to rule out some possibilities.See the [meetingnotes](https://www.notion.so/coderhq/CI-Investigation-Call-Notes-17dd579be59280d8897cc9fe4bb46695?pvs=6&utm_content=17dd579b-e592-80d8-897c-c9fe4bb46695&utm_campaign=T1ZPT2FL0&n=slack&n=slack_link_unfurl)where we linked into this for more detail.This PR introduces several changes:1. Moves the full test-suite with Postgres on Windows/MacOS to the`nightly-gauntlet` workflowtradeoff: this means that any regressions may be more difficult todiscover since we merge to main several times a day2. Run only the CLI test-suite on each PR / merge to `main` onWindows/MacOS3. `test-go` is still running the full test-suite against all OSs(including the CLI ones), but will soon be removed once#15109 iscompleted since it uses `dbmem`4. Changes `nightly-gauntlet` to run at 4AM: we've seen severalinstances of the runner being stopped externally, and we're _guessing_this may have something to do with the midnight UTC execution time, whenother cron jobs may run5. Removes the existing `nightly-gauntlet` jobs since they haven'tpassed in a long time, indicating that nobody cares enough to fix themand they don't provide diagnostic value; we can restore them later ifnecessaryI've manually run both these new workflows successfully:- `ci`:https://github.com/coder/coder/actions/runs/12825874176/job/35764724907- `nightly-gauntlet`:https://github.com/coder/coder/actions/runs/12825539092---------Signed-off-by: Danny Kopping <danny@coder.com>Co-authored-by: Muhammad Atif Ali <atif@coder.com>
Uh oh!
There was an error while loading.Please reload this page.
This PR is the second in a series aimed at closing#15109.
Changes
scripts/embedded-pg/main.go
, which can start a native Postgres database. This is used to set up PG on Windows and macOS, as these platforms don't support Docker in Github Actions.test-go-pg
job on macOS and Windows tootest-go-race-go
job, which runs race tests with Postgres on Linux