Commit 5b72a43

dannykopping and matifali authored

chore: improve CI reliability (#16169)

We have an effort underway to replace `dbmem` (#15109), and consequently we've begun running our full test-suite (with Postgres) on all supported OSs - Windows, MacOS, and Linux, since #15520.

Since this change, we've seen a marked decrease in the success rate of our builds on `main` (note how the Windows/MacOS failures account for the vast majority of failed builds):

![image](https://github.com/user-attachments/assets/a02c15b7-037d-428a-a600-2aed60553ac0)

We're still investigating why these OSs are a lot less reliable. It's likely that the VMs on which the builds are run have different characteristics from our Ubuntu runners such as disk I/O, network latency, or something else.

**In the meantime, we need to start trusting CI failures in `main` again, as the current failures are too noisy / vague for us to correct.**

We've also considered hosting our own runners where possible so we can get OS-level observability to rule out some possibilities.

See the [meeting notes](https://www.notion.so/coderhq/CI-Investigation-Call-Notes-17dd579be59280d8897cc9fe4bb46695?pvs=6&utm_content=17dd579b-e592-80d8-897c-c9fe4bb46695&utm_campaign=T1ZPT2FL0&n=slack&n=slack_link_unfurl) where we linked into this for more detail.

This PR introduces several changes:

1. Moves the full test-suite with Postgres on Windows/MacOS to the `nightly-gauntlet` workflow
   - tradeoff: this means that any regressions may be more difficult to discover since we merge to main several times a day
2. Run only the CLI test-suite on each PR / merge to `main` on Windows/MacOS
3. `test-go` is still running the full test-suite against all OSs (including the CLI ones), but will soon be removed once #15109 is completed since it uses `dbmem`
4. Changes `nightly-gauntlet` to run at 4AM: we've seen several instances of the runner being stopped externally, and we're _guessing_ this may have something to do with the midnight UTC execution time, when other cron jobs may run
5. Removes the existing `nightly-gauntlet` jobs since they haven't passed in a long time, indicating that nobody cares enough to fix them and they don't provide diagnostic value; we can restore them later if necessary

I've manually run both these new workflows successfully:

- `ci`: https://github.com/coder/coder/actions/runs/12825874176/job/35764724907
- `nightly-gauntlet`: https://github.com/coder/coder/actions/runs/12825539092

---------

Signed-off-by: Danny Kopping <danny@coder.com>
Co-authored-by: Muhammad Atif Ali <atif@coder.com>
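
As a reading aid, the split described in the commit message can be sketched as a small lookup. This is a hypothetical helper, not part of the PR: the function name `job_for` and the trigger labels `push` and `nightly` are assumptions made for illustration, and the `test-go` job (which still runs on all OSs for now) is deliberately left out.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the PR): prints which job covers an OS
# for a given trigger, following the commit message's description.
#   $1 = matrix OS label, $2 = "push" (PR / merge to main) or "nightly"
job_for() {
  local os="$1" trigger="$2"
  case "$trigger:$os" in
    push:ubuntu-latest)
      echo "ci/test-go-pg (full suite, Postgres)" ;;
    push:macos-latest|push:windows-2022)
      echo "ci/test-cli (CLI tests only)" ;;
    nightly:macos-latest|nightly:windows-2022)
      echo "nightly-gauntlet/test-go-pg (full suite, Postgres)" ;;
    *)
      # Combination not covered by the jobs touched in this PR.
      echo "none" ;;
  esac
}

job_for macos-latest push
job_for windows-2022 nightly
```

The tradeoff noted in item 1 shows up here directly: a regression that only affects non-CLI packages on Windows or MacOS would not be caught by the `push` rows, only by the nightly run.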

1 parent 738a7f6 commit 5b72a43

File tree

3 files changed: +121 -74 lines changed

.github/workflows
- ci.yaml
- nightly-gauntlet.yaml
Makefile


`.github/workflows/ci.yaml`

Lines changed: 56 additions & 32 deletions

Original file line number	Diff line number	Diff line change
`@@ -378,8 +378,62 @@ jobs:`
`378`	`378`	`with:`
`379`	`379`	`api-key:${{ secrets.DATADOG_API_KEY }}`
`380`	`380`
	`381`	`+# We don't run the full test-suite for Windows & MacOS, so we just run the CLI tests on every PR.`
	`382`	`+# We run the test suite in test-go-pg, including CLI.`
	`383`	`+test-cli:`
	`384`	`+runs-on:${{ matrix.os == 'macos-latest' && github.repository_owner == 'coder' && 'depot-macos-latest' \|\| matrix.os == 'windows-2022' && github.repository_owner == 'coder' && 'windows-latest-16-cores' \|\| matrix.os }}`
	`385`	`+needs:changes`
	`386`	`+if:needs.changes.outputs.go == 'true' \|\| needs.changes.outputs.ci == 'true' \|\| github.ref == 'refs/heads/main'`
	`387`	`+strategy:`
	`388`	`+matrix:`
	`389`	`+os:`
	`390`	`+ -macos-latest`
	`391`	`+ -windows-2022`
	`392`	`+steps:`
	`393`	`+ -name:Harden Runner`
	`394`	`+uses:step-security/harden-runner@0080882f6c36860b6ba35c610c98ce87d4e2f26f# v2.10.2`
	`395`	`+with:`
	`396`	`+egress-policy:audit`
	`397`	`+`
	`398`	`+ -name:Checkout`
	`399`	`+uses:actions/checkout@eef61447b9ff4aafe5dcd4e0bbf5d482be7e7871# v4.2.1`
	`400`	`+with:`
	`401`	`+fetch-depth:1`
	`402`	`+`
	`403`	`+ -name:Setup Go`
	`404`	`+uses:./.github/actions/setup-go`
	`405`	`+`
	`406`	`+ -name:Setup Terraform`
	`407`	`+uses:./.github/actions/setup-tf`
	`408`	`+`
	`409`	`+# Sets up the ImDisk toolkit for Windows and creates a RAM disk on drive R:.`
	`410`	`+ -name:Setup ImDisk`
	`411`	`+if:runner.os == 'Windows'`
	`412`	`+uses:./.github/actions/setup-imdisk`
	`413`	`+`
	`414`	`+ -name:Test CLI`
	`415`	`+env:`
	`416`	`+TS_DEBUG_DISCO:"true"`
	`417`	`+LC_CTYPE:"en_US.UTF-8"`
	`418`	`+LC_ALL:"en_US.UTF-8"`
	`419`	`+shell:bash`
	`420`	`+run:\|`
	`421`	`+ # By default Go will use the number of logical CPUs, which`
	`422`	`+ # is a fine default.`
	`423`	`+ PARALLEL_FLAG=""`
	`424`	`+`
	`425`	`+ make test-cli`
	`426`	`+`
	`427`	`+ -name:Upload test stats to Datadog`
	`428`	`+timeout-minutes:1`
	`429`	`+continue-on-error:true`
	`430`	`+uses:./.github/actions/upload-datadog`
	`431`	`+if:success() \|\| failure()`
	`432`	`+with:`
	`433`	`+api-key:${{ secrets.DATADOG_API_KEY }}`
	`434`	`+`
`381`	`435`	`test-go-pg:`
`382`		`-runs-on:${{ matrix.os == 'ubuntu-latest' && github.repository_owner == 'coder' && 'depot-ubuntu-22.04-4' \|\| matrix.os== 'macos-latest' && github.repository_owner == 'coder' && 'depot-macos-latest' \|\| matrix.os == 'windows-2022' && github.repository_owner == 'coder' && 'windows-latest-16-cores' \|\| matrix.os}}`
	`436`	`+runs-on:${{ matrix.os == 'ubuntu-latest' && github.repository_owner == 'coder' && 'depot-ubuntu-22.04-4' \|\| matrix.os }}`
`383`	`437`	`needs:changes`
`384`	`438`	`if:needs.changes.outputs.go == 'true' \|\| needs.changes.outputs.ci == 'true' \|\| github.ref == 'refs/heads/main'`
`385`	`439`	# This timeout must be greater than the timeout set by `go test` in
`@@ -391,8 +445,6 @@ jobs:`
`391`	`445`	`matrix:`
`392`	`446`	`os:`
`393`	`447`	`-ubuntu-latest`
`394`		`- -macos-latest`
`395`		`- -windows-2022`
`396`	`448`	`steps:`
`397`	`449`	`-name:Harden Runner`
`398`	`450`	`uses:step-security/harden-runner@0080882f6c36860b6ba35c610c98ce87d4e2f26f# v2.10.2`
`@@ -423,39 +475,11 @@ jobs:`
`423`	`475`	`LC_ALL:"en_US.UTF-8"`
`424`	`476`	`shell:bash`
`425`	`477`	`run:\|`
`426`		`- # if macOS, install google-chrome for scaletests`
`427`		`- # As another concern, should we really have this kind of external dependency`
`428`		`- # requirement on standard CI?`
`429`		`- if [ "${{ matrix.os }}" == "macos-latest" ]; then`
`430`		`- brew install google-chrome`
`431`		`- fi`
`432`		`-`
`433`	`478`	`# By default Go will use the number of logical CPUs, which`
`434`	`479`	`# is a fine default.`
`435`	`480`	`PARALLEL_FLAG=""`
`436`	`481`
`437`		`- # macOS will output "The default interactive shell is now zsh"`
`438`		`- # intermittently in CI...`
`439`		`- if [ "${{ matrix.os }}" == "macos-latest" ]; then`
`440`		`- touch ~/.bash_profile && echo "export BASH_SILENCE_DEPRECATION_WARNING=1" >> ~/.bash_profile`
`441`		`- fi`
`442`		`-`
`443`		`- if [ "${{ runner.os }}" == "Linux" ]; then`
`444`		`- make test-postgres`
`445`		`- elif [ "${{ runner.os }}" == "Windows" ]; then`
`446`		`- # Create a temp dir on the R: ramdisk drive for Windows. The default`
`447`		`- # C: drive is extremely slow: https://github.com/actions/runner-images/issues/8755`
`448`		`- mkdir -p "R:/temp/embedded-pg"`
`449`		`- go run scripts/embedded-pg/main.go -path "R:/temp/embedded-pg"`
`450`		`- # Reduce test parallelism, mirroring what we do for race tests.`
`451`		`- # We'd been encountering issues with timing related flakes, and`
`452`		`- # this seems to help.`
`453`		`- DB=ci gotestsum --format standard-quiet -- -v -short -count=1 -parallel 4 -p 4 ./...`
`454`		`- else`
`455`		`- go run scripts/embedded-pg/main.go`
`456`		`- # Reduce test parallelism, like for Windows above.`
`457`		`- DB=ci gotestsum --format standard-quiet -- -v -short -count=1 -parallel 4 -p 4 ./...`
`458`		`- fi`
	`482`	`+ make test-postgres`
`459`	`483`
`460`	`484`	`-name:Upload test stats to Datadog`
`461`	`485`	`timeout-minutes:1`
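
The `runs-on` expression in `test-cli` chains `&&` and `||` to pick a runner label. A minimal sketch of how that chain appears to resolve, written as a plain shell function: the name `select_runner` is hypothetical, and the second argument stands in for `github.repository_owner`.

```shell
#!/usr/bin/env bash
# Hypothetical illustration (not in the PR) of the `runs-on` expression:
#   matrix.os == 'macos-latest' && owner == 'coder' && 'depot-macos-latest'
#   || matrix.os == 'windows-2022' && owner == 'coder' && 'windows-latest-16-cores'
#   || matrix.os
#   $1 = matrix.os, $2 = repository owner
select_runner() {
  local os="$1" owner="$2"
  if [ "$os" = "macos-latest" ] && [ "$owner" = "coder" ]; then
    echo "depot-macos-latest"
  elif [ "$os" = "windows-2022" ] && [ "$owner" = "coder" ]; then
    echo "windows-latest-16-cores"
  else
    # Forks, and any OS without a dedicated runner, fall back to the matrix label.
    echo "$os"
  fi
}

select_runner macos-latest coder
select_runner windows-2022 some-fork
```

Under this reading, forks never request the larger runners and simply use the label from the matrix.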

`.github/workflows/nightly-gauntlet.yaml`

Lines changed: 61 additions & 42 deletions

Original file line number	Diff line number	Diff line change
`@@ -3,22 +3,27 @@`
`3`	`3`	`name:nightly-gauntlet`
`4`	`4`	`on:`
`5`	`5`	`schedule:`
`6`		`-# Every day at midnight`
`7`		`- -cron:"0 0 * * *"`
	`6`	`+# Every day at 4AM`
	`7`	`+ -cron:"0 4 * * 1-5"`
`8`	`8`	`workflow_dispatch:`
`9`	`9`
`10`	`10`	`permissions:`
`11`	`11`	`contents:read`
`12`	`12`
`13`	`13`	`jobs:`
`14`		`-go-race:`
`15`		`-# While GitHub's toaster runners are likelier to flake, we want consistency`
`16`		`-# between this environment and the regular test environment for DataDog`
`17`		`-# statistics and to only show real workflow threats.`
`18`		`-runs-on:${{ github.repository_owner == 'coder' && 'depot-ubuntu-22.04-8' \|\| 'ubuntu-latest' }}`
`19`		`-# This runner costs 0.016 USD per minute,`
`20`		`-# so 0.016 * 240 = 3.84 USD per run.`
`21`		`-timeout-minutes:240`
	`14`	`+test-go-pg:`
	`15`	`+runs-on:${{ matrix.os == 'macos-latest' && github.repository_owner == 'coder' && 'depot-macos-latest' \|\| matrix.os == 'windows-2022' && github.repository_owner == 'coder' && 'windows-latest-16-cores' \|\| matrix.os }}`
	`16`	`+if:github.ref == 'refs/heads/main'`
	`17`	+# This timeout must be greater than the timeout set by `go test` in
	`18`	+# `make test-postgres` to ensure we receive a trace of running
	`19`	`+# goroutines. Setting this to the timeout +5m should work quite well`
	`20`	`+# even if some of the preceding steps are slow.`
	`21`	`+timeout-minutes:25`
	`22`	`+strategy:`
	`23`	`+matrix:`
	`24`	`+os:`
	`25`	`+ -macos-latest`
	`26`	`+ -windows-2022`
`22`	`27`	`steps:`
`23`	`28`	`-name:Harden Runner`
`24`	`29`	`uses:step-security/harden-runner@0080882f6c36860b6ba35c610c98ce87d4e2f26f# v2.10.2`
`@@ -27,58 +32,72 @@ jobs:`
`27`	`32`
`28`	`33`	`-name:Checkout`
`29`	`34`	`uses:actions/checkout@eef61447b9ff4aafe5dcd4e0bbf5d482be7e7871# v4.2.1`
	`35`	`+with:`
	`36`	`+fetch-depth:1`
`30`	`37`
`31`	`38`	`-name:Setup Go`
`32`	`39`	`uses:./.github/actions/setup-go`
`33`	`40`
`34`	`41`	`-name:Setup Terraform`
`35`	`42`	`uses:./.github/actions/setup-tf`
`36`	`43`
`37`		`- -name:Run Tests`
`38`		`-run:\|`
`39`		`- # -race is likeliest to catch flaky tests`
`40`		`- # due to correctness detection and its performance`
`41`		`- # impact.`
`42`		`- gotestsum --junitfile="gotests.xml" -- -timeout=240m -count=10 -race ./...`
	`44`	`+# Sets up the ImDisk toolkit for Windows and creates a RAM disk on drive R:.`
	`45`	`+ -name:Setup ImDisk`
	`46`	`+if:runner.os == 'Windows'`
	`47`	`+uses:./.github/actions/setup-imdisk`
`43`	`48`
`44`		`- -name:Upload test results to DataDog`
`45`		`-uses:./.github/actions/upload-datadog`
`46`		`-if:always()`
`47`		`-with:`
`48`		`-api-key:${{ secrets.DATADOG_API_KEY }}`
	`49`	`+ -name:Test with PostgreSQL Database`
	`50`	`+env:`
	`51`	`+POSTGRES_VERSION:"13"`
	`52`	`+TS_DEBUG_DISCO:"true"`
	`53`	`+LC_CTYPE:"en_US.UTF-8"`
	`54`	`+LC_ALL:"en_US.UTF-8"`
	`55`	`+shell:bash`
	`56`	`+run:\|`
	`57`	`+ # if macOS, install google-chrome for scaletests`
	`58`	`+ # As another concern, should we really have this kind of external dependency`
	`59`	`+ # requirement on standard CI?`
	`60`	`+ if [ "${{ matrix.os }}" == "macos-latest" ]; then`
	`61`	`+ brew install google-chrome`
	`62`	`+ fi`
`49`	`63`
`50`		`-go-timing:`
`51`		`-# We run these tests with p=1 so we don't need a lot of compute.`
`52`		`-runs-on:${{ github.repository_owner == 'coder' && 'depot-ubuntu-22.04' \|\| 'ubuntu-latest' }}`
`53`		`-timeout-minutes:10`
`54`		`-steps:`
`55`		`- -name:Harden Runner`
`56`		`-uses:step-security/harden-runner@0080882f6c36860b6ba35c610c98ce87d4e2f26f# v2.10.2`
`57`		`-with:`
`58`		`-egress-policy:audit`
	`64`	`+ # By default Go will use the number of logical CPUs, which`
	`65`	`+ # is a fine default.`
	`66`	`+ PARALLEL_FLAG=""`
`59`	`67`
`60`		`- -name:Checkout`
`61`		`-uses:actions/checkout@eef61447b9ff4aafe5dcd4e0bbf5d482be7e7871# v4.2.1`
	`68`	`+ # macOS will output "The default interactive shell is now zsh"`
	`69`	`+ # intermittently in CI...`
	`70`	`+ if [ "${{ matrix.os }}" == "macos-latest" ]; then`
	`71`	`+ touch ~/.bash_profile && echo "export BASH_SILENCE_DEPRECATION_WARNING=1" >> ~/.bash_profile`
	`72`	`+ fi`
`62`	`73`
`63`		`- -name:Setup Go`
`64`		`-uses:./.github/actions/setup-go`
	`74`	`+ if [ "${{ runner.os }}" == "Windows" ]; then`
	`75`	`+ # Create a temp dir on the R: ramdisk drive for Windows. The default`
	`76`	`+ # C: drive is extremely slow: https://github.com/actions/runner-images/issues/8755`
	`77`	`+ mkdir -p "R:/temp/embedded-pg"`
	`78`	`+ go run scripts/embedded-pg/main.go -path "R:/temp/embedded-pg"`
	`79`	`+ else`
	`80`	`+ go run scripts/embedded-pg/main.go`
	`81`	`+ fi`
`65`	`82`
`66`		`- -name:Run Tests`
`67`		`-run:\|`
`68`		`- gotestsum --junitfile="gotests.xml" -- --tags="timing" -p=1 -run='_Timing/' ./...`
	`83`	`+ # Reduce test parallelism, mirroring what we do for race tests.`
	`84`	`+ # We'd been encountering issues with timing related flakes, and`
	`85`	`+ # this seems to help.`
	`86`	`+ DB=ci gotestsum --format standard-quiet -- -v -short -count=1 -parallel 4 -p 4 ./...`
`69`	`87`
`70`		`- -name:Upload test results to DataDog`
	`88`	`+ -name:Upload test stats to Datadog`
	`89`	`+timeout-minutes:1`
	`90`	`+continue-on-error:true`
`71`	`91`	`uses:./.github/actions/upload-datadog`
`72`		`-if:always()`
	`92`	`+if:success() \|\| failure()`
`73`	`93`	`with:`
`74`	`94`	`api-key:${{ secrets.DATADOG_API_KEY }}`
`75`	`95`
`76`	`96`	`notify-slack-on-failure:`
`77`	`97`	`needs:`
`78`		`- -go-race`
`79`		`- -go-timing`
	`98`	`+ -test-go-pg`
`80`	`99`	`runs-on:ubuntu-latest`
`81`		`-if:failure()`
	`100`	`+if:failure() && github.ref == 'refs/heads/main'`
`82`	`101`
`83`	`102`	`steps:`
`84`	`103`	`-name:Send Slack notification`
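
To make the schedule change easier to read, the sketch below labels the five cron fields. It assumes the new expression is `0 4 * * 1-5` (the spacing in the rendered diff is damaged, so this is a reconstruction); `describe_cron` is a hypothetical helper, not something in the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the PR): label the five fields of a cron
# expression. `read` splits on whitespace and does not expand the `*`s.
describe_cron() {
  local minute hour dom month dow
  read -r minute hour dom month dow <<< "$1"
  printf 'minute=%s hour=%s day-of-month=%s month=%s day-of-week=%s\n' \
    "$minute" "$hour" "$dom" "$month" "$dow"
}

describe_cron "0 0 * * *"    # old schedule: midnight, every day
describe_cron "0 4 * * 1-5"  # assumed new schedule: 04:00, day-of-week 1-5
```

GitHub Actions evaluates schedules in UTC, and a day-of-week range of `1-5` means Monday through Friday, so under this reading the job runs at 04:00 UTC on weekdays rather than strictly every day as the YAML comment says.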

`Makefile`

Lines changed: 4 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -807,6 +807,10 @@ test:`
`807`	`807`	`$(GIT_FLAGS) gotestsum --format standard-quiet -- -v -short -count=1 ./...`
`808`	`808`	`.PHONY: test`
`809`	`809`
	`810`	`+test-cli:`
	`811`	`+$(GIT_FLAGS) gotestsum --format standard-quiet -- -v -short -count=1 ./cli/...`
	`812`	`+.PHONY: test-cli`
	`813`	`+`
`810`	`814`	`# sqlc-cloud-is-setup will fail if no SQLc auth token is set. Use this as a`
`811`	`815`	`# dependency for any sqlc-cloud related targets.`
`812`	`816`	`sqlc-cloud-is-setup:`
