chore: abstract pg test logic and double runner sizes #21091


Merged

dannykopping merged 11 commits into main from dk/fat-ci-bois on Dec 11, 2025

Conversation

@dannykopping (Contributor) commented Dec 4, 2025 (edited):

This PR does two things, both in service of helping to (hopefully!) speed up CI:

  1. abstracts the parallelism logic into a common action and has all PG-related jobs use it
  2. doubles runner sizes from 8->16 CPUs & 32->64 GiB RAM* and concomitantly increases parallelism

I only focused on the PG-related jobs since they are generally slowest & most RAM-intensive.


*test-go-race-pg doubles from 16->32 CPUs & 64->128 GiB RAM, and likewise for the Windows runners; macOS runners have only one size

NOTE: don't treat the speed of the PG-related jobs in this PR's CI run as indicative. Test runs outside main may use the cache, so the run times may seem artificially low.

dannykopping changed the title from "chore: abstract pg test logic, double runner sizes" to "chore: abstract pg test logic and double runner sizes" on Dec 4, 2025
dannykopping marked this pull request as ready for review on December 4, 2025 09:40
@dannykopping (Contributor, Author) commented:

Hmm, I'm thinking the increased parallelism might be causing these two newly created flakes:

coder/internal#1174
coder/internal#1173

There might be some contention / starvation occurring. Looking into it...

mafredri added a commit that referenced this pull request on Dec 5, 2025:

It's common to create a context early in a test body, then do setup work unrelated to that context. By the time the context is actually used, it may have already timed out. This was detected as test failures in #21091.

The new Context() function returns a context that resets its timeout when accessed from new lines in the test file. The timeout does not begin until the context is first used (lazy initialization).

This is useful for integration tests that pass contexts through many subsystems, where each subsystem should get a fresh timeout window.

Key behaviors:
- Timer starts on first Done(), Deadline(), or Err() call
- Value() does not trigger initialization (used for tracing/logging)
- Each unique line in a _test.go file gets a fresh timeout window
- Same-line access (e.g., in loops) does not reset
- Expired contexts cannot be resurrected

Limitations:
- Wrapping with a child context (e.g., context.WithCancel) prevents resets since the child's methods don't call through to the parent
- Storing the Done() channel prevents resets on subsequent accesses

The original fixed-timeout behavior is available via ContextFixed().
@dannykopping (Contributor, Author) commented:

Haven't identified the source of the contention yet, but at least #21121 will prevent these tests from flaking.

Signed-off-by: Danny Kopping <danny@coder.com>
mafredri added commits that referenced this pull request on Dec 8, 2025 (same commit message as above).
postgres-version: "13"
# Our macOS runners have 8 cores.
test-parallelism-packages: "8"
test-parallelism-tests: "16"
Contributor:

Linux has 16 cores and 16x8 parallelism, but macOS has 8 cores and 8x16 parallelism. That seems wrong, since in both cases you can have 128 tests running concurrently.

Contributor (Author):

We can't scale macOS any further, and for Linux I just naïvely doubled the package parallelism since we now have double the CPUs.

It was like this before:

          elif [ "${RUNNER_OS}" == "macOS" ]; then
            # Our macOS runners have 8 cores. We set NUM_PARALLEL_TESTS to 16
            # because the tests complete faster and Postgres doesn't choke. It seems
            # that macOS's tmpfs is faster than the one on Windows.
            export TEST_NUM_PARALLEL_PACKAGES=8
            export TEST_NUM_PARALLEL_TESTS=16
            # Only the CLI and Agent are officially supported on macOS and the rest are too flaky
            export TEST_PACKAGES="./cli/... ./enterprise/cli/... ./agent/..."
          elif [ "${RUNNER_OS}" == "Linux" ]; then
            # Our Linux runners have 8 cores.
            export TEST_NUM_PARALLEL_PACKAGES=8
            export TEST_NUM_PARALLEL_TESTS=8
          fi

Are you suggesting I bump test-parallelism-tests to 16 for Linux as well, i.e. 256 parallelism?
That would be quadruple what we had before (8*8); I was attempting to keep the resources-to-parallelism scaling linear.

Contributor:

If anything I'd cut the macOS parallelism. If you leave it as is, then maybe add a comment explaining that the numbers were determined empirically as values where things don't break horribly.

Contributor (Author):

https://github.com/coder/coder/blob/main/.github/workflows/ci.yaml#L467-L472

I haven't changed the parallelism (sorry, it's hard to track changes because of the reorganisation); if you're aware that it's the same, why do you want to cut parallelism?

[attached image]

According to this, it seems kinda OK?

@spikecurtis (Contributor) commented Dec 11, 2025 (edited):

Ideally we'd have some theoretical consistency: a model of parallelism that maps to CPU cores.

Cutting macOS parallelism would align with that consistent model and be easier to reason about. I guess upping Linux parallelism would also be consistent, but I'm gun-shy about increasing things and potentially causing more flakes.

In the absence of consistency, we can just document what we've observed and 🤷

Contributor (Author):

Cool, documented in 983515f 👍

> Ideally we'd have some theoretical consistency: a model of parallelism that maps to CPU cores.

Let's link up next week to try to reason through this and develop a heuristic that will set the parallelism automatically and consistently across all these different platforms & jobs?

@mafredri (Member) left a comment:

Looking forward to seeing the speed improvements, nice work! I'm also interested in experimenting with PostgreSQL settings on Windows and macOS; perhaps we can eliminate the ramdisk entirely, as it should only be making things slower given a well-configured PG.

if: runner.os == 'macOS'
shell: bash
run: |
  # Postgres runs faster on a ramdisk on macOS.
Member:

Have we verified this recently? I'd especially be interested in adjusting PostgreSQL settings to see if we can alleviate it rather than using a ramdisk. We simply need to increase RAM retention for PG on macOS, and it should be more efficient than placing both storage and cache in RAM.

Guessing this applies to Windows as well.

Contributor (Author):

I'm not changing anything about this right now. We can follow this PR with some PG changes.

…ments
Signed-off-by: Danny Kopping <danny@coder.com>

… tests on main
Signed-off-by: Danny Kopping <danny@coder.com>
dannykopping enabled auto-merge (squash) on December 11, 2025 09:58
dannykopping merged commit 84b7a03 into main on Dec 11, 2025
32 checks passed
dannykopping deleted the dk/fat-ci-bois branch on December 11, 2025 10:12
github-actions bot locked and limited conversation to collaborators on Dec 11, 2025

Reviewers

@spikecurtis approved these changes

@Emyrk left review comments

@jdomeracki-coder: awaiting requested review (code owner)

@mafredri: awaiting requested review

Assignees

@dannykopping

Labels

None yet

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

5 participants

@dannykopping, @mafredri, @spikecurtis, @Emyrk
