chore: abstract pg test logic and double runner sizes #21091


Merged

dannykopping merged 11 commits into main from dk/fat-ci-bois on Dec 11, 2025

Conversation

@dannykopping (Contributor) commented Dec 4, 2025 (edited):

This PR does two things, both in service of helping to (hopefully!) speed up CI:

  1. abstracts the parallelism logic into a common action and has all PG-related jobs use it
  2. doubles runner sizes from 8->16 CPUs & 32->64 GiB RAM* and concomitantly increases parallelism

I only focused on the PG-related jobs since they are generally slowest & most RAM-intensive.


*test-go-race-pg doubles from 16->32 CPUs & 64->128 GiB RAM, and likewise for the Windows runners; macOS runners have only one size

NOTE: don't treat the speed of the PG-related jobs in this PR's CI run as indicative. Test runs outside main may use the cache, so the run times may seem artificially low.

dannykopping changed the title from "chore: abstract pg test logic, double runner sizes" to "chore: abstract pg test logic and double runner sizes" on Dec 4, 2025
dannykopping marked this pull request as ready for review on December 4, 2025 09:40
@dannykopping (Contributor, Author) commented:

Hmm, I'm thinking the increased parallelism might be causing these two newly created flakes:

coder/internal#1174
coder/internal#1173

There might be some contention / starvation occurring. Looking into it...

mafredri added a commit that referenced this pull request on Dec 5, 2025:

It's common to create a context early in a test body, then do setup work unrelated to that context. By the time the context is actually used, it may have already timed out. This was detected as test failures in #21091.

The new Context() function returns a context that resets its timeout when accessed from new lines in the test file. The timeout does not begin until the context is first used (lazy initialization).

This is useful for integration tests that pass contexts through many subsystems, where each subsystem should get a fresh timeout window.

Key behaviors:
- Timer starts on first Done(), Deadline(), or Err() call
- Value() does not trigger initialization (used for tracing/logging)
- Each unique line in a _test.go file gets a fresh timeout window
- Same-line access (e.g., in loops) does not reset
- Expired contexts cannot be resurrected

Limitations:
- Wrapping with a child context (e.g., context.WithCancel) prevents resets since the child's methods don't call through to the parent
- Storing the Done() channel prevents resets on subsequent accesses

The original fixed-timeout behavior is available via ContextFixed().
@dannykopping (Contributor, Author) commented:

Haven't identified the source of the contention yet, but at least #21121 will prevent these tests from flaking.

Signed-off-by: Danny Kopping <danny@coder.com>
mafredri added commits that referenced this pull request on Dec 8, 2025 (same commit message as above).
postgres-version: "13"
# Our macOS runners have 8 cores.
test-parallelism-packages: "8"
test-parallelism-tests: "16"
Contributor:

Linux has 16 cores and 16x8 parallelism, but macOS has 8 cores and 8x16 parallelism. That seems wrong, since in both cases you can have 128 tests running concurrently.

Contributor (Author):

We can't scale macOS any further, and for Linux I just naïvely doubled the package parallelism since we now have double the CPUs.

It was like this before:

          elif [ "${RUNNER_OS}" == "macOS" ]; then
            # Our macOS runners have 8 cores. We set NUM_PARALLEL_TESTS to 16
            # because the tests complete faster and Postgres doesn't choke. It seems
            # that macOS's tmpfs is faster than the one on Windows.
            export TEST_NUM_PARALLEL_PACKAGES=8
            export TEST_NUM_PARALLEL_TESTS=16
            # Only the CLI and Agent are officially supported on macOS and the rest are too flaky
            export TEST_PACKAGES="./cli/... ./enterprise/cli/... ./agent/..."
          elif [ "${RUNNER_OS}" == "Linux" ]; then
            # Our Linux runners have 8 cores.
            export TEST_NUM_PARALLEL_PACKAGES=8
            export TEST_NUM_PARALLEL_TESTS=8
          fi

Are you suggesting I bump test-parallelism-tests to 16 for Linux as well, i.e. 256 parallelism?
That would be quadruple what we had before (8*8); I was attempting to keep the resources-to-parallelism scaling linear.

Contributor:

If anything I'd cut the macOS parallelism. If you leave it as is, then maybe add a comment explaining that the numbers were determined empirically as values where things don't break horribly.

Contributor (Author):

https://github.com/coder/coder/blob/main/.github/workflows/ci.yaml#L467-L472

I haven't changed the parallelism (sorry, it's hard to track changes because of the reorganisation); if you're aware that it's the same, why do you want to cut parallelism?

[attached image]

According to this, it seems kinda OK?

@spikecurtis (Contributor) commented Dec 11, 2025 (edited):

Ideally we'd have some theoretical consistency: a model of parallelism that maps to CPU cores.

Cutting macOS parallelism would align with that consistent model and be easier to reason about. I guess upping Linux parallelism would also be consistent, but I'm gun-shy about increasing things and potentially causing more flakes.

In the absence of consistency, we can just document what we've observed and 🤷

Contributor (Author):

Cool, documented in 983515f 👍

> Ideally we'd have some theoretical consistency: a model of parallelism that maps to CPU cores.

Let's link up next week to try to reason through this and develop a heuristic that will set the parallelism automatically and consistently across all these different platforms & jobs?

@mafredri (Member) left a comment:

Looking forward to seeing the speed improvements, nice work! I'm also interested in experimenting with PostgreSQL settings on Windows and macOS; perhaps we can eliminate the ramdisk entirely, as it should only be making things slower given a well-configured PG.

if: runner.os == 'macOS'
shell: bash
run: |
  # Postgres runs faster on a ramdisk on macOS.
Member:

Have we verified this recently? I'd especially be interested in adjusting PostgreSQL settings to see if we can alleviate it rather than using a ramdisk. We simply need to increase RAM retention for PG on macOS, and it should be more efficient than placing both storage and cache in RAM.

Guessing this applies to Windows as well.

Contributor (Author):

I'm not changing anything about this right now. We can follow this PR with some PG changes.

…ments
Signed-off-by: Danny Kopping <danny@coder.com>

… tests on main
Signed-off-by: Danny Kopping <danny@coder.com>
dannykopping enabled auto-merge (squash) on December 11, 2025 09:58
dannykopping merged commit 84b7a03 into main on Dec 11, 2025
32 checks passed
dannykopping deleted the dk/fat-ci-bois branch on December 11, 2025 10:12
github-actions bot locked and limited conversation to collaborators on Dec 11, 2025

Reviewers

@spikecurtis approved these changes

@Emyrk left review comments

@jdomeracki-coder: awaiting requested review (code owner)

@mafredri: awaiting requested review

Assignees

@dannykopping

Labels

None yet

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

5 participants

@dannykopping, @mafredri, @spikecurtis, @Emyrk
