
chore: cache terraform providers between CI test runs #17373


Merged
hugodutka merged 4 commits into main from hugodutka/terraform-providers-ci-cache
Apr 28, 2025

Conversation

hugodutka (Contributor)

@hugodutka commented Apr 12, 2025 (edited)

Addresses coder/internal#322.

This PR starts caching Terraform providers used by TestProvision in provisioner/terraform/provision_test.go. The goal is to improve the reliability of this test by cutting down on the number of network calls to external services. It leverages the GitHub Actions cache, which on depot runners is persisted for 14 days by default.

Other than the aforementioned TestProvision, I couldn't find any other tests that depend on external Terraform providers.

@hugodutka force-pushed the hugodutka/terraform-providers-ci-cache branch 12 times, most recently from 9a5b5a1 to 8885000 on April 13, 2025 19:47
@github-actions bot added the stale label on Apr 21, 2025
@johnstcn removed the stale label on Apr 22, 2025
@hugodutka force-pushed the hugodutka/terraform-providers-ci-cache branch 11 times, most recently from 9d41243 to 2ab657c on April 22, 2025 12:05
@hugodutka force-pushed the hugodutka/terraform-providers-ci-cache branch from 2ab657c to 3a9e0b3 on April 22, 2025 12:17
@hugodutka marked this pull request as ready for review on April 22, 2025 13:13
file := templateFiles[fileName]
_, err := hasher.Write([]byte(fileName))
require.NoError(t, err)
_, err = hasher.Write([]byte(file))
Contributor:

You need to include some sort of delimiter, or dogfood: stuff will hash the same as dog: foodstuff.


for fileName, file := range templateFiles {
filePath := filepath.Join(filesDir, fileName)
if _, err := os.Stat(filePath); os.IsNotExist(err) {
Contributor:

Didn't we just check that the directory doesn't already exist? How could the file already exist if the directory doesn't?

Contributor (Author):

You're right, this check is redundant. I'll remove it.

// Each test gets a unique cache dir based on its name and template files.
// This ensures that tests can download providers in parallel and that they
// will redownload providers if the template files change.
hash := hashTemplateFilesAndTestName(t, t.Name(), templateFiles)
Contributor:

Providers are versioned, so I don't think each test or each version of each test needs its own unique mirror.

It results in multiple copies of the providers being downloaded on each system, and there doesn't seem to be any mechanism to clean up old versions of tests. The CI caches expire after 14 days, but people's personal development environments won't get cleaned up.

terraform providers mirror can be run multiple times on the same target directory and it will merge the results:

You can run terraform providers mirror again on an existing mirror directory to update it with new packages.

https://developer.hashicorp.com/terraform/cli/commands/providers/mirror

So, what do you think of doing two passes over the test cases: one serial pass that sets up the mirror with the totality of the providers we need, then a second, parallel pass that runs the tests?

Contributor (Author):

Doing a serial pass is pretty slow if you haven't cached the providers beforehand. I just tried it, and it takes 27 seconds to cache the providers serially in my Coder workspace. In contrast, doing it in parallel only takes 5 seconds.

The cache size across all tests is currently around 70 MB - probably not a big deal. That said, it could balloon if someone starts iterating on a test locally. To avoid that, maybe we should add cleanup logic instead of doing a serial pass?

We already know the paths to provider directories based on test hashes, so at the start of TestProvision, we could check which ones are needed and remove the rest.

Contributor:

Cleanup logic would also be fine here, so we don't blow up people's cache.

with:
path: ${{ inputs.cache-path }}
# The key doesn't need to include an OS name. The action already takes
# that into account: https://github.com/actions/cache/tree/5a3ec84eff668545956fd18022155c47e93e2684?tab=readme-ov-file#cache-version
Contributor:

I don't think we're entirely out of the woods here. While the different paths on Windows vs Linux might mean that those caches are separate, different architectures on the same OS don't seem like they'd have a different cache version.

Also, the paths might not be so different on macOS vs Linux, for example.

Contributor (Author):

Hmm, you're right about different architectures on the same OS; I hadn't thought about that. I'll generate unique cache key prefixes based on OS and architecture.

# job is running on the main branch.
# Without it, PRs would create one-use caches that would linger until
# expiration and we'd be charged for them. I evaluated a couple of options
# for limiting the cache to the main branch, and forking was the simplest.
Contributor:

Can't we handle this use case by using explicit restore and save steps? I.e., we restore in all cases, but only save on main?

https://github.com/actions/cache/blob/5a3ec84eff668545956fd18022155c47e93e2684/caching-strategies.md#saving-cache

Contributor (Author):

I considered it, but the GitHub Actions syntax makes it unwieldy. I wanted to package all the cache steps into a single action to avoid code duplication across the 5 workflows that need the cache. I tried doing it with restore and save, but found that I cannot add a defer-like save statement in a composite action. So I'd need to add 3 additional steps to each of the workflows: one to generate the cache keys, one restore, one save. That's a lot of code duplication; I believe the current approach is cleaner.

Contributor:

Forking the entire repo seems like a lot of code duplication! It's just better hidden, and means we have yet another fork to manage.

Contributor (Author):

That's a good point. I'll refactor it to use the more verbose restore and save approach.
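The restore/save split being described might look roughly like this (a hypothetical sketch in the workflow's YAML; the step names, ids, and inputs are illustrative, not the PR's actual composite action):

```yaml
- name: Restore Terraform providers cache
  id: tf-cache
  uses: actions/cache/restore@v4
  with:
    path: ${{ inputs.cache-path }}
    key: ${{ inputs.key-prefix }}-${{ steps.dates.outputs.year-month }}

# Only main uploads a cache; PRs get read-only access, so one-use
# PR caches never get created in the first place.
- name: Save Terraform providers cache
  if: github.ref == 'refs/heads/main'
  uses: actions/cache/save@v4
  with:
    path: ${{ inputs.cache-path }}
    key: ${{ steps.tf-cache.outputs.cache-primary-key }}
```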

# This prevents PRs from rebuilding the cache on the first day of the month.
restore-keys: |
${{ inputs.key-prefix }}-${{ steps.dates.outputs.year-month }}-
${{ github.ref != 'refs/heads/main' && format('{0}-{1}-', inputs.key-prefix, steps.dates.outputs.prev-year-month) || '' }}
Contributor:

I don't get why we need to have this special case for PRs. If we are preventing PRs from saving to the cache anyway, why do we need to restrict only PRs from restoring last month's cache if this month is not available?

Also, why are we even bothering to match on month? If today's cache isn't available, doesn't the cache action automatically match the most recent available cache based on the key-prefix alone?

Contributor (Author):

This approach lets us reset the cache completely at the start of each month. We upload an updated cache daily, and each update is based on the previous day's cache. Without the monthly cut-off on main, any file added to the cache would persist indefinitely. The monthly reset ensures that old or obsolete files do not linger in the cache forever.

Contributor:

If you implement cleanup in the test code, which I think is a good idea, then we'll only cache what we need daily, and a monthly reset won't be required.

Contributor (Author):

I'm hesitant to get rid of the monthly clean-up for two reasons:

  1. If other tests start using the cache and don't clean up properly after themselves, the cache will grow.
  2. If a Terraform provider that one of the tests depends on becomes permanently unavailable, we won't be able to detect it without a cache reset.

I don't think matching on the month complicates the code too much, so I'd be for keeping it.

@hugodutka force-pushed the hugodutka/terraform-providers-ci-cache branch 3 times, most recently from 2c226f2 to d56aaa7 on April 24, 2025 12:03
@hugodutka (Contributor, Author):

@spikecurtis I addressed your feedback:

  • started using a delimiter when calculating test hashes
  • skipped the os.Stat check
  • started cleaning up unused test caches
  • started using cache key prefixes that include the OS and the architecture
  • refactored caching to use separate restore and save steps

@hugodutka merged commit b47d54d into main on Apr 28, 2025
30 of 31 checks passed
@hugodutka deleted the hugodutka/terraform-providers-ci-cache branch on April 28, 2025 08:57
@github-actions bot locked and limited conversation to collaborators on Apr 28, 2025
Reviewers

@spikecurtis approved these changes

Assignees

@hugodutka

3 participants: @hugodutka, @spikecurtis, @johnstcn
