chore: don't cache errors in file cache #18555

Open

aslilac wants to merge 12 commits into main from lilac/dont-cache-errors

Conversation

@aslilac (Member) commented Jun 24, 2025 (edited)
By design, concurrent calls to Acquire in the file cache all share a single database fetch, so that everyone can share in the success of whoever asked for the file first. That's kind of what caches do!

One problem with the current implementation is that errors are also shared. This is mostly fine, because once all of the references are dropped, the cache entry will be freed, and the next Acquire will trigger a new fetch. However, if enough people are trying to load the same file at once, you could imagine how they might keep retrying and the reference count never quite hits zero.

To combat this, just immediately and forcibly remove errors from the cache, even if they still have references. Whoever is the first to retry afterwards will trigger a new fetch (like we want), which can then again be shared by others who retry.


Relatedly, one opportunity to reduce the potential for errors is to use context.Background() for the database fetch, so that a canceled request context cannot disrupt others who may be waiting for the file. We can then manually check the caller's context outside of the Load, just like we already do with authorization.
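
For context, here is a minimal, self-contained sketch of that flow. It is not the implementation in this PR; the names (Cache, entry, release, fetch) are invented for illustration.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// entry is a shared cache slot: concurrent Acquire calls for the same key all
// wait on one fetch via sync.Once.
type entry struct {
	once  sync.Once
	value []byte
	err   error
	refs  int
}

type Cache struct {
	mu    sync.Mutex
	data  map[string]*entry
	fetch func(ctx context.Context, key string) ([]byte, error)
}

func New(fetch func(ctx context.Context, key string) ([]byte, error)) *Cache {
	return &Cache{data: map[string]*entry{}, fetch: fetch}
}

func (c *Cache) Acquire(ctx context.Context, key string) ([]byte, func(), error) {
	c.mu.Lock()
	e, ok := c.data[key]
	if !ok {
		e = &entry{}
		c.data[key] = e
	}
	e.refs++
	c.mu.Unlock()

	// Use a detached context so one caller's cancellation can't poison the
	// shared fetch for everyone else waiting on it.
	e.once.Do(func() {
		e.value, e.err = c.fetch(context.Background(), key)
	})

	if e.err != nil {
		// Don't cache the error: evict immediately, even though references
		// may still be held, so the next Acquire triggers a fresh fetch.
		c.mu.Lock()
		if c.data[key] == e {
			delete(c.data, key)
		}
		c.mu.Unlock()
		c.release(key, e)
		return nil, nil, e.err
	}

	// Check the caller's own context explicitly, outside of the shared fetch.
	if err := ctx.Err(); err != nil {
		c.release(key, e)
		return nil, nil, err
	}

	return e.value, func() { c.release(key, e) }, nil
}

func (c *Cache) release(key string, e *entry) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e.refs--
	if e.refs <= 0 && c.data[key] == e {
		delete(c.data, key)
	}
}

func main() {
	c := New(func(_ context.Context, key string) ([]byte, error) {
		return []byte("contents of " + key), nil
	})
	data, release, err := c.Acquire(context.Background(), "file-1")
	if err == nil {
		defer release()
	}
	fmt.Println(string(data), err)
}
```

The important details are that the shared fetch runs on context.Background(), the caller's own context is checked separately afterwards, and a failed load is evicted immediately instead of lingering until its reference count reaches zero.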

@aslilac requested a review from @Emyrk June 24, 2025 22:50
@aslilac marked this pull request as ready for review June 24, 2025 22:50
Comment on lines 161 to 163

```go
// Check if the caller's context was canceled
if err := ctx.Err(); err != nil {
	return nil, err
```
@Emyrk (Member)

Why do we need to check this? It will fail the Authorize if it is cancelled.

And if we do this check, do we need to close?

Suggested change

```go
// Check if the caller's context was canceled
if err := ctx.Err(); err != nil {
	return nil, err
```

```go
// Check if the caller's context was canceled
if err := ctx.Err(); err != nil {
	e.close()
	return nil, err
```

@aslilac (Member, Author)

I just thought it felt nice to check explicitly 🤷‍♀️

@Emyrk (Member)

Can we defer the context handling to Authorize?

It could be cancelled right after this check. It does not protect us from anything down the callstack.

@aslilac (Member, Author) commented Jun 27, 2025 (edited)

It depends on the implementation of Authorize. For example, this test fails against dbmock if you remove the check, but is fine when it's in place:

```go
dbM.EXPECT().GetFileByID(gomock.Any(), gomock.Any()).DoAndReturn(func(mTx context.Context, fileID uuid.UUID) (database.File, error) {
	return database.File{
		ID:   fileID,
		Data: make([]byte, 100),
	}, nil
})

//nolint:gocritic // Unit testing
cache := files.New(prometheus.NewRegistry(), &coderdtest.FakeAuthorizer{})

// Cancel the context for the first call; should fail.
ctx, cancel := context.WithCancel(dbauthz.AsFileReader(testutil.Context(t, testutil.WaitShort)))
cancel()

_, err := cache.Acquire(ctx, dbM, fileID)
assert.ErrorIs(t, err, context.Canceled)
```

Nothing about Authorize explicitly means that it has to check context cancellation. None of the implementations ever check explicitly, so if the specific implementation doesn't happen to call any blocking/cancelable code it will continue anyway. I really feel like there is value in keeping this check separate and explicit.
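
To illustrate that point, a hypothetical authorizer (not one of the real implementations) that never consults the context will happily succeed after cancellation, which is why the explicit ctx.Err() check adds value:

```go
package main

import (
	"context"
	"fmt"
)

// Hypothetical authorizer, for illustration only: it never consults ctx, so a
// canceled context doesn't make it fail, and nothing downstream notices the
// cancellation without an explicit ctx.Err() check.
type allowAll struct{}

func (allowAll) Authorize(ctx context.Context, subject, action, object string) error {
	return nil // purely in-memory decision; ctx is never inspected
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	err := allowAll{}.Authorize(ctx, "user", "read", "file")
	fmt.Println(err) // <nil>, despite the canceled context
}
```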

@aslilac (Member, Author) commented Jun 25, 2025 (edited)

btw @Emyrk, the test as you originally wrote it assumed that any second caller would refetch, regardless of timing. But we discussed loosening it a bit so that any caller after the actual errored load would refetch, which is much more timing dependent. I can't really think of a good way to definitively test this behavior, because waiting until after the first fetch errors to run the second fetch means we're also waiting until the refcount would hit zero, which would clear it regardless of error state anyway. But if we call any earlier, most of the time the second caller gets the error, rarely taking long enough to trigger a refetch.

Maybe we could add some method to "leak" a reference for testing purposes to ensure that the file is refetched anyway, but I'm never a fan of adding extra complexity just to make something testable.

@Emyrk (Member)

> btw @Emyrk, the test as you originally wrote it assumed that any second caller would refetch, regardless of timing. But we discussed loosening it a bit so that any caller after the actual errored load would refetch, which is much more timing dependent. I can't really think of a good way to definitively test this behavior, because waiting until after the first fetch errors to run the second fetch means we're also waiting until the refcount would hit zero, which would clear it regardless of error state anyway. But if we call any earlier, most of the time the second caller gets the error, rarely taking long enough to trigger a refetch.

Yes, 100% the original test is not really relevant anymore.

> Maybe we could add some method to "leak" a reference for testing purposes to ensure that the file is refetched anyway, but I'm never a fan of adding extra complexity just to make something testable.

I wonder if we can make something work with an internal test and manually calling the lock 🤔. I don't have any fancy ideas off the top of my head 😢

Comment on lines 216 to 227

```go
close: func() {
	entry.lock.Lock()
	defer entry.lock.Unlock()

	entry.refCount--
	c.currentOpenFileReferences.Dec()
	if entry.refCount > 0 {
		return
	}

	entry.purge()
},
```
@Emyrk (Member)

Locking behavior:

  • Acquire locks Cache, then Entry
  • Close locks Entry, then purge locks the cache

I think there is a deadlock.

If you call Acquire, the cache gets locked, then the entry.
When you call close, it locks the entry, and then the cache (if ref count <= 0).

So they can be blocked on each other:

| Actor               | Actor             | cache          | entry          |
|---------------------|-------------------|----------------|----------------|
| cache.Acquire()     |                   |                |                |
| - cache.prepare()   |                   | lock           |                |
|                     | entry.Close()     |                | lock           |
|                     | - ref count <= 0  |                |                |
|                     | - purge()         | lock (blocked) |                |
| - entry.refCount++  |                   |                | lock (blocked) |

An easy fix is just to unlock the entry after the ref count change.

Suggested change

```go
close: func() {
	entry.lock.Lock()
	defer entry.lock.Unlock()

	entry.refCount--
	c.currentOpenFileReferences.Dec()
	if entry.refCount > 0 {
		return
	}

	entry.purge()
},
```

```go
close: func() {
	entry.lock.Lock()
	entry.refCount--
	refCount := entry.refCount
	entry.lock.Unlock()

	c.currentOpenFileReferences.Dec()
	if refCount > 0 {
		return
	}

	entry.purge()
},
```

@aslilac (Member, Author) commented Jun 27, 2025 (edited)

Honestly, there are only two places where I ended up using the entry locks: one already holds the cache lock, and the other is likely to want to grab it. I don't think the entry locks are worth it. They mostly seem to be an opportunity for deadlocks, so I'm gonna get rid of them.

@aslilac (Member, Author)

Actually, just kidding, I can't remove it because it makes calling purge really awkward.

Also, if I unlock the entry before calling purge, someone else could Acquire it, up the refcount, and it'd get purged anyway. That's probably a less serious bug than a deadlock, but it is still really annoying.

@aslilac (Member, Author)

I'm thinking maybe we accept that small issue and turn refCount into an atomic.Int32 to avoid some of the manual locking.
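
A rough, self-contained sketch of what that could look like (invented names, not code from this PR), accepting the purge-despite-new-reference race mentioned above:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// refCount becomes an atomic.Int32 so close() doesn't need the entry lock;
// an Acquire that lands between the decrement and the purge still gets purged,
// which is the trade-off being accepted here.
type entry struct {
	refCount  atomic.Int32
	purgeOnce sync.Once
}

func (e *entry) acquire() { e.refCount.Add(1) }

func (e *entry) close(purge func()) {
	if e.refCount.Add(-1) > 0 {
		return
	}
	e.purgeOnce.Do(purge)
}

func main() {
	e := &entry{}
	e.acquire()
	e.acquire()
	e.close(func() { fmt.Println("purged") }) // refCount drops to 1, no purge
	e.close(func() { fmt.Println("purged") }) // refCount hits 0, purge runs exactly once
}
```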

@Emyrk (Member) left a comment

Overall looking good.

There is a way to make the test queue up the Acquire calls, at least for the first discrete group that gets an error. The value of the test can definitely be questioned 🤷

```go
// TestCancelledFetch runs 2 Acquire calls in a queue, and ensures both return
// the same error.
func TestCancelledFetch2(t *testing.T) {
	t.Parallel()

	fileID := uuid.New()
	rdy := make(chan struct{})
	dbM := dbmock.NewMockStore(gomock.NewController(t))
	expectedErr := xerrors.New("expected error")

	// First call will fail with a custom error that all callers will return with.
	dbM.EXPECT().GetFileByID(gomock.Any(), gomock.Any()).DoAndReturn(func(mTx context.Context, fileID uuid.UUID) (database.File, error) {
		// Wait long enough for the second call to be queued up.
		<-rdy
		return database.File{}, expectedErr
	})

	//nolint:gocritic // Unit testing
	ctx := dbauthz.AsFileReader(testutil.Context(t, testutil.WaitShort))

	// Expect 2 calls to Acquire before we continue the test
	var acquiresQueued sync.WaitGroup
	acquiresQueued.Add(2)
	rawCache := files.New(prometheus.NewRegistry(), &coderdtest.FakeAuthorizer{})
	var cache files.FileAcquirer = &acquireHijack{
		cache: rawCache,
		hook: func(_ context.Context, _ database.Store, _ uuid.UUID) {
			acquiresQueued.Done()
		},
	}

	var wg sync.WaitGroup
	wg.Add(2)

	// First call that will fail
	go func() {
		_, err := cache.Acquire(ctx, dbM, fileID)
		assert.ErrorIs(t, err, expectedErr)
		wg.Done()
	}()

	// Second call, that should succeed
	go func() {
		_, err := cache.Acquire(ctx, dbM, fileID)
		assert.ErrorIs(t, err, expectedErr)
		wg.Done()
	}()

	// We need that second Acquire call to be queued up
	acquiresQueued.Wait()

	// Release the first Acquire call, which should make both calls return with the
	// expected error.
	close(rdy)

	// Wait for both go routines to assert their errors and finish.
	wg.Wait()
	require.Equal(t, 0, rawCache.Count())
}
```

Comment on lines 255 to 256

```go
entry, ok := c.data[fileID]
if !ok {
```

@Emyrk (Member)

Can this ever happen? purge is protected by a sync.Once, so an entry can only hit the delete(c.data, fileID) once.

I like the defensive code, just wondering if the comment is accurate.

@aslilac (Member, Author) commented Jun 27, 2025 (edited)

I don't think it's possible now (god I hope it's not), but it could get messed up later, and it would cause bookkeeping bugs if it continued past here.

Comment on lines +173 to 175

```go
if err := c.authz.Authorize(ctx, subject, policy.ActionRead, ev.Object); err != nil {
	e.close()
	return nil, err
```

@Emyrk (Member)

Do not change this, but I forgot something absolutely annoying about file authorizing:

```go
// authorizeReadFile is a hotfix for the fact that file permissions are
// independent of template permissions. This function checks if the user has
// update access to any of the file's templates.
func (q *querier) authorizeUpdateFileTemplate(ctx context.Context, file database.File) error {
	tpls, err := q.db.GetFileTemplates(ctx, file.ID)
	if err != nil {
		return err
	}
	// There __should__ only be 1 template per file, but there can be more than
	// 1, so check them all.
	for _, tpl := range tpls {
		// If the user has update access to any template, they have read access to the file.
		if err := q.authorizeContext(ctx, policy.ActionUpdate, tpl); err == nil {
			return nil
		}
	}
	return NotAuthorizedError{
		Err: xerrors.Errorf("not authorized to read file %s", file.ID),
	}
}
```

I really do not want to remedy this atm lol. We use a provisioner context for fetching things from the file cache so far, which is why we avoided it.

@aslilac (Member, Author) commented Jun 27, 2025 (edited)

I'm confused here. The comment calls the function authorizeReadFile, but the actual function name is authorizeUpdateFileTemplate. What exactly is the annoying thing here?
