chore: don't cache errors in file cache #18555


Open

aslilac wants to merge 10 commits into main from lilac/dont-cache-errors

Conversation

aslilac (Member) commented Jun 24, 2025 (edited):

By design, concurrent calls to Acquire in the file cache all share a single database fetch, so that everyone can share in the success of whoever asked for the file first. That's kind of what caches do!

But one problem with the current implementation is that errors are also shared. This is mostly fine, because once all of the references are dropped, the cache entry will be freed, and the next Acquire will trigger a new fetch. However, if enough people are trying to load the same file at once, you could imagine how they might keep retrying and the reference count never quite hits zero.

To combat this, just immediately and forcibly remove errors from the cache, even if they still have references. Whoever is the first to retry afterwards will trigger a new fetch (like we want), which can then again be shared by others who retry.


Relatedly, one opportunity we have to reduce the potential for errors is to use context.Background() for the database fetch, so that a canceled request context cannot disrupt others who may be waiting for the file. We can then manually check the context outside of the Load, just like we already do with authorization.

aslilac requested a review from @Emyrk on Jun 24, 2025 22:50
aslilac marked this pull request as ready for review on Jun 24, 2025 22:50
Comment on lines 161 to 163:

```go
// Check if the caller's context was canceled
if err := ctx.Err(); err != nil {
	return nil, err
```
Emyrk (Member):

Why do we need to check this? It will fail the Authorize if it is cancelled.

And if we do this check, do we need to close?

Suggested change:

```diff
 // Check if the caller's context was canceled
 if err := ctx.Err(); err != nil {
+	e.close()
 	return nil, err
```

aslilac (Member, Author):

I just thought it felt nice to check explicitly 🤷‍♀️

Emyrk (Member):

Can we defer the context handling to Authorize?

It could be cancelled right after this check. It does not protect us from anything down the callstack.

aslilac (Member, Author) commented Jun 25, 2025 (edited):

btw @Emyrk, the test as you originally wrote it assumed that any second caller would refetch, regardless of timing. But we discussed loosening it a bit so that any caller after the actual errored load would refetch, which is much more timing dependent. I can't really think of a good way to definitively test this behavior, because waiting until after the first fetch errors to run the second fetch means we're also waiting until the refcount would hit zero, which would clear it regardless of error state anyway. But if we call any earlier, most of the time the second caller gets the error, rarely taking long enough to trigger a refetch.

maybe we could add some method to "leak" a reference for testing purposes to ensure that the file is refetched anyway, but I'm never a fan of adding extra complexity just to make something testable.

Emyrk (Member) commented:

> btw @Emyrk, the test as you originally wrote it assumed that any second caller would refetch, regardless of timing. But we discussed loosening it a bit so that any caller after the actual errored load would refetch, which is much more timing dependent. I can't really think of a good way to definitively test this behavior, because waiting until after the first fetch errors to run the second fetch means we're also waiting until the refcount would hit zero, which would clear it regardless of error state anyway. But if we call any earlier, most of the time the second caller gets the error, rarely taking long enough to trigger a refetch.

Yes, 100% the original test is not really relevant anymore.

> maybe we could add some method to "leak" a reference for testing purposes to ensure that the file is refetched anyway, but I'm never a fan of adding extra complexity just to make something testable.

I wonder if we can make something work with an internal test and manually calling the lock 🤔. I don't have any fancy ideas off the top of my head 😢

Comment on lines +216 to +227:

```go
close: func() {
	entry.lock.Lock()
	defer entry.lock.Unlock()

	entry.refCount--
	c.currentOpenFileReferences.Dec()
	if entry.refCount > 0 {
		return
	}

	entry.purge()
},
```
Emyrk (Member):

Locking behavior:

  • Acquire locks Cache, then Entry
  • Close locks Entry, then purge locks the Cache

I think there is a deadlock.

If you call Acquire, the cache gets locked, then the entry.
When you call close, it locks the entry, and then the cache (if ref count <= 0).

So they can be blocked on each other:

```
Goroutine A                          Goroutine B
cache.Acquire()
- cache.prepare()   locks cache
                                     entry.Close()   locks entry
                                     - ref count <= 0
                                     - purge()       waits for cache lock (blocked)
- entry.refCount++  waits for entry lock (blocked)
```

An easy fix is just to unlock the entry after the ref count change.

Suggested change:

```diff
 close: func() {
 	entry.lock.Lock()
-	defer entry.lock.Unlock()
 	entry.refCount--
+	refCount := entry.refCount
+	entry.lock.Unlock()
 	c.currentOpenFileReferences.Dec()
-	if entry.refCount > 0 {
+	if refCount > 0 {
 		return
 	}
 	entry.purge()
 },
```
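For illustration, a stripped-down sketch of the fixed ordering (all types here are hypothetical simplifications): acquire takes the cache lock then the entry lock, while release copies the ref count under the entry lock and drops it before touching the cache lock, so neither path ever holds both locks at once.

```go
package main

import (
	"fmt"
	"sync"
)

// entry and fileCache are hypothetical simplifications of the real types.
type entry struct {
	mu       sync.Mutex
	refCount int
}

type fileCache struct {
	mu   sync.Mutex
	data map[string]*entry
}

// acquire locks cache -> entry, but never holds both at the same time.
func (c *fileCache) acquire(key string) *entry {
	c.mu.Lock()
	e, ok := c.data[key]
	if !ok {
		e = &entry{}
		c.data[key] = e
	}
	c.mu.Unlock()

	e.mu.Lock()
	e.refCount++
	e.mu.Unlock()
	return e
}

// release copies the ref count under the entry lock and unlocks *before*
// purging, so the cache lock is never requested while the entry lock is held.
// This breaks the entry -> cache ordering that could deadlock against acquire.
func (c *fileCache) release(key string, e *entry) {
	e.mu.Lock()
	e.refCount--
	refCount := e.refCount
	e.mu.Unlock()

	if refCount > 0 {
		return
	}
	c.mu.Lock()
	delete(c.data, key)
	c.mu.Unlock()
}

func main() {
	c := &fileCache{data: map[string]*entry{}}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			e := c.acquire("file")
			c.release("file", e)
		}()
	}
	wg.Wait() // completes instead of deadlocking
	fmt.Println("done")
}
```

Note this sketch ignores the stale-purge subtlety discussed elsewhere in this review (a release could in principle delete a key that was since repopulated), which the real code guards with a sync.Once.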

Emyrk (Member) left a review:

Overall looking good.

There is a way to make the test queue up the Acquires, at least for the first discrete group that gets an error. The value of the test can definitely be questioned 🤷

```go
// TestCancelledFetch runs 2 Acquire calls in a queue, and ensures both return
// the same error.
func TestCancelledFetch2(t *testing.T) {
	t.Parallel()

	fileID := uuid.New()
	rdy := make(chan struct{})
	dbM := dbmock.NewMockStore(gomock.NewController(t))
	expectedErr := xerrors.New("expected error")

	// First call will fail with a custom error that all callers will return with.
	dbM.EXPECT().GetFileByID(gomock.Any(), gomock.Any()).DoAndReturn(func(mTx context.Context, fileID uuid.UUID) (database.File, error) {
		// Wait long enough for the second call to be queued up.
		<-rdy
		return database.File{}, expectedErr
	})

	//nolint:gocritic // Unit testing
	ctx := dbauthz.AsFileReader(testutil.Context(t, testutil.WaitShort))

	// Expect 2 calls to Acquire before we continue the test
	var acquiresQueued sync.WaitGroup
	acquiresQueued.Add(2)
	rawCache := files.New(prometheus.NewRegistry(), &coderdtest.FakeAuthorizer{})
	var cache files.FileAcquirer = &acquireHijack{
		cache: rawCache,
		hook: func(_ context.Context, _ database.Store, _ uuid.UUID) {
			acquiresQueued.Done()
		},
	}

	var wg sync.WaitGroup
	wg.Add(2)

	// First call that will fail
	go func() {
		_, err := cache.Acquire(ctx, dbM, fileID)
		assert.ErrorIs(t, err, expectedErr)
		wg.Done()
	}()

	// Second call, that should succeed
	go func() {
		_, err := cache.Acquire(ctx, dbM, fileID)
		assert.ErrorIs(t, err, expectedErr)
		wg.Done()
	}()

	// We need that second Acquire call to be queued up
	acquiresQueued.Wait()

	// Release the first Acquire call, which should make both calls return with the
	// expected error.
	close(rdy)

	// Wait for both go routines to assert their errors and finish.
	wg.Wait()
	require.Equal(t, 0, rawCache.Count())
}
```

Comment on lines 255 to 256:

```go
entry, ok := c.data[fileID]
if !ok {
```
Emyrk (Member):

Can this ever happen? purge is protected by a sync.Once, so an entry can only hit the delete(c.data, fileID) once.

I like the defensive code, just wondering if the comment is accurate.

Comment on lines +284 to +292
// Because many callers can be waiting on the same file fetch concurrently, we
// want to prevent any failures that would cause them all to receive errors
// because the caller who initiated the fetch would fail.
// - We always run the fetch with an uncancelable context, and then check
// context cancellation for each acquirer afterwards.
// - We always run the fetch as a system user, and then check authorization
// for each acquirer afterwards.
// This prevents a canceled context or an unauthorized user from "holding up
// the queue".
Emyrk (Member):

👍

Comment on lines +249 to +250:

```go
// purge immediately removes an entry from the cache, even if it has open references.
// It should only be called from the `close` function in a `cacheEntry`.
```
Emyrk (Member):

> It should only be called from the `close` function in a `cacheEntry`

Stale comment. We call it directly in Acquire if Load returns an error.

Comment on lines +173 to 175:

```go
if err := c.authz.Authorize(ctx, subject, policy.ActionRead, ev.Object); err != nil {
	e.close()
	return nil, err
```
Emyrk (Member):

Do not change this, but I forgot something absolutely annoying about file authorizing.

```go
// authorizeReadFile is a hotfix for the fact that file permissions are
// independent of template permissions. This function checks if the user has
// update access to any of the file's templates.
func (q *querier) authorizeUpdateFileTemplate(ctx context.Context, file database.File) error {
	tpls, err := q.db.GetFileTemplates(ctx, file.ID)
	if err != nil {
		return err
	}
	// There __should__ only be 1 template per file, but there can be more than
	// 1, so check them all.
	for _, tpl := range tpls {
		// If the user has update access to any template, they have read access to the file.
		if err := q.authorizeContext(ctx, policy.ActionUpdate, tpl); err == nil {
			return nil
		}
	}
	return NotAuthorizedError{
		Err: xerrors.Errorf("not authorized to read file %s", file.ID),
	}
}
```

I really do not want to remedy this atm lol. We use a provisioner context for fetching things from the filecache so far, which is why we avoided it.

Reviewers: @Emyrk (left review comments). At least 1 approving review is required to merge this pull request.
Assignees: @aslilac
Labels: none yet
Milestone: no milestone
Participants: @aslilac, @Emyrk
