GH-133136: Revise QSBR to reduce excess memory held #135473

Merged
nascheme merged 10 commits into python:main from nascheme:gh-133136-qsbr-defer-process
Jun 25, 2025

Conversation

@nascheme (Member) commented on Jun 13, 2025 (edited)

This is a refinement of GH-135107. Additional changes:

  • track the size of the mimalloc pages that are deferred
  • introduce _Py_qsbr_advance_with_size() to reduce duplicated code (see the sketch below)
  • adjust the logic of when we advance the global write sequence and when we process the queue of deferred memory
  • small fix for the goal returned in the advance case: it is safe to return the new global write sequence, not the next write sequence

With these changes, the memory held by QSBR is typically freed a bit more quickly and the process RSS stays a bit smaller.
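
To make the size-aware advance concrete, here is a minimal standalone sketch of the idea. The struct, threshold, and helper below are illustrative assumptions rather than the actual CPython code, although the real _Py_qsbr_advance_with_size() plays the same role, and the diff hunk quoted later in the review passes a page's capacity * block_size to it in just this way.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified model of the shared QSBR state; the real structures live in
   Include/internal/pycore_qsbr.h and carry more fields. */
typedef struct {
    _Atomic uint64_t wr_seq;        /* global write sequence */
    _Atomic size_t deferred_bytes;  /* memory attached to deferred items */
} qsbr_shared_t;

#define QSBR_INCR         2               /* sequence numbers advance by two */
#define DEFER_BYTES_LIMIT (1024 * 1024)   /* illustrative threshold */

/* Advance the global write sequence only once enough deferred memory has
   accumulated, and return the goal that reader sequences must reach before
   the memory can safely be reclaimed. */
static uint64_t
qsbr_advance_with_size(qsbr_shared_t *shared, size_t size)
{
    size_t total = atomic_fetch_add(&shared->deferred_bytes, size) + size;
    if (total < DEFER_BYTES_LIMIT) {
        /* Not enough pending memory yet: the current sequence is the goal. */
        return atomic_load(&shared->wr_seq);
    }
    atomic_store(&shared->deferred_bytes, 0);
    /* atomic_fetch_add() returns the *old* value, so add QSBR_INCR to get
       the newly published sequence; returning the new (not the next)
       sequence is the "goal" fix mentioned in the list above. */
    return atomic_fetch_add(&shared->wr_seq, QSBR_INCR) + QSBR_INCR;
}
```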

Regarding the changes to advance and processing, GH-135107 has the following minor issues: if the memory threshold is exceeded when a new item is added by free_delayed(), we immediately set memory_deferred = 0 and process. It is very unlikely that the goal has been reached for the newly added item. If that's a big chunk of memory, we would have to wait until the next process in order to actually free it. This PR tries to avoid that by storing the seq (local read sequence) as it was at the last process time. If it hasn't changed (this thread hasn't entered a quiescent state), then we wait before processing. This at least gives other readers a chance to catch up so that processing can actually free things.

This PR also changes how often we defer the advance of the global write sequence. Previously, we deferred it up to 10 times. However, I think there is not much benefit to advancing it unless we are nearly ready to process. So, should_advance_qsbr() checks whether it seems time to process. _Py_qsbr_should_process() checks whether the local read sequence has been updated. That means the write sequence has advanced (it's time to process) and the read sequence for this thread has also advanced. This doesn't tell us that the other threads have advanced their read sequences, but we don't want to pay the cost of checking that (it would require a "poll").
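
The two checks described above can be modelled roughly as follows. The field and function names are illustrative stand-ins, not the actual CPython identifiers: processing is only attempted once this thread's own read sequence has moved past the target recorded when processing was last scheduled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified per-thread view of QSBR state. */
typedef struct {
    uint64_t rd_seq;       /* this thread's local read sequence */
    uint64_t process_seq;  /* read-sequence target for the next processing pass */
} qsbr_thread_view_t;

/* Analogous to the role of _Py_qsbr_should_process(): report "time to
   process" only once this thread's read sequence has caught up with the
   scheduled target.  Other threads may still lag, but checking them here
   would require the more expensive poll over all threads. */
static bool
qsbr_should_process(const qsbr_thread_view_t *view)
{
    return view->rd_seq >= view->process_seq;
}

/* Called when deferred memory crosses the threshold: instead of processing
   immediately, schedule the pass for after this thread's next quiescent
   state, so other readers get a chance to catch up first. */
static void
qsbr_schedule_process(qsbr_thread_view_t *view, uint64_t goal)
{
    if (goal > view->process_seq) {
        view->process_seq = goal;
    }
}
```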

pyperformance memory usage results

colesbury and others added 3 commits June 3, 2025 21:29
The free threading build uses QSBR to delay the freeing of dictionary keys and list arrays when the objects are accessed by multiple threads, in order to allow concurrent reads to proceed without holding the object lock. The requests are processed in batches to reduce execution overhead, but for large memory blocks this can lead to excess memory usage. Take into account the size of the memory block when deciding when to process QSBR requests.
Comment on lines 143 to 144
size_t bsize = mi_page_block_size(page);
page->qsbr_goal = _Py_qsbr_advance_with_size(tstate->qsbr, page->capacity*bsize);
Contributor

This might be the right heuristic, but this is a bit different from _PyMem_FreeDelayed:

  1. _PyMem_FreeDelayed holds onto the memory until quiescence. It prevents the memory from being used for any purpose.
  2. _PyMem_mi_page_maybe_free only prevents the page from being used by another thread or for a different size class. That's a lot less restrictive.

nascheme (Member, Author) replied:

Ah, good point. The memory being held (avoiding collection) by mimalloc is not at all the same as the deferred frees. I reworked the PR so that memory is tracked separately. I also decoupled the write sequence advance from the triggering of _PyMem_ProcessDelayed(), and used process_seq as a target value for the read sequence.

Now _qsbr_thread_state is larger than 64 bytes. I don't think that should be a problem.
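
A sketch of how the per-thread bookkeeping could look after that rework, with the two kinds of deferred memory counted separately. The field names are guesses for illustration; the real _qsbr_thread_state lives in Include/internal/pycore_qsbr.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative per-thread QSBR bookkeeping after the rework described above.
   Two byte counts are kept because the two kinds of deferred memory behave
   differently:
     - deferred_memory: blocks queued by _PyMem_FreeDelayed(); processing the
       queue actually frees them.
     - deferred_page_memory: mimalloc pages merely withheld from reuse by
       other threads or size classes; _PyMem_ProcessDelayed() does not free
       these, so they only justify advancing the write sequence. */
struct qsbr_thread_state_sketch {
    uint64_t seq;                 /* last read sequence written by this thread */
    uint64_t process_seq;         /* read-sequence target for next processing */
    size_t deferred_memory;       /* bytes queued via _PyMem_FreeDelayed() */
    size_t deferred_page_memory;  /* bytes held in deferred mimalloc pages */
};
```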

  • Keep a separate count of mimalloc page memory that is deferred from collection. This memory doesn't get freed by _PyMem_ProcessDelayed(). We want to advance the write sequence if there is too much of it, but calling _PyMem_ProcessDelayed() is not helpful.
  • Use the `process_seq` variable to schedule the next call to `_PyMem_ProcessDelayed()`.
  • Rename advance functions to have "deferred" in the name.
  • Move the `_Py_qsbr_should_process()` call up one level.

Since _Py_atomic_add_uint64() returns the old value, we need to add QSBR_INCR.

Refactor code to keep obmalloc logic out of the qsbr.c file. Call _PyMem_ProcessDelayed() from the eval breaker.
@nascheme (Member, Author)

After reverting the erroneous change to _Py_qsbr_advance(), the nice reductions in RSS I was seeing disappeared. After some experimentation, running _PyMem_ProcessDelayed() from the eval breaker works well. That seems to give enough time so that usually the read sequence has advanced such that deferred memory can be quickly freed.

I refactored the code to put the "should advance" logic into the obmalloc file. I think that makes more sense compared with having it in the qsbr.c file.
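
In outline, the eval-breaker hook described here amounts to a guarded call like the one below. This is a simplified model with hypothetical stand-in names; the real code uses the internal _Py_qsbr_should_process() and _PyMem_ProcessDelayed() helpers.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the internal helpers (shapes assumed for
   illustration only). */
extern bool qsbr_should_process(void *qsbr_state);
extern void mem_process_delayed(void *tstate);

/* Simplified model of the eval-breaker step added by this PR: the periodic
   check that runs between bytecode instructions also drains the delayed-free
   queue, but only when this thread's read sequence suggests the work is
   likely to pay off. */
static void
eval_breaker_hook(void *tstate, void *qsbr_state)
{
    if (qsbr_should_process(qsbr_state)) {
        mem_process_delayed(tstate);
    }
}
```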

The dict_mutate_qsbr_mem.py.txt benchmark RSS sizes, in MB:

  • Running with the "main" branch, FT build (commit 1ffe913): 312, 543, 728, 912, 1142.

  • Default build using mimalloc instead of pymalloc: 89, 90, 134, 134, 90.

  • gh-133136: Limit excess memory held by QSBR #135107: 351, 374, 393, 484, 532.

  • This PR: 205, 260, 293, 312, 288, 288.

@nascheme (Member, Author)

Updated pyperformance results:

run time

memory usage

@tom-pytel (Contributor)

I did essentially the same thing here: #132520, but got the following comment about quadratic behavior: #132520 (review)

Is that no longer an issue or does it still apply?

@colesbury (Contributor) left a comment:

Looks good to me. Two comments below.

@colesbury (Contributor)

@tom-pytel - the if statement guarding the _PyMem_ProcessDelayed call here is important. There's still potential quadratic behavior (because of the linear scan of threads), but I'm more confident in this approach.

https://github.com/python/cpython/pull/135473/files#diff-b7d806d282eab9f532468633d9090ed0a7f3215d8c6bcae04f4f8547baa39da1R1391-R1393
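
For context on the quadratic concern: establishing that a QSBR goal has been reached requires a linear scan over the attached threads' read sequences, roughly as in the simplified model below (not the actual polling code). If every deferred free triggered such a scan, N frees across T threads would cost on the order of N*T, which is why the guard above matters.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified model of a QSBR poll: a goal is reached only once every
   attached thread's read sequence has advanced to it or beyond (threads
   marked with a zero sequence are treated as detached and skipped). */
static bool
qsbr_goal_reached(const uint64_t *thread_read_seqs, size_t nthreads,
                  uint64_t goal)
{
    for (size_t i = 0; i < nthreads; i++) {
        uint64_t seq = thread_read_seqs[i];  /* linear scan of threads */
        if (seq != 0 && seq < goal) {
            return false;  /* at least one reader may still hold a pointer */
        }
    }
    return true;
}
```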


@nascheme merged commit 113de85 into python:main on Jun 25, 2025
40 checks passed
@nascheme added the performance (Performance or resource usage) and needs backport to 3.13 (bugs and security fixes) labels on Jun 25, 2025
@nascheme added the needs backport to 3.14 (bugs and security fixes) label on Jun 25, 2025
@miss-islington-app

Thanks @nascheme for the PR 🌮🎉.. I'm working now to backport this PR to: 3.13.
🐍🍒⛏🤖

@miss-islington-app

Thanks @nascheme for the PR 🌮🎉.. I'm working now to backport this PR to: 3.14.
🐍🍒⛏🤖

@miss-islington-app

Sorry, @nascheme, I could not cleanly backport this to 3.13 due to a conflict.
Please backport using cherry_picker on the command line.

cherry_picker 113de8545ffe74a4a1dddb9351fa1cbd3562b621 3.13

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request on Jun 25, 2025
…35473)
The free threading build uses QSBR to delay the freeing of dictionary keys and list arrays when the objects are accessed by multiple threads, in order to allow concurrent reads to proceed without holding the object lock. The requests are processed in batches to reduce execution overhead, but for large memory blocks this can lead to excess memory usage. Take into account the size of the memory block when deciding when to process QSBR requests.
Also track the amount of memory being held by QSBR for mimalloc pages. Advance the write sequence if this memory exceeds a limit. Advancing the sequence will allow it to be freed more quickly.
Process the held QSBR items from the "eval breaker", rather than from `_PyMem_FreeDelayed()`. This gives a higher chance that the global read sequence has advanced enough so that items can be freed.
(cherry picked from commit 113de85)
Co-authored-by: Neil Schemenauer <nas-github@arctrix.com>
Co-authored-by: Sam Gross <colesbury@gmail.com>
@bedevere-app

GH-135912 is a backport of this pull request to the 3.14 branch.

@bedevere-app (bot) removed the needs backport to 3.14 (bugs and security fixes) label on Jun 25, 2025
Reviewers

@colesbury approved these changes

@ericsnowcurrently: awaiting requested review (code owner)

@markshannon: awaiting requested review (code owner)

@methane: awaiting requested review (code owner)

Assignees

@nascheme

Labels
needs backport to 3.13 (bugs and security fixes), performance (Performance or resource usage), topic-free-threading
3 participants
@nascheme, @tom-pytel, @colesbury
