gh-129201: Use prefetch in GC mark alive phase. #129203


Merged
nascheme merged 12 commits into python:main from nascheme:gh-129201-gc-mark-prefetch
Feb 5, 2025

Conversation

nascheme (Member) commented Jan 22, 2025 (edited)

This PR implements prefetching as suggested in the issue. The benchmarks that are part of pyperformance generally don't show much difference since they don't use enough memory to make prefetch effective.

Source code for the two "big" benchmarks. These create a fairly large object graph and then call gc.collect() to time it.

gc big tree
gc big

Updated benchmarks for commit abfc49a. The 5ff2fbc commit made previous versions of this PR look much worse on certain benchmarks. That's because that commit improves the "mark alive" pass so that it can find many more alive objects in some cases (it fixes finding objects referred to only from the stack).

I tried optimizing this PR so that it was at least as fast as current "main" when the number of objects was relatively small. However, I was not successful with that, so I instead made it so that the prefetch approach is only used if the number of long-lived objects is over 200k. Based on tests on my Ryzen 5, that seems about the point where the prefetch approach is faster.

Run on a Macbook Pro M3, clang 19, PGO/LTO enabled. Both "base" and "prefetch" are configured with ./configure --enable-optimizations --disable-gil --with-lto.

| benchmark       | build    | prefetch | time [ms] | speedup |
|-----------------|----------|----------|-----------|---------|
| gc_big          | base     | -        | 787       | 1.00    |
| gc_big          | prefetch | on       | 651       | 1.21    |
| gc_big          | prefetch | off      | 787       | 1.00    |
| gc_big_tree     | base     | -        | 334       | 1.00    |
| gc_big_tree     | prefetch | on       | 293       | 1.14    |
| gc_big_tree     | prefetch | off      | 318       | 1.05    |
| bm_gc_collect   | base     | -        | 1200      | 1.00    |
| bm_gc_collect   | prefetch | on       | 1199      | 1.00    |
| bm_gc_collect   | prefetch | off      | 1191      | 1.01    |
| bm_gc_traversal | base     | -        | 619       | 1.00    |
| bm_gc_traversal | prefetch | on       | 620       | 1.00    |
| bm_gc_traversal | prefetch | off      | 619       | 1.00    |

Benchmarks below run on a Ryzen 5 7600X.

| benchmark   | build    | prefetch | time [ms] | speedup |
|-------------|----------|----------|-----------|---------|
| gc_big      | base     | -        | 634       | 1.00    |
| gc_big      | prefetch | on       | 452       | 1.40    |
| gc_big      | prefetch | off      | 1013      | 0.63    |
| gc_big_tree | base     | -        | 425       | 1.00    |
| gc_big_tree | prefetch | on       | 346       | 1.23    |
| gc_big_tree | prefetch | off      | 502       | 0.85    |

pyperformance results comparing to merge base.

Some notes about the implementation:

  • On my Macbook M3, it seems that using PREFETCH_T0 is slightly faster, whereas on my AMD Ryzen 5 desktop PREFETCH_T1 is a bit faster. The difference is pretty small. There are also prefetch instructions that indicate you are going to write to the word (the second parameter in the __builtin_prefetch() variation). If we are visiting a GC object for the first time, we will be writing to it to set the "alive" bit in the flags. However, if it's not a GC object or we have visited it already, we don't write. So reads would seem quite a bit more common.
  • The cross-platform prefetch preprocessor logic was inspired by the zstd source code. I haven't tested the Windows or __aarch64__ conditions of that block.
  • BUFFER_SIZE, BUFFER_HI, and BUFFER_LO have been moderately tuned based on benchmarking on my Ryzen 5 machine. I'm kind of surprised that BUFFER_HI is so small, but that's what works best. There could be some additional tuning of the logic for when to put objects on the stack vs. into the buffer, and when to push a new "span" vs. adding part of it to the buffer. I did what I thought was logical but didn't benchmark all the different approaches.
  • I ran some tests that recorded a trace of the prefetch and access (pointer-following) events and then computed some statistics from the trace log. Based on this, it seems like the prefetch buffer is working fairly well (the median delay between prefetch and access was 14 on one benchmark run; ASCII histogram below).

        Prefetch window stats
        mean 52.07   median 14.0   max 256
          25.60 | 65,304 | ******************************
          51.20 |  5,590 | **
          76.80 |  3,562 | *
         102.40 |  2,683 | *
         128.00 |  2,278 | *
         153.60 |  2,285 | *
         179.20 |  2,377 | *
         204.80 |  2,238 | *
         230.40 |  2,753 | *
         256.00 |  5,930 | **
        -------- | ------ | -------
             N= | 95,000 |

  • Having the specialized gc_mark_traverse_list and gc_mark_traverse_tuple functions seems like a win. Even if the program doesn't use many long lists and tuples, it doesn't cost much to have them.
  • The _PyTuple_MaybeUntrack() call is a bit expensive and it gets done on every collection. I considered adding a _PyGC_BITS_NEW bit and setting it on newly created tuples. Then we would only need to check a tuple once rather than once per collection. I tried disabling the tuple untracking, but things got slower.
  • As currently implemented, all the object pointers revealed by tp_traverse go through the mark buffer/stack. That seems wasteful given that a good fraction of them will be non-GC objects and some of them are known objects like None or True. I tried adding a check for BSS-allocated objects (using __bss_start and end), but that doesn't seem like a win. Maybe with some more refinement this would work. We could allocate a bunch of known non-GC objects into a certain region of memory, and then gc_propagate_alive() could just quickly skip them based on the pointer address.
  • I considered adding a new number field to the type object, to allow gc_propagate_alive() to use a switch statement to dispatch to the correct traverse function. Something like switch (Py_TYPE(ob)->tp_gc_ordinal), where the tp_gc_ordinal value on known types is set at startup. That way, the switch statement can use an efficient jump table or similar. However, with only list and tuple specialized at this time, it doesn't seem worth it.

nascheme added the type-feature (A feature request or enhancement), performance (Performance or resource usage), and topic-free-threading labels on Jan 22, 2025
When traversing a list or tuple, use a "span" if the buffer can't hold all the items from the collection.  This reduces the size of the object stack needed if large collections are encountered.  It also helps keep the buffer size optimal for prefetching.
nascheme force-pushed the gh-129201-gc-mark-prefetch branch from fe9898a to 7f51104 on January 24, 2025
It's possible for lists or tuples to have a NULL item.  Handle that in the case where all item elements fit into the buffer.
nascheme (Member, Author) commented:

Benchmarks for an earlier version of the PR, without the "span" for lists/tuple.

The benchmark results below come from my own workstation (AMD Ryzen 5 7600X, Linux, GCC 12.2.0). I'm using --enable-optimizations but LTO is turned off. "Prefetch off" means the prefetch CPU instruction is omitted; otherwise the code is the same.

There is something bad happening with the default-build GC if gc.disable() is not used when creating a big data structure. With the GC enabled, the "gc big" benchmark takes 5x as long. My guess is that the generation 0 collections are shuffling the objects in the GC next/prev linked list, but that's only a guess.

The "bm_gc_collect" benchmark was taken from pyperformance with the constants adjusted: CYCLES = 100_000 and LINKS = 40.

The "prefetch (7f756eb0)" code branch is essentially the same as this PR (1b4e8c3). I just rebased it on the current main and removed some dead code.

| code branch         | build   | prefetch on | benchmark            | time [ms] |
|---------------------|---------|-------------|----------------------|-----------|
| prefetch (7f756eb0) | nogil   | yes         | bm_async_tree_io_tg  | 6,752     |
| prefetch (7f756eb0) | nogil   | no          | bm_async_tree_io_tg  | 6,964     |
| main (3829104)      | nogil   | -           | bm_async_tree_io_tg  | 6,960     |
| main (3829104)      | default | -           | bm_async_tree_io_tg  | 7,323     |
| prefetch (7f756eb0) | nogil   | yes         | bm_gc_collect        | 1,418     |
| prefetch (7f756eb0) | nogil   | no          | bm_gc_collect        | 1,517     |
| main (3829104)      | nogil   | -           | bm_gc_collect        | 1,473     |
| main (3829104)      | default | -           | bm_gc_collect        | 20,966    |
| prefetch (7f756eb0) | nogil   | yes         | gc big               | 587       |
| prefetch (7f756eb0) | nogil   | no          | gc big               | 1,064     |
| main (3829104)      | nogil   | -           | gc big               | 963       |
| main (3829104)      | default | -           | gc big               | 5,785     |
| prefetch (7f756eb0) | nogil   | yes         | gc big tree          | 363       |
| prefetch (7f756eb0) | nogil   | no          | gc big tree          | 519       |
| main (3829104)      | nogil   | -           | gc big tree          | 723       |
| main (3829104)      | default | -           | gc big tree          | 3,443     |
| main (3829104)      | default | -           | gc big (gc disabled) | 641       |

nascheme marked this pull request as ready for review on January 27, 2025
corona10 (Member) commented Jan 29, 2025 (edited)

Just saying this in passing, I recommend using the word "freethreaded" instead of "nogil". :)

nascheme requested a review from mpage on January 30, 2025
Need to clear the "alive" bit.
nascheme marked this pull request as draft on January 31, 2025
nascheme (Member, Author) commented:

Merging with 'main' has pretty significantly impacted the gc_traversal benchmark, for the worse. I'll do some investigation on that. Something to do with 5ff2fbc would be my guess.

Using the prefetch buffer only helps if there are enough objects.  Use the long-lived count to decide if it's worth enabling.  If not, fall back to the previous method of marking objects alive (dereference object pointers as we encounter them).

Improve code by adding some additional helper functions, adding comments and general tidying.  The buffer logic has been changed to use a mask for size rather than the % operator.  Some other small optimizations that only help a little.
nascheme marked this pull request as ready for review on February 3, 2025
mpage (Contributor) left a comment:

LGTM. The use of spans is a nice touch :) I'd love to see this on a real world application with a larger heap than what's currently available in the benchmark suite. Might also be good to get another set of eyes who are familiar with the GC (Dino or Sam?) to take a look as well.

nascheme merged commit cdcacec into python:main on Feb 5, 2025
42 checks passed
bedevere-bot commented:

⚠️⚠️⚠️ Buildbot failure⚠️⚠️⚠️

Hi! The buildbot aarch64 RHEL8 LTO 3.x has failed when building commit cdcacec.

What do you need to do:

  1. Don't panic.
  2. Check the buildbot page in the devguide if you don't know what the buildbots are or how they work.
  3. Go to the page of the buildbot that failed (https://buildbot.python.org/#/builders/338/builds/7997) and take a look at the build logs.
  4. Check if the failure is related to this commit (cdcacec) or if it is a false positive.
  5. If the failure is related to this commit, please, reflect that on the issue and make a new Pull Request with a fix.

You can take a look at the buildbot page here:

https://buildbot.python.org/#/builders/338/builds/7997

Summary of the results of the build (if available):


Traceback logs:

    Traceback (most recent call last):
      File "/home/buildbot/buildarea/3.x.cstratak-RHEL8-aarch64.lto/build/Lib/threading.py", line 1054, in _bootstrap_inner
        self.run()
        ~~~~~~~~^^
      File "/home/buildbot/buildarea/3.x.cstratak-RHEL8-aarch64.lto/build/Lib/threading.py", line 996, in run
        self._target(*self._args, **self._kwargs)
        ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/buildbot/buildarea/3.x.cstratak-RHEL8-aarch64.lto/build/Lib/test/test_interpreters/test_stress.py", line 47, in run
        interp = interpreters.create()
      File "/home/buildbot/buildarea/3.x.cstratak-RHEL8-aarch64.lto/build/Lib/test/support/interpreters/__init__.py", line 76, in create
        id = _interpreters.create(reqrefs=True)
    interpreters.InterpreterError: interpreter creation failed

srinivasreddy pushed a commit to srinivasreddy/cpython that referenced this pull request on Feb 7, 2025
For the free-threaded version of the cyclic GC, restructure the "mark alive" phase to use software prefetch instructions.  This gives a speedup in most cases when the number of objects is large enough.  The prefetching is enabled conditionally based on the number of long-lived objects the GC finds.
cmaloney pushed a commit to cmaloney/cpython that referenced this pull request on Feb 8, 2025
For the free-threaded version of the cyclic GC, restructure the "mark alive" phase to use software prefetch instructions.  This gives a speedup in most cases when the number of objects is large enough.  The prefetching is enabled conditionally based on the number of long-lived objects the GC finds.

Reviewers

mpage approved these changes


Labels

performance (Performance or resource usage), topic-free-threading, type-feature (A feature request or enhancement)




4 participants

@nascheme, @corona10, @bedevere-bot, @mpage
