
gh-115999: Specialize `LOAD_ATTR` for instance and class receivers in free-threaded builds (#128164)


Merged: mpage merged 75 commits into python:main from mpage:gh-115999-load-attr-instance-merged on Jan 14, 2025

Conversation

mpage (Contributor) commented Dec 21, 2024 (edited)

This PR finishes specialization for `LOAD_ATTR` in the free-threaded build by adding support for class and instance receivers.

The bulk of it is dedicated to making the specialized instructions and the specialization logic thread-safe. This consists of using atomics / locks in the appropriate places, avoiding races in specialization related to reloading versions, and ensuring that the objects stored in inline caches remain valid when accessed (by only storing deferred objects). See the section on "Thread Safety" below for more details.

Additionally, making this work required a few unrelated changes to fix existing bugs or work around differences between the two builds that result from only storing deferred values (which causes specialization failures in the free-threaded build when a value that would be stored in the cache is not deferred):

  • Fixed a bug in the cases generator where it wasn't treating macro tokens as terminators when searching for the assignment target of an expression involving `PyStackRef_FromPyObjectNew`.
  • Refactored `test_descr.MiscTests.test_type_lookup_mro_reference` to work when specialization fails (and also be a behavioral test).
  • Temporarily skip `test_capi.test_type.TypeTests.test_freeze_meta` when running refleaks tests on free-threaded builds. Specialization failure triggers an existing bug.

Single-threaded Performance

  • Performance is improved by ~12% on free-threaded builds.
  • Performance is ~neutral on default builds. (Technically the benchmark runner reports a 0.2% improvement, but that looks like noise to me.)

We're leaving a bit of perf on the table by only storing deferred objects: we can't specialize attribute lookups that resolve to class attributes (e.g. counters, settings). I haven't measured how much perf we're giving up, but I'd like to address that in a separate PR.

Scalability

The `object_cfunction` and `pymethod` benchmarks are improved (1.4x slower → 14.3x faster and 1.8x slower → 13.0x faster, respectively). Other benchmarks appear unchanged.

I would expect that `cmodule_function` would also improve, but it looks like the benchmark is bottlenecked on increfing the `int.__floor__` method that is returned from the call to `_PyObject_LookupSpecial` in `math_floor` (the incref happens in `_PyType_LookupRef`, which is called by `_PyObject_LookupSpecial`):

```c
PyObject *method = _PyObject_LookupSpecial(number, state->str___floor__);
```

Raw numbers:

```
Running benchmarks with 28 threads

                    Merge-base    This PR
object_cfunction    1.4x slower   14.3x faster
cmodule_function    1.7x slower   1.7x slower
mult_constant       12.8x faster  13.6x faster
generator           11.5x faster  12.2x faster
pymethod            1.8x slower   13.4x faster
pyfunction          12.2x faster  13.0x faster
module_function     14.5x faster  14.8x faster
load_string_const   13.0x faster  13.2x faster
load_tuple_const    12.7x faster  14.1x faster
create_pyobject     12.8x faster  13.5x faster
create_closure      13.9x faster  14.8x faster
create_dict         10.5x faster  13.3x faster
thread_local_read   3.9x slower   4.0x slower
```

Thread Safety

Thread safety of specialized instructions is addressed in a few different ways:

  • Use atomics where needed.
  • Lock the instance dictionary in `_LOAD_ATTR_WITH_HINT`.
  • Only store deferred objects in the inline cache. All uops that retrieve objects from the cache are preceded by type version guards, which ensure that the (deferred) value retrieved from the cache is valid. The type version will be mutated before destroying the reference in the type. If the type held the last reference, then GC will need to run in order to reclaim the object. GC requires stopping the world, so if GC reclaims the object then we will see the new type version and the guard will fail.

Thread safety of specialization is addressed using similar techniques:

  • Atomics / locking is used to access shared state.
  • Specialization code is refactored to read versions at the start of specialization and store the same version in caches (as opposed to rereading the version). This ensures that the specialization is consistent with the version by avoiding races where the type or shared keys dictionary changes after we've queried it.

Stats

Following the instructions in the comment preceding `specialize_attr_loadclassattr`, I collected stats for the default build for both this PR and its merge base using `./python -m test_typing test_re test_dis test_zlib` and compared them using `Tools/scripts/summarize_stats.py`. The results for `LOAD_ATTR` are nearly identical and are consistent with results from comparing the merge base against itself.

Commit messages (from this PR's commit list):

  • Look up a unicode key in an all-unicode-keys object along with the keys version, assigning one if not present. We need a keys version that is consistent with the presence of the key for use in the guards.
  • Reading the shared keys version and looking up the key need to be performed atomically. Otherwise, a key inserted after the lookup could race with reading the version, causing us to incorrectly specialize that no key shadows the descriptor.
  • Check that the type hasn't changed across lookups; descr is now an owned ref.
  • Use atomic load for value; use `_Py_TryIncrefCompareStackRef` for incref.
  • Use atomics and `_Py_TryIncrefCompareStackRef` in `_LOAD_ATTR_SLOT`; pass type version during specialization.
  • Check that `fget` is deferred; pass `tp_version`.
  • Macros should be treated as terminators when searching for the assignment target of an expression involving `PyStackRef_FromPyObjectNew`.
  • All instance loads are complete!
markshannon (Member) left a comment:
A lot of effort in this PR seems to go into `LOAD_ATTR_WITH_HINT`. There is a lot of locking and unlocking and extra machinery to support having a value on the stack between uops.
Is it worth it?

Generally, I think any effort spent on `LOAD_ATTR_WITH_HINT` is better spent on increasing the fraction of objects that use inlined values and can be handled by `LOAD_ATTR_INSTANCE_VALUE`.
It is not just that `LOAD_ATTR_INSTANCE_VALUE` is faster; it also means that the object is created faster and uses less memory.

```c
@@ -1129,6 +1129,21 @@ dictkeys_generic_lookup(PyDictObject *mp, PyDictKeysObject* dk, PyObject *key, P
     return do_lookup(mp, dk, key, hash, compare_generic);
 }

static Py_hash_t
check_keys_and_hash(PyDictKeysObject *dk, PyObject *key)
```
(Member):

I can't tell from the name what this is checking.
Maybe `check_dict_is_unicode_only_and_key_has_hash_defined`?
Which leads me to the question: why check both of these things in a single function? The two checks appear unrelated.

mpage (Contributor, Author):

> Which leads me to the question: why check both of these things in a single function.

This is factoring out some common code into a helper. I'll split it into two helpers with more descriptive names.

```c
}

split op(_LOAD_ATTR_INSTANCE_VALUE, (offset/1, owner -- attr, null if (oparg & 1))) {
    PyObject *owner_o = PyStackRef_AsPyObjectBorrow(owner);
    PyObject **value_ptr = (PyObject**)(((char *)owner_o) + offset);
-   PyObject *attr_o = *value_ptr;
+   PyObject *attr_o = FT_ATOMIC_LOAD_PTR_ACQUIRE(*value_ptr);
```
(Member):

How do we know that the values array is still valid at this point?

(Contributor):

There's a case that we are not handling correctly here: if, after loading `value_ptr`, another thread invalidates the inline values, modifies an attribute, and reallocates another object at the same memory address as the previous attribute, we can incorrectly return a value that wasn't the attribute at that offset.

We need to either:

  • Re-check `_PyObject_InlineValues(owner_o)->valid` at the end of this handler and clean up and deopt if it's not still true.
  • Set the contents of the inline values array to NULL when we mark it as invalid so that `_Py_TryIncrefCompareStackRef` handles this case and correctly deopts.

I think the second option is probably simpler to implement and more efficient, given that `_LOAD_ATTR_INSTANCE_VALUE` is common and invalidating inline values is relatively rare.

(Member):
Invalidating inline values might not be that rare.
I suggest gathering some stats before deciding how to handle this.

(Contributor):

From
https://github.com/faster-cpython/benchmarking-public/blob/main/results/bm-20250104-3.14.0a3%2B-0cafa97/bm-20250104-azure-x86_64-python-0cafa97932c6574734bb-3.14.0a3%2B-0cafa97-pystats.md:

```
LOAD_ATTR_INSTANCE_VALUE         | 4,922,474,147 |
Materialize dict (on request)    |     4,444,396 | 2.4%
Materialize dict (new key)       |       476,375 | 0.3%
Materialize dict (too big)       |         4,884 | 0.0%
Materialize dict (str subclass)  |             0 | 0.0%
```

So, `LOAD_ATTR_INSTANCE_VALUE` is somewhere between 1,000x and 10,000x more frequent than invalidating inline values. (Invalidating inline values requires first materializing the dictionary, but materializing the dictionary doesn't always invalidate the inline values.)

```c
DEOPT_IF(attr_o == NULL);
#ifdef Py_GIL_DISABLED
if (!_Py_TryIncrefCompareStackRef(value_ptr, attr_o, &attr)) {
```
(Member):
What does `_Py_TryIncrefCompareStackRef` do?
Function names need to be self-explanatory. The result of this function does not depend on a comparison, but on whether something has been modified. Maybe `_Py_TryIncrefIfPointersConsistentStackRef`, as it only works if the `PyObject **` and `PyObject *` pointers are consistent.

I know that `_Py_TryIncrefCompareStackRef` was not introduced in this PR, but it is critical to the understanding of this code.

(Contributor):

It is an atomic compare-and-incref function, similar to how `_Py_atomic_compare_exchange` is an atomic compare-and-swap function. If the comparison succeeds, the value is incref'd and 1 is returned; 0 is returned on failure.

"Compare" is a more appropriate term than "consistent" here.

(Member):

I still think "Compare" is misleading. The name should say what the function does, not how it does it.

```c
PyObject *attr_o;
if (!LOCK_OBJECT(dict)) {
    DEAD(dict);
    POP_DEAD_INPUTS();
```
(Member):
This seems rather obscure. I see why you need to pop `dict`, but `DEAD(dict); POP_DEAD_INPUTS()` seems a rather convoluted way to do this.

Since you're adding a new code generator macro, why not `POP(dict)`?
I think that most people would expect that an explicitly popped value is no longer live, so there should be no need for a kill and a pop.

mpage (Contributor, Author):

Sure, I can look at that. `POP` is already taken by a macro in `ceval_macros.h`, but we could use `POP_INPUT`.

(Member):

Aside:
We shouldn't be using `POP`, as the code generator should handle pops. We can remove `POP` in another PR.

Commit messages (from this PR's commit list):

  • These are two logically separate operations.
  • This avoids a race where another thread invalidates the values, overwrites an attribute stored in the values, and allocates a new object at the address present in the values.
mpage (Contributor, Author) commented Jan 10, 2025 (edited):

> A lot of effort in this PR seems to go into LOAD_ATTR_WITH_HINT. There is a lot of locking and unlocking and extra machinery to support having a value on the stack between uops.
> Is it worth it?

The extra machinery to support having a value on the stack is also necessary for `LOAD_ATTR_MODULE`, so the only net new effort is locking. Increasing the fraction of objects that use inlined values and can be handled by `LOAD_ATTR_INSTANCE_VALUE` does sound like useful work, but can be done separately.

mpage requested a review from colesbury on January 10, 2025 01:43
mpage requested a review from colesbury on January 10, 2025 21:53
mpage added the 🔨 test-with-refleak-buildbots label (Test PR w/ refleak buildbots; report in status section) on Jan 10, 2025
bedevere-bot:

🤖 New build scheduled with the buildbot fleet by @mpage for commit 7ab3ec6 🤖

If you want to schedule another build, you need to add the 🔨 test-with-refleak-buildbots label again.

bedevere-bot removed the 🔨 test-with-refleak-buildbots label on Jan 10, 2025
mpage merged commit b5ee025 into python:main on Jan 14, 2025 (63 checks passed)
mpage deleted the gh-115999-load-attr-instance-merged branch on January 14, 2025 19:56
nascheme added commits to nascheme/cpython and to python/cpython that referenced this pull request on Apr 29, 2025:

> Change the unit test case to use `getattr()` so that we avoid the bytecode specializer optimizing the access. The specializer will call the `__eq__` method before the unit test expects, causing it to fail. In the 3.14 branch (gh-128164) the test is changed in a different way to avoid the same issue.
Reviewers

  • @nascheme left review comments
  • @markshannon left review comments
  • @colesbury approved these changes
  • @methane (code owner): awaiting requested review
  • @Fidget-Spinner (code owner): awaiting requested review

5 participants: @mpage, @nascheme, @bedevere-bot, @colesbury, @markshannon
