gh-115999: Specialize `LOAD_ATTR` for instance and class receivers in free-threaded builds #128164
Conversation
Look up a unicode key in an all-unicode-keys object along with the keys version, assigning one if not present. We need a keys version that is consistent with the presence of the key for use in the guards.
Reading the shared keys version and looking up the key need to be performed atomically. Otherwise, a key inserted after the lookup could race with reading the version, causing us to incorrectly specialize that no key shadows the descriptor.
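For illustration, here is a minimal sketch of that idea, assuming a per-keys lock (`LOCK_KEYS`/`UNLOCK_KEYS`) and the existing unicode lookup helper in `Objects/dictobject.c`; the helper name and fields used here are assumptions, and the real change also assigns a version when one is not yet present.

```c
/* Illustrative only: probe for a unicode key and read the shared-keys
 * version under the same lock, so a concurrent insertion cannot be observed
 * between the lookup and the version read.  LOCK_KEYS/UNLOCK_KEYS,
 * unicodekeys_lookup_unicode and dk_version are assumed dictobject.c
 * internals; the actual helper may differ. */
static Py_ssize_t
lookup_unicode_and_shared_keys_version(PyDictKeysObject *dk, PyObject *key,
                                       Py_hash_t hash, uint32_t *version)
{
    Py_ssize_t ix;
    LOCK_KEYS(dk);
    ix = unicodekeys_lookup_unicode(dk, key, hash);
    /* Read under the same lock: the version is consistent with whether
     * `key` was present at the time of the lookup. */
    *version = dk->dk_version;
    UNLOCK_KEYS(dk);
    return ix;
}
```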
Everything starts out disabled
* Check that the type hasn't changed across lookups
* Descr is now an owned ref
…_INST_ATTR_FROM_DICT
- Use atomic load for value
- Use _Py_TryIncrefCompareStackRef for incref

- Use atomics and _Py_TryIncrefCompareStackRef in _LOAD_ATTR_SLOT
- Pass type version during specialization

- Check that fget is deferred
- Pass tp_version
Macros should be treated as terminators when searching for the assignment target of an expression involving PyStackRef_FromPyObjectNew
All instance loads are complete!
A lot of effort in this PR seems to go into `LOAD_ATTR_WITH_HINT`. There is a lot of locking and unlocking and extra machinery to support having a value on the stack between uops.
Is it worth it?
Generally, I think any effort spent on `LOAD_ATTR_WITH_HINT` is better spent on increasing the fraction of objects that use inlined values and can be handled by `LOAD_ATTR_INSTANCE_VALUE`.
It is not just that `LOAD_ATTR_INSTANCE_VALUE` is faster; it also means that the object is created faster and uses less memory.
Objects/dictobject.c Outdated
```diff
@@ -1129,6 +1129,21 @@ dictkeys_generic_lookup(PyDictObject *mp, PyDictKeysObject* dk, PyObject *key, P
     return do_lookup(mp, dk, key, hash, compare_generic);
 }

 static Py_hash_t
 check_keys_and_hash(PyDictKeysObject *dk, PyObject *key)
```
I can't tell from the name what this is checking.
Maybe `check_dict_is_unicode_only_and_key_has_hash_defined`?
Which leads me to the question: why check both of these things in a single function? The two checks appear unrelated.
> Which leads me to the question: why check both of these things in a single function.

This is factoring out some common code into a helper. I'll split it into two helpers with more descriptive names.
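A hypothetical shape for that split, purely to illustrate the naming suggestion; the names and fields below are assumptions based on dictobject.c internals, not the helpers actually added in the PR.

```c
/* Hypothetical split of check_keys_and_hash() into two single-purpose
 * helpers, one per check. */
static bool
dict_keys_are_unicode_only(PyDictKeysObject *dk)
{
    return dk->dk_kind != DICT_KEYS_GENERAL;
}

static Py_hash_t
unicode_key_cached_hash(PyObject *key)
{
    /* For str keys the hash is cached on the object; -1 means it has not
     * been computed yet, so callers cannot take the fast path. */
    return ((PyASCIIObject *)key)->hash;
}
```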
```diff
 }
 split op(_LOAD_ATTR_INSTANCE_VALUE, (offset/1, owner -- attr, null if (oparg & 1))) {
     PyObject *owner_o = PyStackRef_AsPyObjectBorrow(owner);
     PyObject **value_ptr = (PyObject**)(((char *)owner_o) + offset);
-    PyObject *attr_o = *value_ptr;
+    PyObject *attr_o = FT_ATOMIC_LOAD_PTR_ACQUIRE(*value_ptr);
```
How do we know that the values array is still valid at this point?
There's a case that we are not handling correctly here: if, after loading `value_ptr`, another thread invalidates the inline values, modifies an attribute, and reallocates another object at the same memory address as the previous attribute, we can incorrectly return a value that wasn't the attribute at that offset.
We need to either:

- Re-check `_PyObject_InlineValues(owner_o)->valid` at the end of this handler and clean up and deopt if it's not still true.
- Set the contents of the inline values array to NULL when we mark it as invalid so that `_Py_TryIncrefCompareStackRef` handles this case and correctly deopts.

I think the second option is probably simpler to implement and more efficient, given that `_LOAD_ATTR_INSTANCE_VALUE` is common and invalidating inline values is relatively rare.
Invalidating inline values might not be that rare.
I suggest gathering some stats before deciding how to handle this.
LOAD_ATTR_INSTANCE_VALUE: 4,922,474,147

| Stat | Count | Ratio |
| --- | --- | --- |
| Materialize dict (on request) | 4,444,396 | 2.4% |
| Materialize dict (new key) | 476,375 | 0.3% |
| Materialize dict (too big) | 4,884 | 0.0% |
| Materialize dict (str subclass) | 0 | 0.0% |

So, LOAD_ATTR_INSTANCE_VALUE is somewhere between 1,000x and 10,000x more frequent than invalidating inline values. (Invalidating inline values requires first materializing the dictionary, but materializing the dictionary doesn't always invalidate the inline values.)
```c
    DEOPT_IF(attr_o == NULL);
#ifdef Py_GIL_DISABLED
    if (!_Py_TryIncrefCompareStackRef(value_ptr, attr_o, &attr)) {
```
What does `_Py_TryIncrefCompareStackRef` do?
Function names need to be self-explanatory. The result of this function does not depend on a comparison, but on whether something has been modified. Maybe `_Py_TryIncrefIfPointersConsistentStackRef`, as it only works if the `PyObject **` and `PyObject *` pointers are consistent.
I know that `_Py_TryIncrefCompareStackRef` was not introduced in this PR, but it is critical to the understanding of this code.
It is an atomic compare-and-incref function, similar to how `_Py_atomic_compare_exchange` is an atomic compare-and-swap function. If the comparison succeeds, the value is incref'd and 1 is returned. 0 is returned on failure.
"Compare" is a more appropriate term than "consistent" here.
I still think "Compare" is misleading. The name should say what the function does, not how it does it.
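For readers unfamiliar with the primitive, here is a minimal sketch of the compare-and-incref idea, not the actual implementation: `_Py_TryIncref` and the acquire load are assumed primitives, and the real function produces a `_PyStackRef` rather than just returning a flag.

```c
/* Illustrative compare-and-incref: succeed only if `op` is still the value
 * stored at *src after we have taken a new strong reference to it. */
static int
try_incref_compare_sketch(PyObject **src, PyObject *op)
{
    /* Try to take a new reference; this can fail if `op` is concurrently
     * being freed (assumed _Py_TryIncref semantics). */
    if (!_Py_TryIncref(op)) {
        return 0;
    }
    /* Re-read the slot: if it no longer holds `op`, the reference we took
     * is to a stale value, so give it back and report failure (deopt). */
    if (_Py_atomic_load_ptr_acquire(src) != op) {
        Py_DECREF(op);
        return 0;
    }
    return 1;  /* `op` was still the current value and is now owned */
}
```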
Python/bytecodes.c Outdated
```c
    PyObject *attr_o;
    if (!LOCK_OBJECT(dict)) {
        DEAD(dict);
        POP_DEAD_INPUTS();
```
This seems rather obscure. I see why you need to pop `dict`, but `DEAD(dict); POP_DEAD_INPUTS()` seems a rather convoluted way to do this.
Since you are adding a new code generator macro, why not `POP(dict)`?
I think that most people would expect that an explicitly popped value is no longer live, so there should be no need for a kill and a pop.
Sure, I can look at that. `POP` is already taken by a macro in `ceval_macros.h`, but we could use `POP_INPUT`.
Aside:
We shouldn't be using `POP`, as the code generator should handle pops. We can remove `POP` in another PR.
These are two logically separate operations
This avoids a race where another thread invalidates the values, overwrites an attribute stored in the values, and allocates a new object at the address present in the values.
mpage commented Jan 10, 2025 (edited)
The extra machinery to support having a value on the stack is also necessary for …
bedevere-bot commented Jan 10, 2025
b5ee025 into python:main
Change the unit test case to use `getattr()` so that we avoid the bytecode specializer optimizing the access. The specializer will call the `__eq__` method before the unit test expects, causing it to fail. In the 3.14 branch (pythongh-128164) the test is changed in a different way to avoid the same issue.

Change the unit test case to use `getattr()` so that we avoid the bytecode specializer optimizing the access. The specializer will call the `__eq__` method before the unit test expects, causing it to fail. In the 3.14 branch (gh-128164) the test is changed in a different way to avoid the same issue.
This PR finishes specialization for `LOAD_ATTR` in the free-threaded build by adding support for class and instance receivers.

The bulk of it is dedicated to making the specialized instructions and the specialization logic thread-safe. This consists of using atomics / locks in the appropriate places, avoiding races in specialization related to reloading versions, and ensuring that the objects stored in inline caches remain valid when accessed (by only storing deferred objects). See the section on "Thread Safety" below for more details.
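As a concrete illustration of the "only storing deferred objects" rule, here is a hedged sketch of the kind of check the specializer can make before caching a borrowed pointer; the helper names (`_Py_IsImmortal`, `_PyObject_HasDeferredRefcount`) are my reading of the relevant internals, not necessarily the exact code in this PR.

```c
/* A borrowed pointer stored in an inline cache is only safe to read without
 * synchronization if the object can never be freed while the code object is
 * alive: immortal objects and objects using deferred reference counting
 * qualify.  Anything else must cause specialization to fail. */
static bool
can_cache_borrowed_pointer(PyObject *obj)
{
    return _Py_IsImmortal(obj) || _PyObject_HasDeferredRefcount(obj);
}
```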
Additionally, making this work required a few unrelated changes to fix existing bugs or work around differences between the two builds that result from only storing deferred values (which causes specialization failures in the free-threaded build when a value that would be stored in the cache is not deferred):

- Treat macros as terminators when searching for the assignment target of an expression involving `PyStackRef_FromPyObjectNew`.
- Change `test_descr.MiscTests.test_type_lookup_mro_reference` to work when specialization fails (and also be a behavioral test).
- `test_capi.test_type.TypeTests.test_freeze_meta`, when running refleaks tests on free-threaded builds: specialization failure triggers an existing bug.

Single-threaded Performance
We're leaving a bit of perf on the table by only storing deferred objects: we can't specialize attribute lookups that resolve to class attributes (e.g. counters, settings). I haven't measured how much perf we're giving up, but I'd like to address that in a separate PR.
Scalability
The `object_cfunction` and `pymethod` benchmarks are improved (1.4x slower -> 14.3x faster and 1.8x slower -> 13.0x faster, respectively). Other benchmarks appear unchanged.

I would expect that `cmodule_function` would also improve, but it looks like the benchmark is bottlenecked on increfing the `int.__floor__` method that is returned from the call to `_PyObject_LookupSpecial` in `math_floor` (the incref happens in `_PyType_LookupRef`, which is called by `_PyObject_LookupSpecial`):

cpython/Modules/mathmodule.c, line 1178 in 3879ca0

Raw numbers:
Thread Safety
Thread safety of specialized instructions is addressed in a few different ways: … `_LOAD_ATTR_WITH_HINT`.

Thread safety of specialization is addressed using similar techniques: …
Stats
Following the instructions in the comment preceding `specialize_attr_loadclassattr`, I collected stats for the default build for both this PR and its merge base using `./python -m test_typing test_re test_dis test_zlib` and compared them using `Tools/scripts/summarize_stats.py`. The results for `LOAD_ATTR` are nearly identical and are consistent with results from comparing the merge base against itself: