Uh oh!
There was an error while loading.Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork32k
gh-127266: avoid data races when updating type slots#131174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Uh oh!
There was an error while loading.Please reload this page.
Conversation
kumaraditya303 commentedMar 14, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
We can avoid stop the world pauses in both cases by checking if the type is uniquely ref by the current thread, the type object is newly created here and not exposed to other threads so we can assign without stopping the world. |
This would be better to be part of the pystats info.
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I'm concerned about atomicity here (see comment below).
Uh oh!
There was an error while loading.Please reload this page.
Objects/typeobject.c Outdated
PyInterpreterState *interp = _PyInterpreterState_GET(); | ||
_PyEval_StopTheWorld(interp); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
_PyEval_StopTheWorld()
is always defined and is a no-op in the default build.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
We might be able to make |
For TYPE_LOCK, replace critical sections with a simple mutex. A numberof changes were needed to avoid deadlocking on reentrancy (trying tore-acquire the mutex). Split _PyType_LookupRefAndVersion() into a lockedand unlocked version. Remove atomic operations where they are notrequired. Remove some cases of TYPE_LOCK being held that is notrequired.
I've converted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Nice!
I think there's still too much mixing of the type lock and the stop the world. I think data should either be protected by the type lock OR by stop-the-world, but not both.
From reading through this:
type lock:
- tp_version_tag
- _spec_cache
stop the world:
- tp_flags (modifications)
- bases (modifications)
Additionally, there's a few functions that are named with_lock_held
but are called in a stop-the-world pause (not with the type lock held), which I found a bit confusing.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Objects/typeobject.c Outdated
types_mutex_unlock(); | ||
new_mro = mro_invoke(type); /* might cause reentrance */ | ||
types_mutex_lock(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
This doesn't seem right to me. This can be called with both the type lock held and the world stopped. Ifmro_invoke
can call arbitrary code, it needs to start the world around the call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Yes, that can call a custommro()
method on the meta class. The only thing in Lib that does that is atest_descr
test but obviously we need to handle it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I believe this has been resolved. Insidemro_invoke()
I either start-the-world before callingmro()
(if it's custom) or I unlock the mutex. The latter case happens duringtype_ready()
.
Uh oh!
There was an error while loading.Please reload this page.
Okay, I've reverted back to using a critical section for |
This avoids the watcher callback potentially using invalid cacheentries.
* Since type_modified_unlocked() can be called with the world stopped then we need to re-start if making re-entrant calls (like calling type watcher callbacks).* Re-order code before type_modified_unlocked(). This fixes a potential bug if called functions end up re-assigning the type version tag. If they do, the cache can contain invalid entries. We should do the dictionary update first then set the version to zero.* Replace 'is_new_type' with 'world_stopped' boolean parameter. This seems a bit more clear.
Rather than trying to update tp_base, tp_bases, and tp_mro while theworld is stopped, hold TYPE_LOCK when updating them. This is much moresimilar to how the base version of the code works and requires manyfewer changes. Using stop-the-world turned out to be difficult to makework correctly, mostly due to custom mro() methods but also for someother reasons (for example, functions that potentially re-enter theinterpreter).This re-work requires that we can call stop-the-world inside a criticalsection and have the mutex still held after re-starting. This seems towork and I've added asserts to confirm it. Stop-the-world is now onlyused when updating either tp_flags or other type slots and not the threetype members listed above.
nascheme commentedApr 23, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
Major rework, reverting back to using Here is a summary of what's changed vs the main branch:
|
These are not required as part of this PR and can be reviewedseparately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
This LGTM once the issue withtest_opcache
is resolved
Now that these slot updates use stop-the-world, these two tests arequite a lot slower. Reduce size of the items list from 1000 to 100.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
e414a2d
intopython:mainUh oh!
There was an error while loading.Please reload this page.
…thongh-131174)This is triggering deadlocks in test_opcache.
Uh oh!
There was an error while loading.Please reload this page.
Avoid data races when updating type slots and type flags by "stopping-the-world" first. This is only needed if the type is potentially exposed to multiple threads. There are a couple places this happens:
type_setattro()
when the assigned item is a "dunder" method that corresponds to one or more slot__bases__
_abc.c
calling_PyType_SetFlags()
to set the collection flags (SEQUENCE or MAPPING)Two examples of code that causes this:
dataclasses
will assign dunder methods after the type object has been created, when the decorator runs. That triggers this and since the slots are assigned one-by-one, we stop for each assignment. Not great. Another example is unittestMagicMock
. Again, it assigns dunder methods after the class is defined.In order to assess the performance impact of this change, I've measured the following things. The time it takes to run the unit test suite does not seem significantly worse, even when running tests serially using
-j 1
. Comparing the pyperformance results before and after this change shows no significant change. See link below for the chart.Running with TSAN enabled and running the
--tsan
and--tsan-parallel --parallel-threads=4
tests show no warnings. I also have a modified pyperformance test suite that I can run in parallel and that shows no TSAN warnings related to this as well.Additional changes after feedback from Sam:
TYPE_LOCK
with a simple mutex instead. To avoid dead-locks, the code was analyzed for re-entrancy and adjusted as needed. There are a couple places where the mutex is dropped before calling something that's possibly re-entrant.tp_mro
no longer needs to hold the lock).type_lookup_ex()
as a lower-level version of_PyType_LookupStackRefAndVersion
that optionally acquires theTYPE_LOCK
or requires that it is already held.Benchmark using pyperformance