NotificationsYou must be signed in to change notification settings
Fork32k
Star67.3k

gh-127266: avoid data races when updating type slots#131174

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Jump to bottom

Merged

nascheme merged 54 commits intopython:mainfromnascheme:gh-127266-type-slots-ts

Apr 28, 2025

Merged

gh-127266: avoid data races when updating type slots#131174

nascheme merged 54 commits intopython:mainfromnascheme:gh-127266-type-slots-ts

Apr 28, 2025

Conversation

Copy link

Member

nascheme commentedMar 13, 2025•
edited
Loading

Avoid data races when updating type slots and type flags by "stopping-the-world" first. This is only needed if the type is potentially exposed to multiple threads. There are a couple places this happens:

type_setattro() when the assigned item is a "dunder" method that corresponds to one or more slot
assigning to__bases__
_abc.c calling_PyType_SetFlags() to set the collection flags (SEQUENCE or MAPPING)

Two examples of code that causes this:dataclasses will assign dunder methods after the type object has been created, when the decorator runs. That triggers this and since the slots are assigned one-by-one, we stop for each assignment. Not great. Another example is unittestMagicMock. Again, it assigns dunder methods after the class is defined.

In order to assess the performance impact of this change, I've measured the following things. The time it takes to run the unit test suite does not seem significantly worse, even when running tests serially using-j 1. Comparing the pyperformance results before and after this change shows no significant change. See link below for the chart.

Running with TSAN enabled and running the--tsan and--tsan-parallel --parallel-threads=4 tests show no warnings. I also have a modified pyperformance test suite that I can run in parallel and that shows no TSAN warnings related to this as well.

Additional changes after feedback from Sam:

Replace the critical sections that useTYPE_LOCK with a simple mutex instead. To avoid dead-locks, the code was analyzed for re-entrancy and adjusted as needed. There are a couple places where the mutex is dropped before calling something that's possibly re-entrant.
Remove some locking where it is not needed. Slots can only be changed when the world is stopped (e.g. readingtp_mro no longer needs to hold the lock).
Createtype_lookup_ex() as a lower-level version of _PyType_LookupStackRefAndVersion that optionally acquires theTYPE_LOCK or requires that it is already held.
Add a new type flag that is used for debug enabled builds. It tries to check if a type is potentially exposed to multiple threads (and therefore needs stop-the-world before modifying slots). This flag could perhaps be removed before the final merge of this PR (since type flag bits are kind of in short supply).

Benchmark using pyperformance

Issue:Type slots are not thread-safe in free-threaded builds #127266

wip: update type slots, stop-the-world

0e995ad

bedevere-appbot mentioned this pull request

Mar 13, 2025

Type slots are not thread-safe in free-threaded builds#127266

Closed

nascheme added2 commits

March 13, 2025 15:03

Remove unneeded atomics for tp_flags.

b173f17

Use stop-the-world for tp_flag changes too.

d4ce112

Copy link

Contributor

kumaraditya303 commentedMar 14, 2025•
edited
Loading

Two examples that cause this: dataclasses will assign dunder methods after the type object has been created. That triggers this and since the slots are assigned one-by-one, we stop for each assignment. Not great. Another example is unittest MockMock. Again, it assigns dunder methods after the class is defined.

We can avoid stop the world pauses in both cases by checking if the type is uniquely ref by the current thread, the type object is newly created here and not exposed to other threads so we can assign without stopping the world.

nascheme added4 commits

March 14, 2025 11:54

Remove 'world_stops' and 'sys._get_world_stops'.

44eb332

This would be better to be part of the pystats info.

Improve code comments.

d132fab

Remove TSAN suppressions that seem unneeded.

658bcd5

Add NEWS file.

ef2f07b

nascheme marked this pull request as ready for review

March 14, 2025 21:37

nascheme requested a review frommarkshannon as acode owner

March 14, 2025 21:37

bedevere-appbot added the awaiting core review label

Mar 14, 2025

nascheme requested review frommpage andcolesbury

March 15, 2025 18:11

kumaraditya303 reviewed

Mar 16, 2025

View reviewed changes

Objects/typeobject.c OutdatedShow resolvedHide resolved

colesbury reviewed

Mar 17, 2025

View reviewed changes

Copy link

Contributor

colesbury left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I'm concerned about atomicity here (see comment below).

Objects/typeobject.c OutdatedShow resolvedHide resolved

Objects/typeobject.c Outdated

Comment on lines 11314 to 11315

		PyInterpreterState *interp = _PyInterpreterState_GET();
		_PyEval_StopTheWorld(interp);

Copy link

Contributor

colesburyMar 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

_PyEval_StopTheWorld() is always defined and is a no-op in the default build.

Objects/typeobject.c OutdatedShow resolvedHide resolved

Copy link

Contributor

colesbury commentedMar 17, 2025

We might be able to makeBEGIN_TYPE_LOCK() a regular mutex acquisition instead of a critical section, but we'll have to be careful that nothing betweenBEGIN_TYPE_LOCK() andEND_TYPE_LOCK() leads to reentrancy.

nascheme added2 commits

March 26, 2025 00:22

Use mutex rather than critical sections.

ce8536d

For TYPE_LOCK, replace critical sections with a simple mutex.  A numberof changes were needed to avoid deadlocking on reentrancy (trying tore-acquire the mutex). Split _PyType_LookupRefAndVersion() into a lockedand unlocked version.  Remove atomic operations where they are notrequired. Remove some cases of TYPE_LOCK being held that is notrequired.

Merge 'origin/main' intopythongh-127266-type-slots-ts

b97a4b4

nascheme marked this pull request as draft

March 26, 2025 07:49

bedevere-appbot removed the awaiting core review label

Mar 26, 2025

Fix non-debug build.

ca00e74

Copy link

MemberAuthor

nascheme commentedMar 26, 2025

I've convertedTYPE_LOCK from being used as a critical section to being a regular mutex. There was a number of cases of reentrancy and so things got complex. I think I nearly have that sorted out now. Themro_invoke() function needs work yet. At least, unit tests pass without deadlocking. One somewhat ugly part yet is that_PyType_LookupRefAndVersion() got split into two versions with only slight differences between the functions (taking lock if needed vs caller already holding lock or the lock not being needed). Perhaps I can find a way to refactor it to avoid code duplication.

nascheme added3 commits

March 26, 2025 15:45

Improve comments.

6db4542

Merge 'origin/main' intopythongh-127266-type-slots-ts

398ac14

Avoid unused function warning.

75d6b71

nascheme marked this pull request as ready for review

March 27, 2025 15:49

bedevere-appbot added the awaiting core review label

Mar 27, 2025

nascheme requested a review fromcolesbury

March 27, 2025 15:50

colesbury reviewed

Mar 27, 2025

View reviewed changes

Copy link

Contributor

colesbury left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Nice!

I think there's still too much mixing of the type lock and the stop the world. I think data should either be protected by the type lock OR by stop-the-world, but not both.

From reading through this:

type lock:

tp_version_tag
_spec_cache

stop the world:

tp_flags (modifications)
bases (modifications)

Additionally, there's a few functions that are named with_lock_held but are called in a stop-the-world pause (not with the type lock held), which I found a bit confusing.

Objects/typeobject.c OutdatedShow resolvedHide resolved

Tools/tsan/suppressions_free_threading.txt OutdatedShow resolvedHide resolved

Include/object.h OutdatedShow resolvedHide resolved

Objects/typeobject.c Outdated

Comment on lines 3552 to 3554

		types_mutex_unlock();
		new_mro = mro_invoke(type); /* might cause reentrance */
		types_mutex_lock();

Copy link

Contributor

colesburyMar 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This doesn't seem right to me. This can be called with both the type lock held and the world stopped. Ifmro_invoke can call arbitrary code, it needs to start the world around the call.

Copy link

MemberAuthor

naschemeMar 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Yes, that can call a custommro() method on the meta class. The only thing in Lib that does that is atest_descr test but obviously we need to handle it.

Copy link

MemberAuthor

naschemeMar 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I believe this has been resolved. Insidemro_invoke() I either start-the-world before callingmro() (if it's custom) or I unlock the mutex. The latter case happens duringtype_ready().

Objects/typeobject.c OutdatedShow resolvedHide resolved

Copy link

MemberAuthor

nascheme commentedApr 21, 2025

I'm not convinced that the switch of the type lock from a critical section to a plain mutex is profitable here.

Okay, I've reverted back to using a critical section forTYPE_LOCK.

nascheme added7 commits

April 21, 2025 16:45

Invalidate type cache before calling watchers.

a81e9e3

This avoids the watcher callback potentially using invalid cacheentries.

Fixes for type_modified_unlocked().

f5df0c3

* Since type_modified_unlocked() can be called with the world stopped  then we need to re-start if making re-entrant calls (like calling  type watcher callbacks).* Re-order code before type_modified_unlocked().  This fixes a potential  bug if called functions end up re-assigning the type version tag.  If  they do, the cache can contain invalid entries.  We should do the  dictionary update first then set the version to zero.* Replace 'is_new_type' with 'world_stopped' boolean parameter.  This  seems a bit more clear.

Major re-work, TYPE_LOCK protects more things.

7db281c

Rather than trying to update tp_base, tp_bases, and tp_mro while theworld is stopped, hold TYPE_LOCK when updating them.  This is much moresimilar to how the base version of the code works and requires manyfewer changes.  Using stop-the-world turned out to be difficult to makework correctly, mostly due to custom mro() methods but also for someother reasons (for example, functions that potentially re-enter theinterpreter).This re-work requires that we can call stop-the-world inside a criticalsection and have the mutex still held after re-starting.  This seems towork and I've added asserts to confirm it.  Stop-the-world is now onlyused when updating either tp_flags or other type slots and not the threetype members listed above.

Merge 'origin/main' intopythongh-127266-type-slots-ts

da2a0ad

Fix non-debug build.

986f23a

Revert unneeded code changes.

c404ed4

Merge branch 'origin/main' intopythongh-127266-type-slots-ts

55af4ba

Copy link

MemberAuthor

nascheme commentedApr 23, 2025•
edited
Loading

Major rework, reverting back to usingTYPE_LOCK and critical sections to protect tp_mro, tp_bases, tp_base. In a previous version of this PR, I was making updates to those only while the world was stopped. Because we are so limited on what APIs you can call while the world is stopped it is hard to make it work without major overhaul to how types work. Themro() method on the metatype is particularly troublesome but there are other similar problems.

Here is a summary of what's changed vs the main branch:

Update tp_flags and type slots (other than tp_base, tp_bases, and tp_mro) only while the world is stopped. This allows us to read those type fields without atomics and without locking.
Add_Py_TYPE_REVEALED_FLAG as a debug build check to help prevent code that unsafely writes to a type (missing lock or stop-the-world).
Fix a data race with the pname/ptrs "little cache" forresolve_slotdups(). That cache is not safe in the free-threaded build so disable it.
A few code cleanup things that are not strictly required but leftover from the previous version of this PR. I think they improve the code so I left them in.

Updated pyperformance run.

nascheme requested a review fromcolesbury

April 23, 2025 20:18

nascheme added2 commits

April 23, 2025 17:00

Restore comment

0eb77da

Revert more changes.

16f15b2

These are not required as part of this PR and can be reviewedseparately.

colesbury approved these changes

Apr 24, 2025

View reviewed changes

Copy link

Contributor

colesbury left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This LGTM once the issue withtest_opcache is resolved

bedevere-appbot added awaiting merge and removed awaiting core review labels

Apr 24, 2025

nascheme added2 commits

April 24, 2025 19:17

Merge 'origin/main' intopythongh-127266-type-slots-ts

0c328cc

Reduce item list size for a few tests.

64547e9

Now that these slot updates use stop-the-world, these two tests arequite a lot slower.  Reduce size of the items list from 1000 to 100.

kumaraditya303 reviewed

Apr 25, 2025

View reviewed changes

Objects/typeobject.c OutdatedShow resolvedHide resolved

kumaraditya303 reviewed

Apr 25, 2025

View reviewed changes

Objects/typeobject.c OutdatedShow resolvedHide resolved

kumaraditya303 reviewed

Apr 25, 2025

View reviewed changes

Objects/typeobject.c OutdatedShow resolvedHide resolved

nascheme added2 commits

April 28, 2025 12:56

Merge 'origin/main' intopythongh-127266-type-slots-ts

fff1bd2

Minor code tidy.

5672352

naschemeenabled auto-merge (squash)

April 28, 2025 20:26

nascheme merged commite414a2d intopython:main

Apr 28, 2025

48 checks passed

bedevere-appbot removed the awaiting merge label

Apr 28, 2025

sergey-miryanov mentioned this pull request

Apr 29, 2025

gh-132042: Prebuild mro_dict for find_name_in_mro to speedup class creation#132618

Open

nascheme added a commit to nascheme/cpython that referenced this pull request

Apr 29, 2025

Revertpythongh-127266: avoid data races when updating type slots (py…

5292ead

…thongh-131174)This is triggering deadlocks in test_opcache.

nascheme mentioned this pull request

Apr 29, 2025

Deadlock in test_opcache with gh-131174 applied#133130

Closed

nascheme added a commit that referenced this pull request

Apr 29, 2025

Revertgh-127266: avoid data races when updating type slots (gh-131174)…

eecafc3

… (gh-133129)This is triggering deadlocks in test_opcache.  SeeGH-133130 for stack trace.

nascheme mentioned this pull request

Apr 29, 2025

gh-127266: avoid data races when updating type slots v2#133177

Merged

Labels

topic-free-threading

4 participants

Movatterモバイル変換

Uh oh!

gh-127266: avoid data races when updating type slots#131174

gh-127266: avoid data races when updating type slots#131174

Uh oh!

Conversation

nascheme commentedMar 13, 2025• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

kumaraditya303 commentedMar 14, 2025• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

Uh oh!

colesbury left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

colesburyMar 17, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

colesbury commentedMar 17, 2025

Uh oh!

nascheme commentedMar 26, 2025

Uh oh!

colesbury left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

colesburyMar 27, 2025

Choose a reason for hiding this comment

Uh oh!

naschemeMar 28, 2025

Choose a reason for hiding this comment

Uh oh!

naschemeMar 31, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

nascheme commentedApr 21, 2025

Uh oh!

nascheme commentedApr 23, 2025• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

colesbury left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

nascheme commentedMar 13, 2025•
edited
Loading

kumaraditya303 commentedMar 14, 2025•
edited
Loading

nascheme commentedApr 23, 2025•
edited
Loading