gh-115999: Implement thread-local bytecode and enable specialization for `BINARY_OP` #123926

Merged

mpage merged 84 commits into python:main from mpage:gh-115999-thread-local-bytecode

Nov 4, 2024

Contributor

mpage commented Sep 10, 2024

This PR implements the foundational work necessary for making the specializing interpreter thread-safe in free-threaded builds and enables specialization for `BINARY_OP` as an end-to-end example. To enable future incremental work, specialization can now be toggled on a per-family basis. Subsequent PRs will enable specialization in free-threaded builds for the remaining families.

In free-threaded builds, each thread specializes a thread-local copy of the bytecode, created on the first `RESUME`. All copies of the bytecode for a code object are stored in the `co_tlbc` array on the code object. Each thread reserves a globally unique index identifying its copy of the bytecode in all `co_tlbc` arrays at thread creation and releases the index at thread destruction. The first entry in every `co_tlbc` array always points to the "main" copy of the bytecode that is stored at the end of the code object. This ensures that no bytecode is copied for programs that do not use threads.
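The indexing scheme described above can be modelled in a few lines of Python. This is only a sketch of the idea, not the C implementation: `TLBCIndexPool` and `CodeObjectModel` are hypothetical names, the assumption that the first thread to reserve an index receives slot 0 is mine, and the sketch copies whatever slot 0 currently holds, which may differ from what the interpreter does.

```python
import threading


class TLBCIndexPool:
    """Hypothetical model of the global pool of per-thread bytecode indices."""

    def __init__(self):
        self._lock = threading.Lock()
        self._free = []   # indices released by threads that have exited
        self._next = 0    # assumption: the first reserver (main thread) gets 0

    def reserve(self):
        # Called at (modelled) thread creation.
        with self._lock:
            if self._free:
                return self._free.pop()
            index = self._next
            self._next += 1
            return index

    def release(self, index):
        # Called at (modelled) thread destruction so the index can be reused.
        with self._lock:
            self._free.append(index)


class CodeObjectModel:
    """Hypothetical stand-in for a code object with a co_tlbc array."""

    def __init__(self, bytecode):
        self._lock = threading.Lock()
        # Slot 0 is the "main" copy; a program that never leaves
        # index 0 never triggers a copy.
        self.co_tlbc = [bytearray(bytecode)]

    def bytecode_for(self, index):
        # Models the lazy copy made on the first RESUME for this index.
        with self._lock:
            while len(self.co_tlbc) <= index:
                self.co_tlbc.append(None)
            if self.co_tlbc[index] is None:
                self.co_tlbc[index] = bytearray(self.co_tlbc[0])
            return self.co_tlbc[index]
```

Because every code object uses the same per-thread index, a thread needs only one reserved integer to locate its copy in any `co_tlbc` array.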

Thread-local bytecode can be disabled at runtime by providing either `-X tlbc=0` or `PYTHON_TLBC=0`. Disabling thread-local bytecode also disables specialization.
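As an illustration of the two switches, the helper below builds the command line and environment for each form without launching anything. `tlbc_disabled_invocations` is a hypothetical name introduced for this sketch; only the `-X tlbc=0` flag and the `PYTHON_TLBC=0` variable come from the description above.

```python
import os
import sys


def tlbc_disabled_invocations(script_args, python=sys.executable, base_env=None):
    """Return (argv, env) pairs for the two ways of disabling TLBC.

    Hypothetical helper: it only assembles arguments and does not
    start a process or check that the interpreter is free-threaded.
    """
    env = dict(os.environ if base_env is None else base_env)

    # Form 1: pass the -X option on the command line.
    via_flag = ([python, "-X", "tlbc=0", *script_args], dict(env))

    # Form 2: set the environment variable instead.
    env_off = dict(env)
    env_off["PYTHON_TLBC"] = "0"
    via_env = ([python, *script_args], env_off)

    return via_flag, via_env
```

Either pair can be passed to `subprocess.run(argv, env=env)` to compare a run with thread-local bytecode disabled against a default run.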

Concurrent modifications to the bytecode made by the specializing interpreter and instrumentation use atomics, with specialization taking care not to overwrite an instruction that was instrumented concurrently.
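One way to picture "taking care not to overwrite an instruction that was instrumented concurrently" is a compare-and-swap on the opcode. The sketch below emulates an atomic cell with a lock, since Python has no atomics; `AtomicCell` and `try_specialize` are hypothetical names, the opcode strings are used only as labels, and the actual C code may be structured differently.

```python
import threading


class AtomicCell:
    """Lock-based emulation of an atomically accessed opcode slot."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def store(self, value):
        with self._lock:
            self._value = value

    def compare_exchange(self, expected, new):
        # Write `new` only if the slot still holds `expected`.
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True


def try_specialize(cell, generic, specialized):
    """Replace `generic` with `specialized` unless the slot has changed.

    If instrumentation rewrote the opcode between the load and the
    exchange, the exchange fails and the instrumented opcode survives.
    """
    observed = cell.load()
    if observed != generic:
        return False
    return cell.compare_exchange(observed, specialized)
```

In this model a failed exchange simply drops the specialization attempt, so the instrumented instruction is never clobbered.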

Issue: Make the specializing interpreter thread-safe in `--disable-gil` builds #115999

mpage added 30 commits

September 10, 2024 13:24

Assign threads indices into bytecode copies

776a1e1

Replace most usage of PyCode_CODE

2b40870

Get bytecode copying working

344d7ad

Refactor remove_tools

f203d00

Refactor remove_line_tools

82b456a

Instrument thread-local bytecode

b021704

Use locks for instrumentation

aea69c5

Add ifdef guards for each specialization family

552277d

Specialize BINARY_OP

50a6089

Limit the amount of memory consumed by bytecode copies

3f1d941

Make thread-local bytecode limits user configurable

7d2eb27

Fix a few data races when (de)instrumenting opcodes

d5476b9

- Fix a few places where we were not using atomics to (de)instrument opcodes.
- Fix a few places where we weren't using atomics to reset adaptive counters.
- Remove some redundant non-atomic resets of adaptive counters that presumably snuck in as merge artifacts of python#118064 and python#117144 landing close together.

Make branch taken recording thread-safe

e3b367a

Lock thread-local bytecode when specializing

b2375bf

Load bytecode on RESUME_CHECK

2707f8e

Load tlbc on generator.throw()

3fdcb28

Use tlbc instead of thread_local_bytecode

4a55ce5

Use tlbc everywhere

8b3ff60

Explicitly manage tlbc state

862afa1

Refactor API for fetching tlbc

0b4d952

Add unit tests

7795e99

Fix initconfig in default build

693a4cc

Fix instrumentation in default build

b43531e

Synchronize bytecode modifications between specialization and instrumentation using atomics

9025f43

Add a high-level comment

c44c7d9

Fix unused variable warning in default build

e2a6656

Fix test_config in free-threaded builds

e6513d1

Fix formatting

a18396f

Remove comment

81fe1a2

Fix data race in _PyInstruction_GetLength

837645e

Read the opcode atomically, the interpreter may be specializing it

Contributor Author

mpage commented Oct 22, 2024

@markshannon - Would you take a look at this, please?

Member

markshannon commented Oct 23, 2024

I'm still concerned about not counting the tlbc memory blocks in the refleaks test.

Maybe you could count them separately, and still check that there aren't too many leaked, but be a bit more relaxed about the counts for tlbc than for other blocks?

mpage added 3 commits

October 23, 2024 11:36

Merge branch 'main' into gh-115999-thread-local-bytecode

176b24e

Clear TLBC when other caches are cleared

c107495

Remove _get_tlbc_blocks

07f9140

Contributor Author

mpage commented Oct 24, 2024

!buildbot nogil refleak

bedevere-bot commented Oct 24, 2024

🤖 New build scheduled with the buildbot fleet by @mpage for commit 07f9140 🤖

The command will test the builders whose names match the following regular expression: nogil refleak

The builders matched are:

AMD64 CentOS9 NoGIL Refleaks PR
AMD64 Fedora Rawhide NoGIL refleaks PR
aarch64 Fedora Rawhide NoGIL refleaks PR
PPC64LE Fedora Rawhide NoGIL refleaks PR

Contributor Author

mpage commented Oct 24, 2024

> I'm still concerned about not counting the tlbc memory blocks in the refleaks test.
> Maybe you could count them separately, and still check that there aren't too many leaked, but be a bit more relaxed about the counts for tlbc than for other blocks?

@markshannon - That would work, but I opted for clearing the cached TLBC for threads that aren't currently in use when we clear other internal caches. This should still catch leaks, doesn't require modifying refleaks.py, and is the same approach we use for tier2. Please have a look.
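For readers unfamiliar with the mechanism, a rough sketch of "clear caches, then count blocks" follows. It is not how refleaks.py is implemented; `allocated_blocks_after_clearing_caches` is a hypothetical helper, and it assumes that `sys._clear_internal_caches()` is the cache-clearing entry point meant here, falling back to a no-op on interpreters that lack it.

```python
import gc
import sys


def allocated_blocks_after_clearing_caches():
    """Clear internal caches (when supported), then count allocated blocks.

    Hypothetical helper for illustration only.
    """
    clear = getattr(sys, "_clear_internal_caches", None)
    if clear is not None:
        # Assumption: with this change, cached TLBC for threads that are
        # not currently in use is freed along with the other caches.
        clear()
    gc.collect()
    return sys.getallocatedblocks()
```

Comparing the returned value before and after a repeated workload gives a coarse signal of leaked blocks once cached copies are out of the picture.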

Yhg1s reviewed

Oct 25, 2024

Lib/test/test_sys.py (resolved)

markshannon reviewed

Oct 29, 2024

Lib/test/test_sys.py

		# code objects is a large fraction of the total number of
		# references, this can cause the total number of allocated
		# blocks to exceed the total number of references.
		if not support.Py_GIL_DISABLED:

Member

markshannon Oct 29, 2024

Now that we can free the unused tlbcs, can we replace this with sys._clear_internal_caches()?

Contributor Author

mpage Oct 29, 2024

Unfortunately, no. It seems to be very sensitive to which kinds of objects are on the heap as well as the number of non-reference-counted allocations (blocks) per object. With the introduction of TLBC there is at least one additional block allocated per code object that is not reference counted, the _PyCodeArray, which is present even if we free the unused TLBCs. Its presence is enough to trigger the assertion.

This assertion feels pretty brittle and I'd be in favor of removing it, but that's probably worth doing in a separate PR.

Member

markshannon Oct 29, 2024

Maybe replace it with a more meaningful test rather than remove it. But in another PR.

markshannon reviewed

Oct 29, 2024

Member

markshannon left a comment

Looks good.

One question. Can we prefix the test for leaking blocks with sys._clear_internal_caches() instead of making it conditional on not using free-threading?

Contributor Author

mpage commented Oct 29, 2024

> One question. Can we prefix the test for leaking blocks with sys._clear_internal_caches() instead of making it conditional on not using free-threading?

@markshannon - Unfortunately that doesn't help. See my reply inline.

markshannon self-requested a review

October 29, 2024 16:54

markshannon approved these changes

Oct 29, 2024

Member

markshannon left a comment

I still have concerns about memory use, but we can iterate on that in subsequent PRs.

mpage added 4 commits

October 30, 2024 09:55

Merge branch 'main' into gh-115999-thread-local-bytecode

4cbe237

Rename _PyCode_InitCounters back to _PyCode_Quicken

38ff315

We are quickening LOAD_CONST now.

Merge branch 'main' into gh-115999-thread-local-bytecode

338f7e5

Merge branch 'main' into gh-115999-thread-local-bytecode

bcd1bb2

Contributor Author

mpage commented Nov 4, 2024

JIT failures appear on main and are unrelated to this PR

mpage merged commit 2e95c5b into python:main

Nov 4, 2024

51 of 57 checks passed

bedevere-appbot removed the awaiting merge label

Nov 4, 2024

mpage deleted the gh-115999-thread-local-bytecode branch

November 4, 2024 19:14

Member

encukou commented Nov 5, 2024

After this PR was merged, test_gdb started failing; see e.g. https://buildbot.python.org/#/builders/506/builds/9149
Do you think it's possible to fix this in a day?

Member

Yhg1s commented Nov 5, 2024

Looks like the problem is only in --enable-shared builds, and it's because we're now looking up _PyInterpreterFrame too early (before the .so file is loaded). I'll have a fix in a few minutes.

Member

Yhg1s commented Nov 5, 2024

PR #126440 should fix the failure.

Yhg1s mentioned this pull request

Nov 6, 2024

gh-115999: Add free-threaded specialization for COMPARE_OP #126410

Merged

devdanzin mentioned this pull request

Nov 10, 2024

_interpreters is not thread safe on the free-threaded build #126644

Closed

ZeroIntensity mentioned this pull request

Nov 11, 2024

gh-126644: Fix various thread safety issues in _interpreters #126696

Closed

picnixz pushed a commit to picnixz/cpython that referenced this pull request

Dec 8, 2024

pythongh-115999: Implement thread-local bytecode and enable specialization for `BINARY_OP` (python#123926)

6d75ff7

ebonnal pushed a commit to ebonnal/cpython that referenced this pull request

Jan 12, 2025

pythongh-115999: Implement thread-local bytecode and enable specialization for `BINARY_OP` (python#123926)

d363da4