Uh oh!
There was an error while loading.Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork32k
gh-115999: Implement thread-local bytecode and enable specialization forBINARY_OP
#123926
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Uh oh!
There was an error while loading.Please reload this page.
Conversation
- Fix a few places where we were not using atomics to (de)instrument opcodes.- Fix a few places where we weren't using atomics to reset adaptive counters.- Remove some redundant non-atomic resets of adaptive counters that presumably snuck as merge artifacts ofpython#118064 andpython#117144 landing close together.
…entation using atomics
Read the opcode atomically, the interpreter may be specializing it
@markshannon - Would you take a look at this, please? |
I'm still concerned about not counting the tlbc memory blocks in the refleaks test. Maybe you could count them separately, and still check that there aren't too many leaked, but be a bit more relaxed about the counts for tlbc than for other blocks? |
!buildbot nogil refleak |
bedevere-bot commentedOct 24, 2024
🤖 New build scheduled with the buildbot fleet by@mpage for commit07f9140 🤖 The command will test the builders whose names match following regular expression: The builders matched are:
|
@markshannon - That would work, but I opted for clearing the cached TLBC for threads that aren't currently in use when we clear other internal caches. This should still catch leaks, doesn't require modifying |
Uh oh!
There was an error while loading.Please reload this page.
# code objects is a large fraction of the total number of | ||
# references, this can cause the total number of allocated | ||
# blocks to exceed the total number of references. | ||
if not support.Py_GIL_DISABLED: |
markshannonOct 29, 2024 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Now that we can free the unused tlbcs, can we replace this withsys._clear_internal_caches()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Unfortunately, no. It seems to be very sensitive to which kinds of objects are on the heap as well as the number of non reference counted allocations (blocks) per object. With the introduction of TLBC there is at least one additional block allocated per code object that is not reference counted, the _PyCodeArray, which is present even if we free the unused TLBCs. Its presence is enough to trigger the assertion.
This assertion feels pretty brittle and I'd be in favor of removing it, but that's probably worth doing in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Maybe replace it with a more meaningful test rather than remove it. But in another PR.
markshannon left a comment• edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Looks good.
One question. Can we prefix the test for leaking blocks withsys._clear_internal_caches()
instead of making it conditional on not using free-threading?
mpage commentedOct 29, 2024 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
@markshannon - Unfortunately that doesn't help. See my reply inline. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I still have concerns about memory use, but we can iterate on that in subsequent PRs.
JIT failures appear on main and are unrelated to this PR |
2e95c5b
intopython:mainUh oh!
There was an error while loading.Please reload this page.
After this PR was merged, |
Looks like the problem is only in --enable-shared builds, and it's because we're now looking up _PyInterpreterFrame too early (before the .so file is loaded). I'll have a fix in a few minutes. |
PR#126440 should fix the failure. |
…ation for `BINARY_OP` (python#123926)Each thread specializes a thread-local copy of the bytecode, created on the first RESUME, in free-threaded builds. All copies of the bytecode for a code object are stored in the co_tlbc array on the code object. Threads reserve a globally unique index identifying its copy of the bytecode in all co_tlbc arrays at thread creation and release the index at thread destruction. The first entry in every co_tlbc array always points to the "main" copy of the bytecode that is stored at the end of the code object. This ensures that no bytecode is copied for programs that do not use threads.Thread-local bytecode can be disabled at runtime by providing either -X tlbc=0 or PYTHON_TLBC=0. Disabling thread-local bytecode also disables specialization.Concurrent modifications to the bytecode made by the specializing interpreter and instrumentation use atomics, with specialization taking care not to overwrite an instruction that was instrumented concurrently.
…ation for `BINARY_OP` (python#123926)Each thread specializes a thread-local copy of the bytecode, created on the first RESUME, in free-threaded builds. All copies of the bytecode for a code object are stored in the co_tlbc array on the code object. Threads reserve a globally unique index identifying its copy of the bytecode in all co_tlbc arrays at thread creation and release the index at thread destruction. The first entry in every co_tlbc array always points to the "main" copy of the bytecode that is stored at the end of the code object. This ensures that no bytecode is copied for programs that do not use threads.Thread-local bytecode can be disabled at runtime by providing either -X tlbc=0 or PYTHON_TLBC=0. Disabling thread-local bytecode also disables specialization.Concurrent modifications to the bytecode made by the specializing interpreter and instrumentation use atomics, with specialization taking care not to overwrite an instruction that was instrumented concurrently.
Uh oh!
There was an error while loading.Please reload this page.
This PR implements the foundational work necessary for making the specializing interpreter thread-safe in free-threaded builds and enables specialization for
BINARY_OP
as an end-to-end example. To enable future incremental work, specialization can now be toggled on a per-family basis. Subsequent PRs will enable specialization in free-threaded builds for the remaining families.Each thread specializes a thread-local copy of the bytecode, created on the first RESUME, in free-threaded builds. All copies of the bytecode for a code object are stored in the
co_tlbc
array on the code object. Threads reserve a globally unique index identifying its copy of the bytecode in allco_tlbc
arrays at thread creation and release the index at thread destruction. The first entry in everyco_tlbc
array always points to the "main" copy of the bytecode that is stored at the end of the code object. This ensures that no bytecode is copied for programs that do not use threads.Thread-local bytecode can be disabled at runtime by providing either
-X tlbc=0
orPYTHON_TLBC=0
. Disabling thread-local bytecode also disables specialization.Concurrent modifications to the bytecode made by the specializing interpreter and instrumentation use atomics, with specialization taking care not to overwrite an instruction that was instrumented concurrently.
--disable-gil
builds #115999