gh-134761: Use deferred reference counting for threading concurrency primitives #134762

Open
ZeroIntensity wants to merge 6 commits into python:main from ZeroIntensity:threading-scaling

Conversation

@ZeroIntensity (Member) commented May 26, 2025 (edited):

This scales much better. Using this script:

```python
import threading
import time

lock = threading.Lock()

def scale():
    a = time.perf_counter()
    for _ in range(10000000):
        lock.locked()
    b = time.perf_counter()
    print(b - a, "s")

threads = [threading.Thread(target=scale) for _ in range(8)]
for thread in threads:
    thread.start()
```

With this applied, I see similar performance to if lock was a local variable:

```
0.3701289139999062 s
0.40727080300075613 s
0.41241479399923264 s
0.4155945310003517 s
0.44201267799962807 s
0.4484649369996987 s
0.4601175060006426 s
0.46210344200062536 s
```

Prior to this change, I see:

```
3.425866439999936 s
3.5953266010001244 s
3.6094701500001065 s
3.667731437000157 s
4.458146230000011 s
4.466017671000145 s
4.499206339000011 s
4.50090869099995 s
```

📚 Documentation preview 📚: https://cpython-previews--134762.org.readthedocs.build/

@ZeroIntensity (Member, Author) commented:

I'd appreciate feedback from anyone here. There's no precedent in the standard library for using deferred reference counting at the Python level, so we're in some uncharted waters here. I don't think anyone would have been relying on when threading objects are deallocated, so that shouldn't be an issue. Are there other tradeoffs to using DRC that I'm not aware of?

```diff
@@ -290,6 +290,25 @@ always available. Unless explicitly noted otherwise, all variables are read-only
    This function is specific to CPython. The exact output format is not
    defined here, and may change.

+.. function:: _defer_refcount(op)
```
Review comment (Member):

I didn't take a look overall, but I'm not sure it's worth exposing through the documentation, since this is not a public API.

@ZeroIntensity (Member, Author) replied:

Yeah, I'm a little bit on the fence about it. We do this already for some other functions in sys: _is_gil_enabled, _is_interned, _is_immortal, and a few others. Basically, it's a way to expose implementation details that are useful at a Python level.

Overall, I think this is worth documenting. There's no other way to improve scaling from Python right now, and it's similar to using an unstable C API (in the sense that it could be removed in any version).

Review comment (Contributor):

What about documenting this in https://github.com/python/cpython/tree/main/InternalDocs somewhere? I fully agree with removing this from the normal documentation (for the reasons mentioned), but as a user/developer/tester I find the Python API to _defer_refcount (and others like _is_immortal) useful, and some documentation is always nice.

@corona10 (Member) commented May 27, 2025 (edited):

Personal opinion: I think that we should not expose the free-threaded implementation to the Python level and end users. Exposing this kind of fragmentation at the C API level is enough.

@corona10 (Member) commented:

cc @vstinner @colesbury

@ZeroIntensity (Member, Author) commented:

> Personal opinion: I think that we should not expose the free-threaded implementation to the Python level and end users. Exposing this kind of fragmentation at the C API level is enough.

Reiterating my previous comment: I'm +0.1 for documenting it. I'll happily yield if others are strongly opposed.

@corona10 (Member) commented May 27, 2025 (edited):

I am -1 on documentation. We have a bunch of private Python APIs that are related to the C API, but we don't expose them in the Python documentation, since we don't want people to depend on them.


@Fidget-Spinner (Member) commented:

> Personal opinion: I think that we should not expose the free-threaded implementation to the Python level and end users. Exposing this kind of fragmentation at the C API level is enough.

Same. I truly think we shouldn't be exposing anything that relies on our implementation of refcounting. It's already causing a hard enough time for other implementations, and it even causes problems for ourselves when we try to optimize away refcounting.

@ZeroIntensity (Member, Author) commented:

> I truly think we shouldn't be exposing anything that relies on our implementation of refcounting. It's already causing a hard enough time for other implementations.

In general, I agree. I've removed the documentation note from this PR.

That said, I do think it's important that we do expose and document some (unstable) APIs for working with the current reference counting implementation, because otherwise people will try to hack it together themselves. For example, Nanobind was using silly hacks with immortalization before we exposed some better APIs for messing with reference counting. I think that's much more dangerous.


```diff
@@ -950,6 +950,7 @@ lock_new_impl(PyTypeObject *type)
     if (self == NULL) {
         return NULL;
     }
+    _PyObject_SetDeferredRefcount((PyObject *)self);
```
Review comment (Member):

Can you move the call after initializing self->lock? Same remark for rlock_new_impl() below.

```diff
@@ -2653,6 +2654,23 @@ sys__is_gil_enabled_impl(PyObject *module)
 #endif
 }

+/*[clinic input]
+sys._defer_refcount -> bool
```
Review comment (Member):

Please document it in Doc/library/sys.rst. If it "should not be used", add a clear explanation why it should not be used there. If it's not documented, the lack of documentation doesn't prevent users from using it.

@ZeroIntensity (Member, Author) replied:

See Ken and Donghee's comments.

Review comment (Member):

Well, I disagree with them. IMO we should document sys functions.

Review comment:

I would only be supportive of documenting this if we were allowed to change it in a minor version with no deprecation period. My understanding is that PyUnstable in the C API allows that, but exposing it as sys._x means we are stuck with at least a 2-release deprecation cycle (5 releases recommended). Users should not rely on this function in the first place except in very specific scenarios.

One way to "bypass" this is to make the function a no-op in future versions of Python once we solve this issue altogether. But I don't know what users will rely on by then, so I'm a bit worried.

@ZeroIntensity (Member, Author) replied:

Hmm, I thought we were allowed to change sys._x things in minor versions without deprecation. If not, that's a problem.

@ZeroIntensity (Member, Author) commented:

I think we're getting a bit hung up on this point. We can add or remove the documentation for _defer_refcount later; it's not too important. Does everything else look fine here?

Review comment (Member):

Well, I am preparing a better proposal for this approach. Give me a few hours.

@ZeroIntensity (Member, Author) commented May 28, 2025 (edited):

Ok, cool. Feel free to cc me on it.

Something we also need to consider is whether we want to address this for 3.14. Should this general idea be considered a bugfix or a feature?

Review comment (Member):

I think that this is an improvement rather than a bug fix.

@vstinner (Member) commented:

> With this applied, I see similar performance to if lock was a local variable:

I don't understand your benchmark well. Can you show numbers from before and after this change?

@ZeroIntensity (Member, Author) commented:

> I don't understand your benchmark well. Can you show numbers from before and after this change?

Updated the description. The benchmark is just trying to measure the amount of time spent accessing the object. See also the issue.


@corona10 (Member) commented:

@Fidget-Spinner Out of curiosity, can we just apply deferred refcounting by tracing objects that have a possibility of high contention? Then we don't have to modify objects this way.

@ZeroIntensity (Member, Author) commented:

From my understanding, the issue with automatically applying DRC is that it only lets an object be deallocated during a garbage collection, which breaks code that relies on __del__ (or a weakref finalizer) being called when the object's reference count hits zero.

Maybe we could enable it for objects without __del__, and then disable it if weakref.finalize is called on it?


@Fidget-Spinner (Member) commented:

> @Fidget-Spinner Out of curiosity, can we just apply deferred refcounting by tracing objects that have a possibility of high contention? Then we don't have to modify objects this way.

You mean like tracking objects that are frequently shared and applying it to those objects? It would be quite easy to implement; the only problem is, like Peter said, we can't defer finalizers of things that take up a huge amount of memory and rely on immediate reclamation.

@corona10 (Member) commented May 27, 2025 (edited):

If we can apply it to the conditional case, as Peter said, the behavior change would not be an issue. OTOH, memory consumption also would not be an issue if we can provide an option to disable tracing on devices with low memory. Most production machines have enough memory, so it would not be an issue, IIUC. (Please let me know if it would consume huge amounts of memory.)

@corona10 (Member) commented:

Side note: the reason I am talking about tracing is that I am now starting to fear that we will attach the deferred-refcount API all over the codebase for performance reasons. It would be chaos for us.


@ZeroIntensity (Member, Author) commented:

It's generally a one-line change to an object's initializer, why would that be chaos?

We could also shift the decision to the user and add a generic threading.hot, which would enable deferred reference counting (or maybe immortalize non-GC types), but document it as "might do nothing" for forward compatibility. That would act as a safeguard against people constantly asking for deferred reference counting on their favorite objects.

@corona10 (Member) commented May 27, 2025 (edited):

> It's generally a one-line change to an object's initializer, why would that be chaos?

No: you are now starting to measure microbenchmarks and then applying the change. Can you guarantee that the cases are limited? I can bet that we will find more high-contention cases. What about third-party projects? Should those projects measure high contention and then change their code?

A better solution would be one where we don't have to care about such cases and the interpreter applies it automatically. I am fine with just applying it for this case, but let's discuss a better solution and think about what we can provide.


@mpage (Contributor) commented:

You can work around the scaling issues in the repro pretty easily by passing the lock as an argument to scale:

```python
import threading
import time

lock = threading.Lock()

def scale(lock):
    a = time.perf_counter()
    for _ in range(10000000):
        lock.locked()
    b = time.perf_counter()
    print(b - a, "s")

threads = [threading.Thread(target=scale, args=(lock,)) for _ in range(8)]
for thread in threads:
    thread.start()
```

I think we should be very cautious about exposing implementation details like deferred reference counting.


@ZeroIntensity (Member, Author) commented:

> I think we should be very cautious about exposing implementation details like deferred reference counting.

FWIW, I don't think we're exposing it, per se. Things will "just work" from the user's perspective. I'd really rather the interpreter do it than force weird implementation details, like using locals.

@vstinner (Member) left a comment:

Maybe you can extract the Modules/_threadmodule.c changes into a separate PR, since those changes don't require adding sys._defer_refcount().

@ZeroIntensity (Member, Author) commented:

Ok. _thread is still a documented module; should it be private there too?

@colesbury (Contributor) commented:

I don't really see the motivation for these particular classes. It seems unlikely to me that someone is checking lock.locked() simultaneously from multiple threads without also acquiring and releasing the lock, which itself will be a bottleneck. Maybe this makes sense for threading.Event()?

I agree with @mpage that we should be cautious about exposing implementation details. The motivation here doesn't seem sufficient to me to overcome that caution.

@ZeroIntensity (Member, Author) commented:

lock.locked() was just an example to measure the reference count contention and not the internal PyMutex contention. Event is fixed here already too. In practice, people will be doing with lock or whatever, but that will still have the same refcount contention.

> I agree with @mpage that we should be cautious about exposing implementation details. The motivation here doesn't seem sufficient to me to overcome that caution.

I don't think we're exposing any actual implementation details, right? The reference count is still hidden from threading users. sys._defer_refcount is a different story, but I'm going to move that anyway.

Reviewers

vstinner, eendebakpt, corona10, and Fidget-Spinner left review comments. Awaiting requested review from colesbury and from ericsnowcurrently (code owner).


7 participants: @ZeroIntensity, @corona10, @Fidget-Spinner, @vstinner, @mpage, @colesbury, @eendebakpt
